Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: not amd64 specific - disk failure
Date: Tue, 20 Nov 2007 09:51:07
Message-Id: pan.2007.11.20.09.41.54@cox.net
In Reply to: Re: [gentoo-amd64] Re: not amd64 specific - disk failure by Raffaele BELARDI
Raffaele BELARDI <raffaele.belardi@××.com> posted 47429114.5030102@××.com,
excerpted below, on Tue, 20 Nov 2007 08:47:32 +0100:

> So my hypothesis is that the bad blocks or sectors at the beginning of
> the partition were not copied, or only partly copied, by dd, and due to
> this the superblocks are all shifted down. Although I don't like to
> access again the hw, maybe I should try: # dd conv=noerror,sync bs=4096
> if=/dev/hdb of=/mnt/disk_500/sdb.img
>
> to get an aligned image. Problem is I don't know what bs= should be.
> Block size, so 4k?
>
> Any other option I might have?

This sounds reasonable.  I run reiserfs here and don't know a whole lot
about ext2/3/4, so I won't even attempt an opinion at that level of
detail.  (That's why I left the actual recovery procedure, after creating
a copy to work with, so vague... I wasn't going to try to go there.)

However, I can say this.  Based on my experience with recovery on
reiserfs (and in fact on reiserfs and dd-rescue recovery notes, so it's
not just me), the block size doesn't necessarily have to match the
filesystem's: dd copies the data over "raw" and in order, so the data it
can read it gets, and the data it can't... well.  What the block size
DOES affect is how much data is operated on at once -- when dd hits bad
blocks, that's the unit that determines the amount of missing data.

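A quick way to see what conv=sync does to alignment is to run dd on an
ordinary file (file names here are made up for the demo; on the real
rescue the input would of course be the failing device):

```shell
# conv=noerror keeps dd going past read errors; conv=sync pads every
# short or failed read out to a full block with NULs, so the image
# stays block-for-block aligned with the source.
printf 'hello' > /tmp/demo_src          # 5 bytes: less than one block
dd if=/tmp/demo_src of=/tmp/demo_img bs=4096 conv=noerror,sync 2>/dev/null
stat -c %s /tmp/demo_img                # padded out to a full 4096-byte block
```

Without conv=sync, a failed or short read would shrink the output and
shift everything after it, which is exactly the misalignment suspected
above.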
Working on a good disk, a relatively large block size (as long as it can
be buffered in memory) is often more efficient, that is, faster, because
the big blocks mean lower processing overhead.  On a partially bad disk,
larger blocks will still let it cover the good areas faster (but that's
trivial time anyway, compared to the time spent trying to access the bad
blocks), AND because the block size is larger, it SHOULD mean fewer bad
blocks to try and try and try before giving up in the bad areas too, so
faster there as well.

The flip side to the faster access over the bad areas is that, as I said,
the block is the chunk that gets declared bad, so the larger the block
size you choose, the more potentially recoverable data gets thrown out
every time an entire block is declared bad.

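That trade-off is why rescue tools like GNU ddrescue work in passes --
big blocks first, then small-block retries over just the regions that
failed, tracked in a map file.  A minimal file-based sketch of the same
idea with plain dd (the offsets and file names are made up for
illustration; a real second pass would target the regions dd reported
errors on):

```shell
# First pass: copy the whole source with large 4 KiB blocks.
dd if=/dev/urandom of=/tmp/src.bin bs=4096 count=4 2>/dev/null
dd if=/tmp/src.bin of=/tmp/img.bin bs=4096 conv=noerror,sync 2>/dev/null

# Second pass: suppose block 2 (offset 8192) had read errors the first
# time.  Re-read just that 4 KiB region with 512-byte blocks; skip and
# seek are counted in units of bs, and conv=notrunc leaves the rest of
# the image untouched.
dd if=/tmp/src.bin of=/tmp/img.bin bs=512 skip=16 seek=16 count=8 \
   conv=notrunc,noerror,sync 2>/dev/null

cmp -s /tmp/src.bin /tmp/img.bin && echo "image matches source"
```

The small-block retry shrinks the zeroed-out area around each bad sector
from one big block down to one small block, at the cost of many more
(slow) read attempts in the damaged region.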
As for working off the bad disk vs. working off an image of it: as long
as you can continue to recover data off the bad disk, you can keep trying
to use it.  The problem, of course, is that every access might be your
last, and it's also possible that each pass through loses a few more
blocks of data at the margin.

So it's up to you.  The aligned image will certainly be easier to work
with, but you might not be able to get the same amount of valid data off.

... You never mentioned exactly what happened to the disk.  Mine was
overheating.  I live in Phoenix, AZ, and my AC went out in the middle of
the summer, with me gone and the computer left running.  With outside
temps often reaching close to 50 C (122 F), the temps inside with the AC
off could have easily reached 60 C (140 F).  Ambient case air temps could
therefore have reached 70 C, and with the drive spinning in that... one
can only guess what temps it reached!

Well, rather obviously, the platters expanded and the heads crashed,
grooving out a circle in the platter at whatever location they happened
to be at the time, plus wherever the still-operating system told the
heads to seek to.  However, once I came home and realized what had
happened, I shut down and let everything cool off.  After replacing the
AC, with everything running at normal temps again, I was able to boot
back up.

I ended up with two separate heavily damaged areas in which I could
recover little if anything, but fortunately, the partition table and
superblocks were intact.  I had also been keeping backup partition copies
of most of my valuable stuff, partition by partition, and was able to
recover most of it from those (barring the new stuff since my last
backup, which was longer ago than it should have been), since they had
been unmounted at the time and therefore didn't have the heads seeking
into them, only across them a few times.

Actually, perhaps surprisingly, I was able to run those disks for some
time without any known additional damage.  I did switch disks as soon as
possible, because I was leery of continuing to depend on the partially
bad ones, but in the meantime, I just wrote off the affected partitions
as dead and continued to use the others without issue.  In fact, I still
have the disk, and might still be using it for extra storage, except that
it was the second disk I had lost in two years (looking back, the one I'd
lost the previous year was probably heat related as well, as it had the
same failure pattern, and the AC wasn't doing so well even then), and I
decided to switch to RAID and go with slower but longer-warranty (5 yr)
Seagate drives.  Those are now going into their third year without issue
(and with a new AC with cooling capacity to spare, so hopefully it'll be
several years before I need to worry about /that/ issue again), but at
least now I have the RAID backing me up, with most of the system on
kernel/md RAID-6, so I can lose up to two of the four drives and maintain
data integrity.  I am, however, already thinking about how I'll do it
better next time, now that I've a bit of RAID experience under my belt.
=8^)

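For the curious, a four-member kernel/md RAID-6 like the one above is
created with something along these lines (the device names are
placeholders, not my actual layout):

```shell
# Hypothetical mdadm invocation (needs root and four real block devices):
#   mdadm --create /dev/md0 --level=6 --raid-devices=4 \
#         /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
# RAID-6 stores two independent parity blocks per stripe, so with N
# members it survives any 2 failures and leaves N-2 members of capacity.
members=4
parity=2
usable=$((members - parity))
echo "survives $parity failures; usable capacity: $usable of $members members"
```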
So anyway, if it was heat related, chances are pretty decent it'll remain
relatively stable, with no additional data loss, as long as you keep a
pretty strict watch on the temps and don't let it overheat again.  That
was my experience this last time, when I know it was heat related, and
the time before, which had the same failure pattern, so I'm guessing it
was heat related too.  Of course, you never can tell, but that has been
my experience with heat-related disk failures, anyway.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
gentoo-amd64@g.o mailing list

Replies

Subject Author
Re: [gentoo-amd64] Re: not amd64 specific - disk failure Raffaele BELARDI <raffaele.belardi@××.com>