Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: not amd64 specific - disk failure
Date: Tue, 20 Nov 2007 09:51:07
Message-Id: pan.2007.11.20.09.41.54@cox.net
In Reply to: Re: [gentoo-amd64] Re: not amd64 specific - disk failure by Raffaele BELARDI
Raffaele BELARDI <raffaele.belardi@××.com> posted 47429114.5030102@××.com,
excerpted below, on Tue, 20 Nov 2007 08:47:32 +0100:

> So my hypothesis is that the bad blocks or sectors at the beginning of
> the partition were not copied, or only partly copied, by dd, and due to
> this the superblocks are all shifted down. Although I don't like to
> access again the hw, maybe I should try: # dd conv=noerror,sync bs=4096
> if=/dev/hdb of=/mnt/disk_500/sdb.img
>
> to get an aligned image. Problem is I don't know what bs= should be.
> Block size, so 4k?
>
> Any other option I might have?

This sounds reasonable.  I run reiserfs here and don't know a whole lot
about ext2/3/4, so I won't even attempt an opinion at that level of
detail.  (That's why I left the actual recovery procedure, after creating
a copy to work with, so vague... I wasn't going to try to go there.)

However, I can say this.  Based on my experience with recovery on
reiserfs (and in fact on reiserfs and dd-rescue recovery notes, so it's
not just me), the block size doesn't necessarily have to match the
filesystem's: dd copies the data over "raw" and in order, so the data it
can read it gets, and the data it can't... well.  What the block size
DOES affect is how much data is operated on at once -- when dd hits bad
blocks, that's the unit that determines the amount of missing data.

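A quick way to see what conv=sync does to alignment is to run dd on an
ordinary file (file names here are made up for the demo; on the real
rescue the input would of course be the failing device):

```shell
# conv=noerror keeps dd going past read errors; conv=sync pads every
# short or failed read out to a full block with NULs, so the image
# stays block-for-block aligned with the source.
printf 'hello' > /tmp/demo_src          # 5 bytes: less than one block
dd if=/tmp/demo_src of=/tmp/demo_img bs=4096 conv=noerror,sync 2>/dev/null
stat -c %s /tmp/demo_img                # padded out to a full 4096-byte block
```

Without conv=sync, a failed or short read would shrink the output and
shift everything after it, which is exactly the misalignment suspected
above.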
Working on a good disk, a relatively large block size (as long as it can
be buffered in memory) is often more efficient, that is, faster, because
the big blocks mean lower processing overhead.  On a partially bad disk,
larger blocks will still let it cover the good areas faster (but that's
trivial time anyway, compared to the time spent trying to access the bad
blocks), AND because the block size is larger, it SHOULD mean fewer bad
blocks to try and try and try before giving up in the bad areas too, so
faster there as well.

The flip side to the faster access over the bad areas is that, as I said,
the block is the chunk that gets declared bad, so the larger the block
size you choose, the more potentially recoverable data gets thrown out
every time an entire block is declared bad.

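That trade-off is why rescue tools like GNU ddrescue work in passes --
big blocks first, then small-block retries over just the regions that
failed, tracked in a map file.  A minimal file-based sketch of the same
idea with plain dd (the offsets and file names are made up for
illustration; a real second pass would target the regions dd reported
errors on):

```shell
# First pass: copy the whole source with large 4 KiB blocks.
dd if=/dev/urandom of=/tmp/src.bin bs=4096 count=4 2>/dev/null
dd if=/tmp/src.bin of=/tmp/img.bin bs=4096 conv=noerror,sync 2>/dev/null

# Second pass: suppose block 2 (offset 8192) had read errors the first
# time.  Re-read just that 4 KiB region with 512-byte blocks; skip and
# seek are counted in units of bs, and conv=notrunc leaves the rest of
# the image untouched.
dd if=/tmp/src.bin of=/tmp/img.bin bs=512 skip=16 seek=16 count=8 \
   conv=notrunc,noerror,sync 2>/dev/null

cmp -s /tmp/src.bin /tmp/img.bin && echo "image matches source"
```

The small-block retry shrinks the zeroed-out area around each bad sector
from one big block down to one small block, at the cost of many more
(slow) read attempts in the damaged region.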
As for working off the bad disk vs. working off an image of it: as long
as you can continue to recover data off the bad disk, you can keep trying
to use it.  The problem, of course, is that every access might be your
last, and it's also possible that each pass through loses a few more
blocks of data at the margin.

So it's up to you.  The aligned image will certainly be easier to work
with, but you might not be able to get the same amount of valid data off.

... You never mentioned exactly what happened to the disk.  Mine was
overheating.  I live in Phoenix, AZ, and my AC went out in the middle of
the summer, with me gone and the computer left running.  With outside
temps often reaching close to 50 C (122 F), the temps inside with the AC
off could have easily reached 60 C (140 F).  Ambient case air temps could
therefore have reached 70 C, and with the drive spinning in that... one
can only guess what temps it reached!

Well, rather obviously, the platters expanded and the heads crashed,
grooving out a circle in the platter at whatever location they happened
to be at the time, plus wherever the still-operating system told the
heads to seek to.  However, once I came home and realized what had
happened, I shut down and let everything cool off.  After replacing the
AC, with everything running at normal temps again, I was able to boot
back up.

I ended up with two separate heavily damaged areas in which I could
recover little if anything, but fortunately, the partition table and
superblocks were intact.  I had also been keeping backup partition copies
of most of my valuable stuff, partition by partition, and was able to
recover most of it from those (barring the new stuff since my last
backup, which was longer ago than it should have been), since they had
been unmounted at the time and therefore didn't have the heads seeking
into them, only across them a few times.

Actually, perhaps surprisingly, I was able to run those disks for some
time without any known additional damage.  I did switch disks as soon as
possible, because I was leery of continuing to depend on the partially
bad ones, but in the meantime, I just wrote off the affected partitions
as dead and continued to use the others without issue.  In fact, I still
have the disk, and might still be using it for extra storage, except that
it was the second disk I had lost in two years (looking back, the one I'd
lost the previous year was probably heat related as well, as it had the
same failure pattern, and the AC wasn't doing so well even then), and I
decided to switch to RAID and go with slower but longer-warranty (5 yr)
Seagate drives.  Those are now going into their third year without issue
(and with a new AC with cooling capacity to spare, so hopefully it'll be
several years before I need to worry about /that/ issue again), but at
least now I have the RAID backing me up, with most of the system on
kernel/md RAID-6, so I can lose up to two of the four drives and maintain
data integrity.  I am, however, already thinking about how I'll do it
better next time, now that I've a bit of RAID experience under my belt.
=8^)

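For the curious, a four-member kernel/md RAID-6 like the one above is
created with something along these lines (the device names are
placeholders, not my actual layout):

```shell
# Hypothetical mdadm invocation (needs root and four real block devices):
#   mdadm --create /dev/md0 --level=6 --raid-devices=4 \
#         /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
# RAID-6 stores two independent parity blocks per stripe, so with N
# members it survives any 2 failures and leaves N-2 members of capacity.
members=4
parity=2
usable=$((members - parity))
echo "survives $parity failures; usable capacity: $usable of $members members"
```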
So anyway, if it was heat related, chances are pretty decent it'll remain
relatively stable, with no additional data loss, as long as you keep a
pretty strict watch on the temps and don't let it overheat again.  That
was my experience this last time, when I know it was heat related, and
the time before, which had the same failure pattern, so I'm guessing it
was heat related too.  Of course, you never can tell, but that has been
my experience with heat-related disk failures, anyway.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
gentoo-amd64@g.o mailing list

Replies

Subject Author
Re: [gentoo-amd64] Re: not amd64 specific - disk failure Raffaele BELARDI <raffaele.belardi@××.com>