Raffaele BELARDI <raffaele.belardi@××.com> posted 47429114.5030102@××.com,
excerpted below, on Tue, 20 Nov 2007 08:47:32 +0100:
|
> So my hypothesis is that the bad blocks or sectors at the beginning of
> the partition were not copied, or only partly copied, by dd, and due to
> this the superblocks are all shifted down. Although I don't like to
> access again the hw, maybe I should try: # dd conv=noerror,sync bs=4096
> if=/dev/hdb of=/mnt/disk_500/sdb.img
>
> to get an aligned image. Problem is I don't know what bs= should be.
> Block size, so 4k?
>
> Any other option I might have?
|
This sounds reasonable. I run reiserfs here and don't know a whole lot
about ext2/3/4, so I won't even attempt an opinion at that level of
detail. (That's why I left the actual recovery procedure, after creating
the copy to work with, so vague... I wasn't going to try to go there.)
|
However, I can say this. Based on my experience with recovery on
reiserfs (and in fact on reiserfs and dd-rescue recovery notes, so it's
not just me), the block size doesn't necessarily have to match the
filesystem's, since the copy is made "raw": the data it gets, it gets,
and the data it doesn't, well... It also keeps everything in the same
serial order, so that's not an issue. What the block size DOES affect is
how much data is operated on at once -- when dd hits bad blocks, that's
the unit that determines the amount of missing data.
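To illustrate with the quoted command above (same device and image
paths), a run at 4 KiB granularity would be:

  # dd if=/dev/hdb of=/mnt/disk_500/sdb.img bs=4096 conv=noerror,sync

conv=noerror keeps dd going past read failures instead of aborting, and
conv=sync pads every short or failed read out to the full 4096 bytes,
so each unreadable block comes through as zeroes and everything after
it stays at the correct offset -- which is exactly the "aligned image"
asked about above.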
|
Working on a good disk, a relatively large block size (as long as it
can be buffered in memory) is often more efficient, that is, faster,
because the big blocks mean lower processing overhead. On a partially
bad disk, larger blocks will still let it cover the good areas faster
(though that's trivial time anyway, compared to the time spent trying
to access the bad blocks), AND, because each block is larger, it SHOULD
mean fewer bad blocks to try and try and try before giving up in the
bad areas too, so faster there as well.
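If you want to see the overhead difference for yourself before
committing to a full pass, a quick read-only comparison over the same
32 MiB (the counts are arbitrary, just chosen to match) makes it
obvious:

  # time dd if=/dev/hdb of=/dev/null bs=512 count=65536
  # time dd if=/dev/hdb of=/dev/null bs=4M count=8

The first issues 65536 separate reads where the second issues eight,
and the difference in wall-clock time shows up on most hardware.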
|
The flip side to the faster pass over the bad areas is that, as I said,
the block is the chunk size that gets declared bad, so the larger the
block size you choose, the more potentially recoverable data gets
thrown out with each block that's declared bad.
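That tradeoff is exactly what the dedicated rescue tools automate, so
they may be worth a look before settling on a single bs. A sketch with
GNU ddrescue (Garloff's dd_rescue uses different options, and the map
file name here is just an example):

  # ddrescue -n /dev/hdb /mnt/disk_500/sdb.img sdb.map
  # ddrescue -r3 /dev/hdb /mnt/disk_500/sdb.img sdb.map

The first pass copies the easy areas quickly and skips over trouble;
the second goes back and retries only the regions the map file recorded
as bad, giving you large-block speed AND small-block granularity.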
|
As for working off the bad disk vs. working off an image of it: as long
as you can continue to recover data off the bad disk, you can keep
trying to use it. The problem, of course, is that every access might be
your last, and it's also possible that each pass loses a few more
blocks of data at the margin.
|
So it's up to you. The aligned image will certainly be easier to work
with, but you might not be able to get the same amount of valid data off.
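One point in the image's favor: once you have it, everything else can
be done without touching the failing hardware again. For ext2/3, a
sketch of where I'd start (this assumes the image covers a single
partition -- a whole-disk image needs an offset= mount option -- and
/dev/hdb1, /mnt/rescue, and the 32768 backup-superblock location are
illustrative guesses, not givens):

  # mke2fs -n /dev/hdb1
  # e2fsck -b 32768 /mnt/disk_500/sdb.img
  # mount -o ro,loop /mnt/disk_500/sdb.img /mnt/rescue

mke2fs -n only prints what it *would* do, including where the backup
superblocks would sit for a filesystem of that size; e2fsck -b then
points the checker at one of those backups; and the read-only loop
mount lets you copy files off without risking any further writes.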
|
... You never mentioned exactly what happened to the disk. Mine was
overheating. I live in Phoenix, AZ, and my AC went out in the middle of
the summer, with me gone and the computer left running. With outside
temps often reaching close to 50 C (122 F), the temps inside with the AC
off could have easily reached 60 C (140 F). Ambient case air temps could
therefore have reached 70 C, and with the drive spinning in that... one
can only guess what temps it reached!
|
Well, rather obviously, the platters expanded and the heads crashed,
grooving out a circle in the platter wherever they happened to be at
the time, plus wherever the still-running operating system told the
heads to seek to. However, once I came home and realized what had
happened, I shut down and let everything cool off. After replacing the
AC, with everything running at normal temps again, I was able to boot
back up.
|
I ended up with two separate heavily damaged areas in which I could
recover little if anything, but fortunately, the partition table and
superblocks were intact. I had also been keeping backup copies of most
of my valuable stuff, partition by partition, and was able to recover
most of it from those (barring the new stuff since my last backup,
which was longer ago than it should have been), since the backup
partitions had been unmounted at the time and therefore didn't have the
heads seeking into them, only across them a few times.
|
Actually, perhaps surprisingly, I was able to run those disks for some
time without any known additional damage. I did switch disks as soon as
possible, because I was leery of continuing to depend on the partially
bad ones, but in the meantime I just wrote off the affected partitions
as dead and continued to use the others without issue. In fact, I still
have the disk, and might still be using it for extra storage, except
that it was the second disk I had lost in two years (looking back, the
one I'd lost the previous year was probably heat-related as well, since
it had the same failure pattern and the AC wasn't doing so well even
then), so I decided to switch to RAID, on slower but longer-warranty
(5 yr) Seagate drives. Those are now going into their third year
without issue (and with a new AC with cooling capacity to spare, so
hopefully it'll be several years before I need to worry about /that/
issue again), but at least now I have the RAID backing me up, with most
of the system on kernel/md RAID-6, so I can lose up to two of the four
drives and maintain data integrity. I am, however, already thinking
about how I'll do it better next time, now that I've a bit of RAID
experience under my belt. =8^)
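For the curious, the kernel/md side of that is a single mdadm
invocation; a minimal sketch of a four-drive RAID-6 like the one above
(device names and partitioning assumed for illustration):

  # mdadm --create /dev/md0 --level=6 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
  # cat /proc/mdstat

With four drives at RAID-6, two drives' worth of space goes to parity,
which is exactly where the "lose any two and keep your data" property
comes from; /proc/mdstat lets you watch the initial sync.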
|
So anyway, if it was heat-related, chances are pretty decent it'll
remain relatively stable, with no additional data loss, as long as you
keep a pretty strict watch on the temps and don't let it overheat
again. That was my experience this last time, when I know it was
heat-related, and the time before, which had the same failure pattern,
so I'm guessing it was heat-related too. Of course, you never can tell,
but that has been my experience with heat-related disk failures,
anyway.
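Keeping that watch is easy enough to script; a sketch assuming
smartmontools and the small hddtemp utility are installed (SMART
attribute names vary somewhat by vendor):

  # smartctl -A /dev/hdb | grep -i temp
  # hddtemp /dev/hdb

smartctl -A dumps the drive's SMART attribute table, which on most
drives includes a Temperature_Celsius line, and hddtemp just prints the
current reading, so either one dropped into a cron job makes a cheap
early-warning system.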
|
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
|
--
gentoo-amd64@g.o mailing list