Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: fsck seems to screw up my harddisk
Date: Sun, 03 Dec 2006 17:06:52
Message-Id: ekv037$npb$1@sea.gmane.org
In Reply to: Re: [gentoo-amd64] fsck seems to screw up my harddisk by Guido Doornberg
1 "Guido Doornberg" <guidodoornberg@×××××.com> posted
2 eb2db630612030540k38b658b2q3571a712efc510d0@××××××××××.com, excerpted
3 below, on Sun, 03 Dec 2006 14:40:11 +0100:
4
5 > Well, I downloaded and started a fresh 2006.1 livecd, repartitioned de
6 > hdd, started mke2fs and this time with the -c option.
7 >
8 > So, it started checking and after about 15 minutes this kept on showing up
9 > on my screen:
10 >
11 > ata1: error=ox40 {uncorrectable error} ata1: translated ATA stat/err
12 > 0X51/40 SCSI SK/ASC/ASCQ 0x3/11/04
13 >
14 > after a while i got a couple of other messages, and now it keeps on
15 > talking about Buffer I/O error on device sda3, and after that various
16 > sectors and blocks are called.
17 >
18 > I did look after my power supply and I'm for 99% sure that's not the
19 > problem. So, correct me if i'm wrong but that would mean my harddisk is
20 > the problem? But how than is it possible that I can use it normally if I
21 > don't let fsck check it?
22 >
23 > I know this isn't really gentoo specific anymore, but if anyone knows what
24 > to do i'm happy to hear it.
25
26 Your suspiciouns seem correct to me as well.
27
28 I've had several hard drives go partially bad over the last several years.
29 The last one I know was due to heat as I'm in Phoenix, AZ, with highs in
30 the summer approaching 50 C (122 F), and my AC went out. Since it
31 followed the same basic pattern of another one previous to that, I expect
32 the problem with the previous one was heat as well tho I'm not positive.
33
34 What happens when the drives overheat is the platters expand and the heads
35 crash into them, thereby digging grooves (which I could see taking the
36 drive apart later) in the platters. Of course, the data will be destroyed
37 at for those disk cylinders, basically wherever the head seeked to
38 while the platter was hot enough to crash it, but the rest of the drive is
39 recoverable, and from my experience, somewhat stable, provided the drive
40 doesn't overheat again. Due to the way I have my system setup (see below)
41 and what was damaged, I was actually able to continue to use the system
42 for some time. Never-the-less, get anything you want saved off it ASAP,
43 preferably leaving it shut off until you can, just in case, after which
44 you can work around the problem if you wish, marking badblocks, and use
45 the disk for either temp stuff only or always backed up stuff, from then
46 on.

It's possible for similar head-crashes to happen to drives used in
mobile applications, due to dropping the laptop or whatever, and there
may be other ways to produce that damage pattern as well.

How to work around the issue? First, as I said, back up the disk, or
at least anything of value on it. That likely won't apply here since
you were setting up a new system anyway, but for completeness... If
you run into areas that won't easily copy, and you want to recover the
data if possible, there's a package, sys-apps/dd-rescue, and it should
be available on any good recovery LiveCD. (I doubt it's on the Gentoo
install CDs, but you can check.) dd_rescue is the same idea as the
normal Unix dd utility, but the rescue version is designed to read
from the beginning of a partition forward until it runs into problems,
then from the end backward, then in the middle of anything still left
unread, until it has copied as much of the partition as possible. You
can then fsck the recovered copy and see what can be repaired. Note,
however, that this process will take a while, hours, possibly days,
depending on how much of the disk is damaged, as the drive tries
several times to read the data, and if it fails, the software will
have it try /again/ several times. Depending on your i/o system, you
aren't likely to be able to do much else with the machine while this
is going on, as it'll tend to lock things up pretty badly during the
try-fail-retry phase. This is repeated for each bad block, so it WILL
take a while if more than a handful of blocks are damaged. Recovery of
all the data is obviously not guaranteed in any case, and you may
simply decide it's not worth the hassle. Google or see the dd_rescue
manpage for details.

It should be noted that dd_rescue can be configured to report the
badblocks as it goes, so if you use it to recover existing data, you
can save its badblocks report for reuse later and skip the separate
badblocks mapping step below.
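
As a minimal sketch (device and file names here are only examples, and
options differ between dd_rescue versions, so check the manpage), the
recovery might look something like this:

  # Copy as much of the dying partition as possible to an image file,
  # logging progress and saving the bad-block list for reuse later.
  dd_rescue -v -l sda3.log -o sda3.bb /dev/sda3 sda3.img

  # Then fsck the recovered copy, not the failing disk itself.
  e2fsck -f sda3.img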

If you skip data recovery, or simply want to test any disk before you
use it, you'll want another app, badblocks, likely installed already
as part of sys-fs/e2fsprogs. badblocks can scan the disk in
(non-destructive) read-only mode, in non-destructive
read/write-back/read-back/compare mode, or in destructive
write-pattern/read-back/compare mode. Do NOT use the destructive mode
if there's anything on the partition you want to keep, as it WILL
overwrite it.
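
For example (again, the device name is just an illustration), the
three modes look like this; -s shows progress, -v is verbose, and -o
saves the block list to a file:

  # read-only scan, safe on live data
  badblocks -sv -o sda3.bb /dev/sda3

  # non-destructive read-write scan (slower, still preserves data)
  badblocks -nsv -o sda3.bb /dev/sda3

  # DESTRUCTIVE write-pattern scan: wipes the whole partition
  badblocks -wsv -o sda3.bb /dev/sda3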

However you generate the badblocks report, whether from the output of
dd_rescue or from badblocks itself, you then use that information when
setting up the disk again. It's probably wise to set up multiple
partitions, leaving the large bad areas unpartitioned. For smaller bad
areas of just a handful of blocks, one of the parameters you can feed
mkfs is a badblocks list. Again, check the manpages or google for the
details, but when you are done, you should be left with a working and
fsck-able set of partitions once again, since the badblocks are either
excluded from the areas you partitioned, or listed as badblocks in the
superblock area of the filesystem you created with that mkfs
parameter, and therefore avoided.
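
A sketch of that last step, assuming ext2/3 and the hypothetical block
list file from above:

  # create the filesystem, marking the blocks listed in sda3.bb bad
  mke2fs -l sda3.bb /dev/sda3

  # or skip the separate scan and let mke2fs run badblocks itself
  mke2fs -c /dev/sda3

  # an existing filesystem can have blocks added to its bad list too
  e2fsck -l sda3.bb /dev/sda3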

---
* For reliability purposes, I had my system set up with multiple
copies of most of my partitions. The idea was that periodically, when
the system seemed stable, I'd back up my main working copy of all the
critical partitions, and could therefore boot a not-too-old backup in
the event something broke on the main working copy. Basically, all it
took (and all it continues to take) is appending a different root=
parameter to the kernel command line to boot the root mirror. When
portions of the drive were damaged, they were naturally the portions
the head had tried to seek to while the drive was overheated, which
means they were in the partitions mounted at the time. The unmounted
partitions were therefore undamaged, and after finding the system
crashed due to the overheating, once I cooled things back down, I
could boot the backup partitions and resume from there. As it
happened, only a couple of my working partitions were damaged, and I
was able to use the working copies of all the others.
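
With grub-legacy, for instance, that's nothing more than a different
root= on the kernel line (the partitions here are hypothetical):

  # normal boot, main working root
  kernel /boot/vmlinuz root=/dev/sda5

  # rescue boot, same kernel, backup root snapshot
  kernel /boot/vmlinuz root=/dev/sda6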

In terms of partitioning strategy... with my old system I made the
mistake of separating /var and /usr onto their own partitions, and
then trying to mix and match backup partitions with working-copy
partitions. That didn't work so well, because the portage records of
what was installed came from the backup, and therefore outdated, /var
partition, while /usr and root were the working copies, so portage had
the wrong package versions recorded as installed. Since I had used
FEATURES=buildpkg and had all the packages available in binary form,
it was easy to simply reinstall everything from them, updating the
portage database, but because that database wasn't accurate, it
couldn't unmerge the no-longer-existing old versions, so I ended up
with a bunch of stale and orphaned files strewn around.

When I upgraded from that disk, which I did as soon as I could since I
didn't trust it even tho it was working, I therefore set things up a
bit differently. What I'd suggest today would be keeping /var and /usr
on your root partition, but putting /var/log and /var/tmp and
/usr/portage and /usr/src, as well as stuff such as /home, on other
partitions. (You can use a single partition and either use mount
--move or simply symlink, if you want to put several dirs from
different places in the tree on the same partition; see the sketch
below.)
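
As an illustration (device and mountpoints hypothetical), symlinks are
the simplest way to do that; mount --bind (not mentioned above, but
often the cleaner tool for attaching a subdirectory) and mount --move
are the mount-based alternatives:

  # one partition holding several trees, attached via symlinks
  mount /dev/sda7 /mnt/aux
  ln -s /mnt/aux/portage /usr/portage
  ln -s /mnt/aux/src /usr/src

  # or attach a subdirectory with a bind mount instead of a symlink
  mount --bind /mnt/aux/portage /usr/portage

  # mount --move relocates an existing mount point atomically
  mount --move /mnt/aux /mnt/elsewhere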

Basically, anything that portage installs stuff to, along with its
database in /var/db, should be kept on the same partition, so every
backup of that partition will have the portage database in sync with
what's actually installed, since it's all on the same partition.
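
In /etc/fstab terms, such a layout might look like this sketch
(devices, filesystem types and options are all just examples); note
that /usr, /var and /var/db stay on the root partition:

  /dev/sda5   /             ext3   noatime   0 1
  /dev/sda6   /home         ext3   noatime   0 2
  /dev/sda7   /usr/portage  ext3   noatime   0 2
  /dev/sda8   /var/log      ext3   noatime   0 2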

Here, my / partition and its backup snapshots are 10 GB each. That's
plenty of room to spare for me, since less than two GB are actually
used. I'd recommend a total of three copies: the working or main copy,
and two snapshot backups of exactly the same partition size. The idea
is that you alternate backups, so even if the working copy dies right
after you've erased one backup in preparation for copying over the
working system as a new snapshot, leaving that backup erased or
incomplete, you'll still have the other backup to fall back on.
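
The snapshot refresh itself can be as simple as this sketch (devices
hypothetical; cp -a preserves ownership and permissions, -x keeps it
from crossing filesystem boundaries):

  # recreate the older of the two snapshot filesystems, then fill it
  mke2fs -j /dev/sda9
  mount /dev/sda9 /mnt/rootsnap
  cp -ax / /mnt/rootsnap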

Similarly with partitions such as /home and /usr/local that hold data
I want to be sure to keep: 2-3 copies of each, a working copy and 1-2
backups. /var/log you probably don't need a copy of. Same with
wherever you keep your portage tree, since you can always just sync to
get another if it's destroyed, and with /tmp and /var/tmp, since
that's temp data anyway and doesn't need a redundant copy.

Actually, while that scheme can be implemented well on one or two
disks, I eventually got tired of hard drive problems, and I'm now
running a four-disk kernel-based SATA RAID, Seagate drives with a 5-yr
warranty, altho they aren't quite as fast as some of the others you
can buy. Booting requires RAID-1, so I have a small RAID-1 partition
mirrored across all four drives. That's /boot. Most of my system is
RAID-6, which in a four-way setup is effectively a two-way stripe with
two parity stripes as well. Thus, I can lose any two of the four
drives and anything on the RAID-6 will still be recoverable. Stuff
like /tmp, the portage tree, etc., that's either easily redownloaded
off the net or temporary anyway, is on a 4-way RAID-0 for speed. If
any of the four drives goes down, all that data is lost, but that's
fine, since it's either temporary or easily recovered. Likewise, my
swap is four-way striped.

Read/write speed on this four-way striped area is incredibly fast (for
hard drive access), since drives are so much slower than the bus
connecting them to the system; striping lets the system keep the bus
busy doing i/o to all four devices instead of waiting on a single slow
drive. The problem with RAID-0, however, is that while it's far
faster, it's also far riskier, since you lose the whole array if you
lose any component device. Fortunately, the data that's easiest to
replace is also generally the most speed critical, so it works out
quite well. =8^)

So: RAID-1 mirroring for /boot, RAID-6 for safety for most of the
system, and RAID-0 for speed where I don't care if the data dies. On
top of that, for the parts of the system I really care about, I keep
several snapshots around on the RAID-6, protecting me both from
fat-finger deletions (where RAID won't help, unfortunately) with the
multiple snapshots, and from device failure with the RAID-6. As an
added bonus, since I'm running kernel RAID, it's not hardware
specific, so if the SATA chip dies, all I have to do is buy a new
4-way SATA board, plug the existing drives into the new board, and
compile a new kernel (from a liveCD or whatever) with the appropriate
new SATA drivers, and I'm up and running again. If I had gone with
hardware RAID and it died, I'd have to find another controller like it
to recover my data, something I don't have to worry about with kernel
RAID. =8^)
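
Roughly, such a set of arrays would be created along these lines with
mdadm (metadevice and partition names are examples, not my actual
layout):

  # 4-way RAID-1 mirror for /boot
  mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1

  # RAID-6 for the main system: any two drives can die
  mdadm --create /dev/md1 --level=6 --raid-devices=4 /dev/sd[abcd]2

  # 4-way RAID-0 stripes for speed: tmp, portage tree, swap
  mdadm --create /dev/md2 --level=0 --raid-devices=4 /dev/sd[abcd]3
  mdadm --create /dev/md3 --level=0 --raid-devices=4 /dev/sd[abcd]4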

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
gentoo-amd64@g.o mailing list

Replies

Subject Author
Re: [gentoo-amd64] Re: fsck seems to screw up my harddisk Drake Donahue <donahue95@×××××××.net>