Gentoo Archives: gentoo-user

From: Laurence Perkins <lperkins@×××××××.net>
To: "gentoo-user@l.g.o" <gentoo-user@l.g.o>
Subject: RE: [gentoo-user] Hard drive error from SMART
Date: Tue, 12 Apr 2022 17:21:40
Message-Id: MW2PR07MB4058789A23D85453147A7DFFD2ED9@MW2PR07MB4058.namprd07.prod.outlook.com
In Reply to: Re: [gentoo-user] Hard drive error from SMART by Dale
1 > -----Original Message-----
2 > From: Dale <rdalek1967@×××××.com>
3 > Sent: Tuesday, April 12, 2022 10:08 AM
4 > To: gentoo-user@l.g.o
5 > Subject: Re: [gentoo-user] Hard drive error from SMART
6 >
7 > Rich Freeman wrote:
8 > > On Mon, Apr 11, 2022 at 9:27 PM Dale <rdalek1967@×××××.com> wrote:
9 > >> Thoughts. Replace as soon as drive arrives or wait and see?
10 > >>
11 > > So, first of all just about all my hard drives are in a RAID at this
12 > > point, so I have a higher tolerance for issues.
13 > >
14 > > If a drive is under warranty I'll usually try to see if they will RMA
15 > > it. More often than not they will, and in that case there is really
16 > > no reason not to. I'll do advance shipping and replace the drive
17 > > before sending the old one back so that I mostly have redundancy the
18 > > whole time.
19 > >
20 > > If it isn't under warranty then I'll scrub it and see what happens.
21 > > I'll of course do SMART self-tests, but usually an error like this
22 > > won't actually clear until you overwrite the offline sector so that
23 > > the drive can reallocate it. A RAID scrub/resilver/etc will overwrite
24 > > the sector with the correct contents which will allow this to happen.
25 > > (Otherwise there is no way for the drive to recover - if it knew what
26 > > was stored there it wouldn't have an error in the first place.)
27 > >
28 > > If an error comes back then I'll replace the drive. My drives are
29 > > pretty large at this point so I don't like keeping unreliable drives
30 > > around. It just increases the risk of double failures, given that a
31 > > large hard drive can take more than a day to replace. Write speeds
32 > > just don't keep pace with capacities. I do have offline backups but I
33 > > shudder at the thought of how long one of those would take to restore.
34 > >
35 >
36 >
37 > Sadly, I don't have RAID here but to be honest, I really need to have it given the data and my recent luck with hard drives. Drives used to get dumped because they were just too small to use anymore. Nowadays, they seem to break in some fashion long before they stop being useful.
38 >
39 > I remounted the drives and did a backup. For anyone running into this, just in case one of the files got corrupted, I used a little trick to see if I could figure out which one may be bad, if any. I took my rsync commands from my little script and ran them one at a time with --dry-run added. If a file that I hadn't changed or added was going to be updated on the backup, I was going to check into it before updating my backups. It could be that the backup file was still good and the file on the drive reporting problems was bad. In that case, I would determine which was good and either restore it from backups or allow it to be updated if needed. Either way, I should have a good file since the drive claims to have fixed the problem. Now let us pray. :-D
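
The --dry-run check described above can also be approximated without rsync. The following is only a minimal sketch in Python, not the script Dale used: it hashes file contents instead of comparing size and modification time (rsync's default), and the names `file_digest` and `changed_files` are made up for illustration. Like a dry run, it only reports and never copies anything.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(source: Path, backup: Path) -> list[str]:
    """List relative paths present in both trees whose contents differ.

    Files that exist only in the source are skipped: they are new,
    not candidates for silent corruption. Nothing is modified.
    """
    differing = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = backup / rel
        if dst.is_file() and file_digest(src) != file_digest(dst):
            differing.append(rel.as_posix())
    return differing
```

Any path it reports that you know you did not edit is worth inspecting by hand before letting the real backup run overwrite the older copy.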
40 >
41 > Drive isn't under warranty. I may have to start buying new drives from dealers. Sometimes I find drives that are pulled from systems and have very few hours on them. Still, warranty may not last long. Saves a lot of money tho.
42 >
43 > USPS claims drive is on the way. Left a distribution point and should update again when it gets close. First said Saturday, then said Friday. I think Friday is about right but if the wind blows right, maybe Thursday.
44 >
45 > I hope I have another port and power cable plug for the swap out. At least now, I can unmount it and swap without a lot of rebooting. Since it's on LVM, that part is easy. Regretfully I have experience on that process. :/
46 >
47 > Thanks to all.
48 >
49 > Dale
50 >
51 > :-) :-)
52 >
53 >
54 You can get SATA PCIe cards with up to 16 ports these days for pretty cheap. So as long as you have the power to run another drive or two, there's not much reason not to do RAID on the important stuff. Also, the SATA protocol allows for port multipliers, which are also pretty cheap.
55
56 One of my favorite things about BTRFS is the data checksums. If the drive returns garbage, it turns into a read error. Also, if you can't do real RAID but have excess space, you can tell it to keep two copies of everything (the DUP profile). That doesn't help with total drive failure, but it does protect against the occasional failed sector, if you don't mind writes taking twice as long.
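
To make the checksum-plus-two-copies idea concrete, here is a toy model in Python. It is only a sketch of the principle, not how BTRFS is implemented: the class name `ChecksummedStore` is invented, blocks live in plain dictionaries, and it uses `zlib.crc32` as a stand-in for the filesystem's own checksum. A read returns data only if some copy matches the stored checksum, rewrites any bad copy from the good one, and raises an error when every copy is garbage.

```python
import zlib


class ChecksummedStore:
    """Toy block store: every block is kept in two copies plus a CRC32."""

    def __init__(self):
        self.copies = [{}, {}]  # two independent copies of each block
        self.checksums = {}     # block id -> CRC32 of the data as written

    def write(self, block_id, data: bytes) -> None:
        """Store the checksum once and the data twice."""
        self.checksums[block_id] = zlib.crc32(data)
        for copy in self.copies:
            copy[block_id] = data

    def read(self, block_id) -> bytes:
        """Return verified data, repairing bad copies; raise if none verify."""
        expected = self.checksums[block_id]
        good = next(
            (c[block_id] for c in self.copies
             if zlib.crc32(c[block_id]) == expected),
            None,
        )
        if good is None:
            # Garbage on every copy becomes a read error, not bad data.
            raise OSError(f"checksum mismatch on every copy of block {block_id!r}")
        for copy in self.copies:
            copy[block_id] = good  # overwrite any corrupted copy
        return good
```

With a single copy, the same check could still detect the bad sector but would have nothing to repair it from, which is the difference the second copy buys.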
57
58 LMP
