Gentoo Archives: gentoo-user

From: Laurence Perkins <lperkins@×××××××.net>
To: "gentoo-user@l.g.o" <gentoo-user@l.g.o>
Subject: RE: [gentoo-user] Hard drive error from SMART
Date: Tue, 12 Apr 2022 19:41:18
Message-Id: MW2PR07MB4058E8318890220AF1956C40D2ED9@MW2PR07MB4058.namprd07.prod.outlook.com
In Reply to: Re: [gentoo-user] Hard drive error from SMART by Dale
>-----Original Message-----
>From: Dale <rdalek1967@×××××.com>
>Sent: Tuesday, April 12, 2022 11:22 AM
>To: gentoo-user@l.g.o
>Subject: Re: [gentoo-user] Hard drive error from SMART
>
>Laurence Perkins wrote:
>>> -----Original Message-----
>>> From: Dale <rdalek1967@×××××.com>
>>> Sent: Tuesday, April 12, 2022 10:08 AM
>>> To: gentoo-user@l.g.o
>>> Subject: Re: [gentoo-user] Hard drive error from SMART
>>>
>>> Rich Freeman wrote:
>>>> On Mon, Apr 11, 2022 at 9:27 PM Dale <rdalek1967@×××××.com> wrote:
>>>>> Thoughts. Replace as soon as drive arrives or wait and see?
>>>>>
>>>> So, first of all just about all my hard drives are in a RAID at this
>>>> point, so I have a higher tolerance for issues.
>>>>
>>>> If a drive is under warranty I'll usually try to see if they will
>>>> RMA it. More often than not they will, and in that case there is
>>>> really no reason not to. I'll do advance shipping and replace the
>>>> drive before sending the old one back so that I mostly have
>>>> redundancy the whole time.
>>>>
>>>> If it isn't under warranty then I'll scrub it and see what happens.
>>>> I'll of course do SMART self-tests, but usually an error like this
>>>> won't actually clear until you overwrite the offline sector so that
>>>> the drive can reallocate it. A RAID scrub/resilver/etc will
>>>> overwrite the sector with the correct contents which will allow this to happen.
>>>> (Otherwise there is no way for the drive to recover - if it knew
>>>> what was stored there it wouldn't have an error in the first place.)
>>>>
>>>> If an error comes back then I'll replace the drive. My drives are
>>>> pretty large at this point so I don't like keeping unreliable drives
>>>> around. It just increases the risk of double failures, given that a
>>>> large hard drive can take more than a day to replace. Write speeds
>>>> just don't keep pace with capacities. I do have offline backups but
>>>> I shudder at the thought of how long one of those would take to restore.
>>>>
>>>
>>> Sadly, I don't have RAID here but to be honest, I really need to have it given the data and my recent luck with hard drives. Drives used to get dumped because they were just too small to use anymore. Nowadays, they seem to break in some fashion long before their usefulness ends their lives.
>>>
>>> I remounted the drives and did a backup. For anyone running up on
>>> this, just in case one of the files got corrupted, I used a little
>>> trick to see if I can figure out which one may be bad if any. I took
>>> my rsync commands from my little script and ran them one at a time
>>> with --dry-run added. If a file was to be updated on the backup that
>>> I hadn't changed or added, I was going to check into it before
>>> updating my backups. It could be that the backup file was still good
>>> and the file on my drive reporting problems was bad. In that case, I
>>> would determine which was good and either restore it from backups or
>>> allow it to be updated if needed. Either way, I should have a good
>>> file since the drive claims to have fixed the problem. Now let us
>>> pray. :-D
>>>
>>> Drive isn't under warranty. I may have to start buying new drives from dealers. Sometimes I find drives that are pulled from systems and have very few hours on them. Still, warranty may not last long. Saves a lot of money tho.
>>>
>>> USPS claims drive is on the way. Left a distribution point and should update again when it gets close. First said Saturday, then said Friday. I think Friday is about right but if the wind blows right, maybe Thursday.
>>>
>>> I hope I have another port and power cable plug for the swap out. At
>>> least now, I can unmount it and swap without a lot of rebooting.
>>> Since it's on LVM, that part is easy. Regretfully I have experience
>>> on that process. :/
>>>
>>> Thanks to all.
>>>
>>> Dale
>>>
>>> :-) :-)
>>>
>>>
>> You can get up to 16X SATA PCI-e cards these days for pretty cheap. So as long as you have the power to run another drive or two there's not much reason not to do RAID on the important stuff. Also, the SATA protocol allows for port expanders, which are also pretty cheap.
>>
>> One of my favorite things about BTRFS is the data checksums. If the drive returns garbage, it turns into a read error. Also, if you can't do real RAID, but have excess space you can tell it to keep two copies of everything. Doesn't help with total drive failure, but does protect against the occasional failed sector. If you don't mind writes taking twice as long anyway.
>>
>> LMP
>
>
>I looked into a card a good while back and they were pretty pricey at the time. You happen to have some search terms I can search for on ebay, Amazon etc? I know some chipsets work better on Linux out of the box. I don't need to buy one that doesn't work or only works with the threat of a sledge hammer. lol I've also looked into that other thing, SAS? or something. It's been a while tho.
>
>I'm pretty good at doing backups. I do Gentoo updates on Saturday, and sometimes Sunday. While the updates are downloading, I update my backups. It's almost like a religion for me. I was just more cautious earlier. I suspect a file could be corrupted somewhere but wanted to be sure it wasn't something important. I have some files that if lost, I may not can download again. They don't exist. A few I got from some Govt archive that are really old but since removed, or at least I can't find them anymore.
>
>I've given serious thought to switching to BTRFS. Thing is, I'm still trying to get LVM figured out. Plus, LVM is well maintained and should be for a good long while, plus it works for me. Still, if I could afford to have several new drives all at once, I'd certainly play with it. It could very well be better. The one thing I wish, LVM had a GUI where you could do everything from it. During my recent rearrangement of drives, I learned that you can't do a lot of things within webmin. It does some things but not everything. Plus, you have to have a running GUI to use it. In that case, I had to unmount /home which meant no KDE, so no Webmin either. Still, that could cause trouble too. I dunno.
>
>Thanks.
>
>Dale
>
>:-) :-)
>
>

I went with a couple of these, https://www.amazon.com/MZHOU-Profile-Bracket-Support-Converter/dp/B08L7W8QFT/, in a couple of different sizes for two of my mass storage systems, and they seem to be doing OK.

The difference between the cheap vendors and the expensive vendors these days tends to be quality control. So plug it in, load it up, and run it hard for a few hours. If it doesn't die relatively quickly, you're usually good.
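
For the "run it hard" step, something like the following is one way to do it. This is only a rough sketch, not a real burn-in tool: the function names, the block size, and the use of SHAKE-256 as a cheap pattern generator are my own choices for illustration. It writes a reproducible pattern to a file on the drive behind the new controller, then reads it back and reports which blocks came back different. Note that the read-back can be served from the page cache, so for a meaningful test you would want to write more data than you have RAM, or drop caches in between.

```python
import hashlib
import os


def _block(seed: bytes, index: int, block_size: int) -> bytes:
    # Derive each block deterministically from the seed and its index,
    # so the expected contents can be regenerated at verify time.
    return hashlib.shake_256(seed + index.to_bytes(8, "little")).digest(block_size)


def write_pattern(path, n_blocks, block_size=1 << 20, seed=b"burn-in"):
    """Fill `path` with n_blocks of reproducible pseudo-random data."""
    with open(path, "wb") as f:
        for i in range(n_blocks):
            f.write(_block(seed, i, block_size))
        f.flush()
        os.fsync(f.fileno())  # push the data out of userspace buffers


def verify_pattern(path, n_blocks, block_size=1 << 20, seed=b"burn-in"):
    """Return the indices of blocks that do not match what was written."""
    bad = []
    with open(path, "rb") as f:
        for i in range(n_blocks):
            if f.read(block_size) != _block(seed, i, block_size):
                bad.append(i)
    return bad
```

Calling write_pattern() and then verify_pattern() in a loop for a few hours, with a path on the new drive, gives the controller a decent workout. An empty list from verify_pattern() means every block read back intact.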

Especially if you have RAID with checksums, it's difficult for a controller to mangle things too badly even if it does have an issue.

Remember: data does not exist if it doesn't exist in at least three places. So you still want off-site backups in case your house burns down, especially for irreplaceable things.

If you have friends who also want off-site backups and you leave your machines running all the time, then tahoe-lafs is pretty decent. For that matter, they don't even really have to be friends; you only have to be able to trust them not to selfishly hog all the space.

I use BTRFS RAID1 for a lot of stuff. So far it's been pretty good at catching dropped bits and recovering from failures. It shares the usual RAID problem that a second drive could fail while you're doing a recovery, since it only guarantees integrity with one dud drive regardless of the number of drives in the pool. But since each chunk is only written to two drives instead of being spread across all of them, the rebuild time stays relatively short, and even if another drive does fail you'll only lose some of the data instead of all of it. This also means that the wasted space when your drives aren't all the same size is kept to a minimum.
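
To put rough numbers on the mixed-size point: since every chunk needs a copy on two different drives, the usable space works out to the smaller of "half the total" and "everything except the largest drive". Here is a back-of-the-envelope sketch of that rule. The function names are made up for illustration, and it ignores metadata overhead and chunk granularity, so treat the results as estimates.

```python
def raid1_usable(drive_sizes):
    """Estimate usable capacity of a two-copy RAID1 pool of mixed drives."""
    if len(drive_sizes) < 2:
        return 0  # two copies need at least two drives
    total = sum(drive_sizes)
    largest = max(drive_sizes)
    # Each chunk lives on two different drives, so the largest drive can
    # only be filled as far as the other drives can hold its mirrors.
    return min(total // 2, total - largest)


def raid1_wasted(drive_sizes):
    """Raw space that can never be paired with a mirror on another drive."""
    return sum(drive_sizes) - 2 * raid1_usable(drive_sizes)
```

For example, drives of 4, 4 and 8 TB give 8 TB usable with nothing wasted, while 2, 2 and 8 TB give only 4 TB usable with 4 TB of the big drive left unpaired.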

ZFS and similar are arguably better for larger arrays, but are also more hassle to set up.

LVM is good for being able to swap out drives easily, but with modern, huge drives you really want data checksums if you can get them. Otherwise all it takes is a flipped bit somewhere to wreck your data, and drive firmware doesn't always notice. I think you can get data checksums with LVM, but I've never looked into it for certain.
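
As a toy illustration of why checksums catch what the drive firmware misses, here is a sketch that stores a CRC alongside a block and checks it on read. It uses plain CRC-32 from Python's zlib purely for demonstration (BTRFS defaults to CRC32C, which is a different polynomial), and the helper names are invented for this example.

```python
import zlib


def store(data: bytes):
    """Return the block together with the checksum computed at write time."""
    return data, zlib.crc32(data)


def verify(data: bytes, checksum: int) -> bool:
    """Recompute the checksum on read and compare it with the stored one."""
    return zlib.crc32(data) == checksum


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate silent corruption by flipping a single bit."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


block, csum = store(b"irreplaceable archive data")
damaged = flip_bit(block, 42)
print(verify(block, csum))    # intact block passes
print(verify(damaged, csum))  # single flipped bit is detected
```

Without the stored checksum, the damaged block would be handed back as if nothing had happened. With it, the filesystem can turn the read into an error, or fetch the good copy from the other drive if there is one.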

LMP
