On 08/07/2013 20:27, Stefan G. Weichinger wrote:
> On 08.07.2013 17:58, Alan McKinnon wrote:
>> On 08/07/2013 17:39, Paul Hartman wrote:
>>> On Thu, Jul 4, 2013 at 9:04 PM, Paul Hartman
>>> <paul.hartman+gentoo@×××××.com> wrote:
>>>> ST4000DM000
>>>
>>> As a side-note, these two Seagate 4TB "Desktop" edition drives I
>>> bought have each encountered dozens of unreadable sectors after about
>>> 100 hours of power-on usage. I was able to correct them (force
>>> reallocation) using hdparm, so they should be "fixed", and I'm
>>> reading that this is "normal" with newer drives and "don't worry
>>> about it", but I'm still coming from the time when 1 bad sector =
>>> red alert, replace the drive ASAP. I guess I will need to monitor
>>> and see if it gets worse.
>>>
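[For anyone following along: the force-reallocation Paul mentions is usually done by writing the pending sector back with hdparm. The device name and sector number below are placeholders, and the write step destroys that sector's contents, so treat this as a rough sketch rather than a recipe:]

```shell
# Sketch of forcing reallocation of one pending sector with hdparm.
# /dev/sdX and SECTOR are placeholders -- take the real values from
# SMART output or the kernel log. The write is destructive to that
# sector, hence hdparm's explicit confirmation flag.
DEV=/dev/sdX
SECTOR=123456789

# 1. Confirm the sector really is unreadable:
hdparm --read-sector "$SECTOR" "$DEV"

# 2. Overwrite it with zeroes; the drive firmware remaps it if bad:
hdparm --write-sector "$SECTOR" --yes-i-know-what-i-am-doing "$DEV"

# 3. Watch the SMART counters afterwards (attribute 5 =
#    Reallocated_Sector_Ct, attribute 197 = Current_Pending_Sector):
smartctl -A "$DEV"
```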
17 |
>> |
18 |
>> |
19 |
>> Way back when in the bad old days of drives measured in 100s of megs, |
20 |
>> you'd get a few bad sectors now and then, and would have to mark them as |
21 |
>> faulty. This didn't bother us then much |
22 |
>> |
23 |
>> Nowadays we have drives that are 8,000 bigger than that so all other |
24 |
>> things being equal we'd expect sectors to fail 8,000 time more (more |
25 |
>> being a very fuzzy concept, and I know full well I'm using it loosely :-) ) |
26 |
>> |
27 |
>> Our drives nowadays also have smart firmware, something we had to |
28 |
>> introduce when CHS no longer cut it, this lead to sector failures being |
29 |
>> somewhat "invisible" leaving us with the happy delusion that drives were |
30 |
>> vastly reliable etc etc etc. But you know all this. |
31 |
>> |
>> A mere few dozen failures in the first 100 hours is a failure rate of
>> (Alan whips out the trusty sci calculator) 4.8E-6%. Pretty damn
>> spectacular if you ask me, and WELL within expected probabilities.
>>
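[Spelling that arithmetic out, as a sketch: assume a 4 TB drive with 4096-byte physical sectors and roughly 48 bad sectors -- both counts are guesses, which is why this lands near, not exactly on, the 4.8E-6% figure above:]

```shell
# Rough failure-rate arithmetic: ~48 bad sectors out of all the
# 4096-byte sectors on a 4 TB drive (both numbers are assumptions).
total=$(( 4000000000000 / 4096 ))   # ~976.5 million sectors
awk -v bad=48 -v total="$total" \
    'BEGIN { printf "%.1e%%\n", 100 * bad / total }'   # -> 4.9e-06%
```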
>> There is likely nothing wrong with your drives. If they are faulty,
>> it's highly likely a systemic manufacturing fault of the mechanicals
>> (servo systems, motor bearings, etc.)
>>
>> You do realize that modern hard drives have for the longest time been up
>> there in the Top X list of Most Reliable Devices Made By Mankind Ever?
>
> Does it make sense to apply some sort of burn-in procedure before
> actually formatting and using the disks? Running badblocks or something?
>
> I ask because I'm waiting for that shiny new server, and doing so might
> not hurt before installing gentoo. Or is that too paranoid and a waste
> of time?

If it makes you feel better, then by all means go through the motions.
|
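For completeness, the motions usually amount to a destructive badblocks pass plus the drive's own extended SMART self-test. The device name below is a placeholder, and the -w pass overwrites the whole disk, so this is a sketch for a brand-new, empty drive only:

```shell
# Destructive burn-in sketch for a NEW, EMPTY disk. /dev/sdX is a
# placeholder; badblocks -w overwrites every sector on the device.
badblocks -wsv /dev/sdX   # write/read-back test patterns, show progress

# Queue the drive's extended self-test; on a 4TB disk this takes many
# hours. Check progress and results later with: smartctl -a /dev/sdX
smartctl -t long /dev/sdX
```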
For my money, I reckon that's exactly what it is - motions and ritual. I
have only anecdotal evidence to back it up, but it's fairly strong
anecdotal evidence:
|
Over the last 5 years, the team I'm in, the teams we work closely with,
and the Storage guys have commissioned >1000 pieces of hardware and
probably more than 4000 drives, the vast majority from Dell. I have no
idea what burn-in Dell applies, if any. We've had our fair share of
infant mortality failures, probably fewer than 20 in 5 years. And here's
the kicker - every single one failed in production.
|
Most of that hardware, and ALL of the SANs, went through heavy
pre-deployment testing. Usually this means cloning the -dev system onto
it and running the crap out of it for a decent length of time. Once the
techies were happy, we'd install the production version and switch it on.
|
I conclude that the likely reason we only found failures in prod is that
only prod gives a decent, viable test that approximates real life; dev
is always a mere simulation. It's not raw usage that kills a few drives
early, it's the almost random pattern of disk access that you get in
real life. That tends to shake out the weak links better than any test.
|
However, this is all anecdotal, so use or discard as you see fit :-). I
no longer worry about data loss, as we have 4-hour warranty turnaround
SLAs in place and company policy is to only deploy storage that is
guaranteed to survive the loss of any one drive in an array.
|

--
Alan McKinnon
alan.mckinnon@×××××.com |