On 03/05/2020 22:46, Caveman Al Toraboran wrote:
> On Sunday, May 3, 2020 6:27 PM, Jack <ostroffjh@×××××××××××××××××.net> wrote:
>
> curious. how do people look at --layout=n2 in the
> storage industry? e.g. do they ignore the
> optimistic case where some 2-disk failures can be
> recovered, and only assume that it protects against
> 1 disk failure?

You CANNOT afford to be optimistic ... Murphy's law says you will lose
the wrong second disk.
>
> i see why gambling is not worth it here, but at
> the same time, i see no reason to ignore reality
> (that some 2-disk failures can be saved).
>
Don't ignore that some 2-disk failures CAN'T be saved ...

> e.g. a 4-disk RAID10 with --layout=n2 gives
>
>     1*4/10 + 2*4/10 = 1.2
>
> expected recoverable disk failures. details are
> below:
>
> now, if we do a 5-disk --layout=n2, we get:
>
>      1   (1)   2   (2)   3
>     (3)   4   (4)   5   (5)
>      6   (6)   7   (7)   8
>     (8)   9   (9)  10  (10)
>     11  (11)  12  (12)  13
>    (13)  ...
>
> obviously, there are 5 possible ways a single disk
> may fail, all 5 of which will be
> recovered.

Don't forget a 4+spare layout, which *should* survive a 2-disk failure.
>
> there are nchoosek(5,2) = 10 possible ways a 2-disk
> failure could happen, out of which 5
> will be recovered:
>
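
If you want to check that counting without drawing the layout out by
hand, here's a quick Python sketch. It assumes the usual n2 placement --
copy j of block b lands on disk (2*b + j) mod k -- which is exactly what
your diagram shows:

    from itertools import combinations

    def fatal_pairs(k, blocks=1000):
        # RAID10 --layout=n2 on k disks: copy j of block b lands on
        # disk (2*b + j) mod k, so each block's two copies sit on a
        # cyclically adjacent pair of disks.
        pairs = set()
        for b in range(blocks):
            pairs.add(frozenset({(2 * b) % k, (2 * b + 1) % k}))
        return pairs

    for k in (4, 5, 6):
        fatal = fatal_pairs(k)
        all_pairs = list(combinations(range(k), 2))
        ok = sum(1 for p in all_pairs if frozenset(p) not in fatal)
        print(k, "disks:", ok, "of", len(all_pairs),
              "2-disk failures survivable")

    # prints: 4 disks: 4 of 6, 5 disks: 5 of 10, 6 disks: 12 of 15
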
> so, by transforming a 4-disk RAID10 into a 5-disk
> one, we increase total storage capacity by a 0.5
> disk's worth of storage, while losing the ability
> to recover 0.2 disks (1.2 down to 1.0).
>
> but if we extend the 4-disk RAID10 into a
> 6-disk --layout=n2, we will have:
>
>                 6                      nchoosek(6,2) - 3
>     = 1 * -----------------   +  2 * -----------------
>           6 + nchoosek(6,2)          6 + nchoosek(6,2)
>
>     = 6/21 + 2 * 12/21
>
>     = 1.4286 expected recoverable failing disks.
>
> i.e. given a 2-disk failure, there is an 80%
> (12/15) chance of surviving it.
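
Whether pooling the k single-disk scenarios with the C(k,2) two-disk
scenarios as equally likely is meaningful is another question, but the
arithmetic itself is easy to sanity-check with a sketch like this:

    from math import comb

    def expected_recoverable(k, fatal):
        # fatal = number of 2-disk failures that lose data
        scenarios = k + comb(k, 2)
        survivable = comb(k, 2) - fatal
        return (1 * k + 2 * survivable) / scenarios

    print(expected_recoverable(4, 2))  # 1.2
    print(expected_recoverable(5, 5))  # 1.0
    print(expected_recoverable(6, 3))  # 1.4285714...
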
>
> so, i wonder, is it a bad decision to go with an
> even number of disks with a RAID10? what is the
> right way to think to find an answer to this
> question?
>
> i guess the ultimate answer needs knowledge of
> these:
>
> * F1: probability of having 1 disk fail within
>       the repair window.
> * F2: probability of having 2 disks fail within
>       the repair window.
> * F3: probability of having 3 disks fail within
>       the repair window.
>   .
>   .
>   .
> * Fn: probability of having n disks fail within
>       the repair window.
>
> * R1: probability of surviving a 1-disk failure.
>       equals 1 in all relevant cases.
> * R2: probability of surviving a 2-disk failure.
>       equals 0.5 with a 5-disk RAID10.
>       equals 0.8 with a 6-disk RAID10.
> * R3: probability of surviving a 3-disk failure.
>       equals 0 in all relevant cases.
>   .
>   .
>   .
> * Rn: probability of surviving an n-disk failure.
>       equals 0 in all relevant cases.
>
> * L : expected cost of losing data on an array.
> * D : price of a disk.
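
That model transcribes directly into code; something like this (the
numbers for F and L below are made-up placeholders, not measurements):

    def expected_cost(n, D, L, F, R):
        # n*D up front, plus, for each i, the chance F[i] of i disks
        # failing in the repair window times the chance (1 - R[i])
        # of not surviving it, times the expected loss L.
        return n * D + sum(Fi * (1 - Ri) * L for Fi, Ri in zip(F, R))

    # e.g. 6-disk vs 5-disk with hypothetical F1=0.05, F2=0.01, L=$1000:
    print(expected_cost(6, 35.85, 1000, [0.05, 0.01], [1.0, 0.8]))
    print(expected_cost(5, 35.85, 1000, [0.05, 0.01], [1.0, 0.5]))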

Don't forget, if you have a spare disk, the repair window is the length
of time it takes to fail-over ...
>
> this way, the absolute expected cost when adopting
> a 6-disk RAID10 is:
>
>     = 6D + F1*(1-R1)*L + F2*(1-R2)*L + F3*(1-R3)*L + ...
>     = 6D + F1*(1-1)*L + F2*(1-0.8)*L + F3*(1-0)*L + ...
>     = 6D + 0 + F2*(0.2)*L + F3*(1-0)*L + ...
>
> and the absolute cost for a 5-disk RAID10 is:
>
>     = 5D + F1*(1-1)*L + F2*(1-0.5)*L + F3*(1-0)*L + ...
>     = 5D + 0 + F2*(0.5)*L + F3*(1-0)*L + ...
>
> canceling identical terms, the difference in cost is:
>
>     6-disk ===> 6D + 0.2*F2*L
>     5-disk ===> 5D + 0.5*F2*L
>
> from here [1] we know that a 1TB disk costs
> $35.85, so:
>
>     6-disk ===> 6*35.85 + 0.2*F2*L
>     5-disk ===> 5*35.85 + 0.5*F2*L
>
> now, at which point is a 5-disk array a better
> economical decision than a 6-disk one? for
> simplicity, let LOL = F2*L:
>
>     5*35.85 + 0.5 * LOL < 6*35.85 + 0.2 * LOL
>     0.5*LOL - 0.2 * LOL < 6*35.85 - 5*35.85
>     LOL * (0.5 - 0.2)   < 6*35.85 - 5*35.85
>
>            6*35.85 - 5*35.85
>     LOL <  -----------------
>                0.5 - 0.2
>
>     LOL  < 119.5
>     F2*L < 119.5
>
> so, a 5-disk RAID10 is better than a 6-disk RAID10
> only if:
>
>     F2*L < 119.5 bucks.
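
The break-even is one line of arithmetic if you strip it down -- the
only inputs are the disk price and the two survival probabilities:

    D = 35.85             # 1TB disk price, from [1]
    r5, r6 = 0.5, 0.8     # P(survive a 2-disk failure), 5 vs 6 disks

    # 5 disks win when 5*D + (1-r5)*F2*L < 6*D + (1-r6)*F2*L
    print(D / ((1 - r5) - (1 - r6)))   # 119.5 -> F2*L < $119.50
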
>
> this site [2] says that 76% of seagate disks fail
> per year (:D). and since disks mostly fail
> independently of each other, the probability of
> having 2 disks fail in a year is:
>
153 |
76% seems incredibly high. And no, disks do not fail independently of |
154 |
each other. If you buy a bunch of identical disks, at the same time, and |
155 |
stick them all in the same raid array, the chances of them all wearing |
156 |
out at the same time are rather higher than random chance would suggest. |
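
Even taking independence at face value, that 76% figure condemns
itself. A quick binomial sketch (the 2% AFR is just my ballpark for a
sane modern drive, not a measured number):

    def p_at_least_2(n, p):
        # P(at least 2 of n disks fail within a year), assuming
        # independent failures with annual failure rate p
        return 1 - (1 - p)**n - n * p * (1 - p)**(n - 1)

    print(p_at_least_2(6, 0.76))  # ~0.996 -- two failures near-certain
    print(p_at_least_2(6, 0.02))  # ~0.006 with a ~2% AFR
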

That correlation is why, if a raid disk fails, the advice is always to
replace it asap. And, if possible, to recover the failed drive and copy
from that rather than hammer the rest of the raid.

Bear in mind that, no matter how many drives a raid-10 has, if
you're recovering onto a new drive, the data is stored on just two of
the other drives. So the chances of them failing as they get hammered
are a lot higher.

That's why it makes a lot of sense to monitor the SMART stats,
so you can replace any drive that looks like failing before it
actually does. And check the warranties. Expensive raid drives probably
have longer warranties, so when they're out of warranty consider
retiring them (they'll probably last a lot longer, but it's a judgement
call).
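
(smartmontools is the usual way to do that -- run smartd to poll the
drives and mail you when attributes start sliding, or check by hand
with something like "smartctl -H -A /dev/sdX".)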

All that said, I've been running a raid-1 mirror for a good few years,
and I've not had any trouble on my Barracudas.

Cheers,
Wol