Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted:

> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@×××.net> wrote:
>> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes,
>> 12k in data across three devices, and 8k of parity across the other
>> two devices.
>
> With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> stripe, not 20k. If you modify one block it needs to read all 1.5M,
> or it needs to read at least the old chunk on the single drive to be
> modified and both old parity chunks (which on such a small array is 3
> disks either way).

I'll admit to not fully understanding chunks/stripes/strides in terms
of actual size, tho I believe you're correct: it's well over the
filesystem block size, and a half-meg is probably right. However, the
original post went with a 4k blocksize, which is pretty standard since
that's the usual memory page size and thus makes for a convenient
filesystem blocksize too, so that's what I was using as a base for my
numbers. If it's 4k blocksize, then a 5-device raid6 stripe would be
3*4k=12k of data, plus 2*4k=8k of parity.
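
For anyone who wants to follow the arithmetic, here's a minimal sketch
(Python), using the 4k-block and 512K-chunk figures from above; the
stripe_split() helper is purely illustrative:

    # A raid5/6 stripe spans all N devices: N-1 (raid5) or N-2 (raid6)
    # chunks of data plus the parity chunks.
    def stripe_split(devices, chunk_bytes, parity_devices=2):
        data = (devices - parity_devices) * chunk_bytes
        parity = parity_devices * chunk_bytes
        return data, parity

    # 5-device raid6 with 4k (block-sized) chunks: 12k data + 8k parity.
    print(stripe_split(5, 4 * 1024))      # (12288, 8192)

    # 5-device raid6 with the 512K chunks Rich mentions: 1.5M of data
    # per stripe, plus 1M of parity.
    print(stripe_split(5, 512 * 1024))    # (1572864, 1048576)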

>> Fourth, back to the parity. Remember, raid5/6 has all that parity
>> that it writes out (but basically never reads in normal mode, only
>> when degraded, in order to reconstruct the data from the missing
>> device(s)), but doesn't actually use it for integrity checking.
>
> I wasn't aware of this - I can't believe it isn't even an option
> either. Note to self - start doing weekly scrubs...

Indeed. That's one of the things that frustrated me with mdraid -- all
that data integrity metadata there, but just going to waste in normal
operation, only used for device recovery.
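
(If you do go the weekly-scrub route, the check can be kicked off
through sysfs; a minimal sketch, Python again, assuming the array is
md0, run as root from a weekly cron job or timer:

    # Trigger an md consistency check ("scrub") on /dev/md0.
    # Mismatches found are counted in /sys/block/md0/md/mismatch_cnt.
    with open("/sys/block/md0/md/sync_action", "w") as action:
        action.write("check")

Of course a check only counts mismatches; it can't tell you which side
is wrong, which is exactly the frustration above.)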
35 |
|
36 |
Which itself can be a problem as well, because if there *IS* an |
37 |
undetected cosmic-ray-error or whatever and a device goes out, that means |
38 |
you'll lose integrity on a second device in the rebuild as well (if it |
39 |
was a data device that dropped out and not parity anyway), because the |
40 |
parity's screwed against the undetected error and will thus rebuild a bad |
41 |
copy of the data on the replacement device. |

And it's one of the things which so attracted me to btrfs, too, and why
I was so frustrated to see it could only do single redundancy (two-way
mirroring), with no way to do more. The btrfs sales pitch talks up its
data integrity and its ability to go find a good copy when the data's
bad, but what if the only allowed second copy is bad as well? OOPS!

But as I said, N-way mirroring is on the btrfs roadmap, it's simply not
there yet.

>> The single down side to raid1 as opposed to raid5/6 is the loss of
>> the extra space made available by the data striping,
>> 3*single-device-space in the case of 5-way raid6 (or 4-way raid5)
>> vs. 1*single-device-space in the case of raid1. Otherwise, no
>> contest, hands down, raid1 over raid6.
>
> This is a HUGE downside. The only downside to raid1 over not having
> raid at all is that your disk space cost doubles. raid5/6 is
> considerably cheaper in that regard. In a 5-disk raid5 the cost of
> redundancy is only 25% more, vs a 100% additional cost for raid1. To
> accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure,
> read performance would be vastly superior, but if you're going to
> spend $300 more on hard drives and whatever it takes to get so many
> SATA ports on your system you could instead add an extra 32GB of RAM
> or put your OS on a mirrored SSD. I suspect that both of those
> options on a typical workload are going to make a far bigger
> improvement in performance.
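
(His space arithmetic checks out, BTW; a quick sketch of the
usable-capacity math, assuming equal-size disks and taking "raid1" to
mean two-way mirroring:

    # Usable capacity, in disks' worth, for the layouts discussed here.
    def usable(disks, layout):
        if layout == "raid1":   # two copies of everything
            return disks / 2.0
        if layout == "raid5":   # one disk's worth of parity
            return disks - 1
        if layout == "raid6":   # two disks' worth of parity
            return disks - 2

    print(usable(5, "raid5"))   # 4 data disks + 1 parity: 25% overhead
    print(usable(8, "raid1"))   # 4.0: it takes 8 disks to match that raid5

That's the 25%-vs-100% overhead he's talking about.)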

I'd suggest that with the exception of large database servers where the
object is to be able to cache the entire db in RAM, the SSDs are likely
a better option.

FWIW, my general "gentoo/amd64 user" rule of thumb is 1-2 gig base,
plus 1-2 gig per core. Certainly that scale can slide either way, and
it'd probably slide down for folks not doing system rebuilds in tmpfs,
as gentooers often do, but up or down, unless you put that ram in a
battery-backed ramdisk, 32-gig is a LOT of ram, even for an 8-core.

FWIW, with my old dual-dual-core (so four cores), 8 gig RAM was nicely
roomy, tho I /did/ sometimes bump the top and thus end up either
dumping cache or swapping. When I upgraded to the 6-core, I used that
rule of thumb and figured ~12 gig, but due to the power-of-twos
efficiency rule, I ended up with 16 gig, figuring that was more than
I'd use in practice, but better than limiting it to 8 gig.

I was right. The 16 gig is certainly nice, but in reality I'm typically
wasting several gigs of it entirely, with not even cache filling it up.
I typically run ~1 gig in application memory and several gigs in cache,
with only a few tens of MB in buffers. But while I'll often exceed my
old capacity of 8 gig, it's seldom by much, and 12 gig would handle
everything including cache without dumping at well over the 90th
percentile, probably 97% or thereabouts. Even with parallel make at
both the ebuild and global portage level and with PORTAGE_TMPDIR in
tmpfs, I hit 100% on the cores well before I run out of RAM and start
dumping cache or swapping. The only time that has NOT been the case is
when I deliberately saturate, say a kernel build with an open-ended -j
so it stacks up several hundred jobs at once.

Meanwhile, the paired SSDs in btrfs raid1 make a HUGE practical
difference, especially in things like the (cold-cache) portage tree
(and overlays) sync, kernel git pull, etc. (In my case actual booting
didn't get a huge boost, as I run ntp-client and ntpd at boot, and the
ntp-client time sync takes ~12 seconds, more than the rest of the boot
put together. But cold-cache loading of kde happens faster now -- I
actually uninstalled ksplash and just go text-console login, to X black
screen, to kde/plasma desktop now. Still, the tree sync and kernel pull
are the places I appreciate the SSDs most.)

And notably, because the cold-cache system is so much faster with the
SSDs, I tend to actually shut down instead of suspending now, so I tend
to cache even less and thus use less memory with the SSDs than before.
I /could/ probably do 8-gig RAM now instead of 16, and not miss it.
Even a gig per core, 6-gig, wouldn't be terrible, tho below that would
start to bottleneck and pinch a bit again, I suspect.

> Which is better really depends on your workload. In my case much of
> my raid space is used by mythtv, or for storage of stuff I only
> occasionally use. In these use cases the performance of the raid5 is
> more than adequate, and I'd rather be able to keep shows around for
> an extra 6 months in HD than have the DVR respond a millisecond
> faster when I hit play. If you really have sustained random access
> of the bulk of your data then a raid1 would make much more sense.

Definitely. For mythTV or similar massive media needs, raid5 will be
fast enough. And I suspect just the single device-loss tolerance is a
reasonable risk tradeoff for you too: after all it /is/ just media, so
being able to tolerate loss of one device is good, while the risk of
losing a second before a full rebuild onto a replacement completes is
acceptable, given the cost vs. size tradeoff with the massive space
requirements of video.

But again, the OP seemed to find his speed benchmarks disappointing, to
say the least, and I believe pointing out raid6 as the culprit is
accurate. And given that his use case is stock-trading VMs needing
production-grade reliability, I'm guessing raid5/6 really isn't the
ideal match. Massive media, yes, definitely. Massive VMs, not so much.

>> So several points on btrfs:
>>
>> 1) It's still in heavy development.
>
> That is what is keeping me away. I won't touch it until I can use it
> with raid5, and the first commit containing that hit the kernel weeks
> ago I think (and it has known gaps). Until it is stable I'm sticking
> with my current setup.

Question: Would you use it for raid1 yet, as I'm doing? What about as a
single-device filesystem? Do you consider my estimates of reliability
in those cases (almost but not quite stable for single-device; kind of
in the middle for raid1/raid0/raid10, say a year behind single-device;
and raid5/6/50/60 about a year behind that) reasonably accurate?

Because if you're waiting until btrfs raid5 is fully stable, that's
likely to be some wait yet -- I'd say a year, likely more, given that
everything btrfs has seemed to take longer than people expected. But if
you're simply waiting until it matures to the point that, say, btrfs
raid1 is at now, or maybe even a bit less, but certainly to where it's
complete plus say a kernel release to work out a few more wrinkles,
then that's quite possible by year-end.

>> 2) RAID levels work QUITE a bit differently on btrfs. In particular,
>> what btrfs calls raid1 mode (with the same applying to raid10) is
>> simply two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's
>> no multi-way mirroring yet available.
>
> Odd, for some reason I thought it let you specify arbitrary numbers
> of copies, but looking around I think you're right. It does store two
> copies of metadata regardless of the number of drives unless you
> override this.

Default is single-copy data and dual-copy metadata, regardless of the
number of devices (a single device does DUP metadata, two copies on the
same device, by default). The exception is SSDs, where the metadata
default is single, since many SSD firmwares dedup identical-copy data
anyway -- sandforce firmware, with its compression features, is known
to do this, tho mine doesn't (I forget the firmware brand ATM, but
they're Corsair Neutron SSDs aimed at the server/workstation market,
where unpredictability isn't considered a feature, and stable
performance regardless of the data they're fed is one of their selling
points). At least that's the explanation given for the SSD exception.

But the real gotcha is that there's no way to set up N-way (N>2)
redundancy on btrfs raid1/10, and I know for a fact that catches some
admins by nasty surprise, as I've seen it come up on the btrfs list as
well as had my own personal disappointment with it, tho luckily I did
my research and figured that out before I actually installed on btrfs.
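
(To put numbers on it, here's a rough sketch, assuming equal-size
devices: with btrfs raid1 you always get exactly two copies of each
chunk, so adding devices adds space but never adds redundancy.

    # btrfs "raid1" keeps exactly two copies of every chunk regardless
    # of device count, so usable space is roughly half the raw total
    # and only a single device loss is guaranteed survivable.
    def btrfs_raid1(devices, tb_each):
        usable_tb = devices * tb_each / 2.0   # two copies of everything
        guaranteed_losses = 1                 # a 2nd loss may hit both copies
        return usable_tb, guaranteed_losses

    print(btrfs_raid1(2, 2.0))   # (2.0, 1)
    print(btrfs_raid1(6, 2.0))   # (6.0, 1) -- more space, still 2 copies

That's exactly the surprise that catches people.)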

I just wish they'd called it 2-way-mirroring instead of raid1, as that
wouldn't be the deception in labeling that I consider the btrfs raid1
moniker at this point, and admins would be far less likely to be caught
unaware when a second device goes haywire that they /thought/ they'd be
covered for. Of course at this point it's all still development anyway,
so no sane admin is going to be lacking backups in any case, but there
are a lot of people flying by the seat of their pants out there who
have NOT done the research, and they show up frequently on the btrfs
list, after it's too late. (Tho certainly there are fewer of them
showing up now than a year ago, when I first investigated btrfs, I
think both due to btrfs maturing quite a bit since then and to a lot of
the original btrfs hype dying down, which is a good thing considering
the number of folks that were installing it, only to find out once they
lost data that it was still development.)

> However, if one considered raid1 expensive, having multiple layers of
> redundancy is REALLY expensive if you aren't using Reed Solomon and
> many data disks.

Well, depending on the use case. In your media case, certainly.
However, that's one of the few cases that still gobbles storage space
as fast as the manufacturers up their capacities, and that is likely to
continue to do so for at least a few more years, given that HD is still
coming in, so a lot of the media is still SD, and with quad-HD in the
wings as well, now. But once we hit half-petabyte, I suppose even
quad-HD won't be gobbling the space as fast as they can upgrade it any
more. So a half-decade or so, maybe?

Plus of course the sheer bandwidth requirements for quad-HD are
astounding, so at that point either some serious raid0/x0 striping or
SSDs for the speed will be pretty mandatory anyway, remaining SSD size
limits or no.

> From my standpoint I don't think raid1 is the best use of money in
> most cases, either for performance OR for data security. If you want
> performance the money is probably better spent on other components.
> If you want data security the money is probably better spent on
> offline backups. However, this very much depends on how the disks
> will be used - there are certainly cases where raid1 is your best
> option.

I agree when the use is primarily video media. Other than that, a pair
of 2 TB spinning rust drives tends to still go quite a long way, and
tends to be a pretty good cost/risk tradeoff IMO. Throwing in a third
2-TB drive for three-way raid1 mirroring is often a good idea as well,
where the additional data security is needed, but beyond that, the
cost/benefit balance probably doesn't make a whole lot of sense, agreed.

And offline backups are important too, but with dual 2TB drives, many
people can live with a TB of data and do multiple raid1s, giving
themselves both a logically offline backup and physical device
redundancy. And if that means they do backups to the second raid set on
the same physical devices more reliably than they would with an
external drive that they have to physically look for and/or attach each
time (as turned out to be the case for me), then the pair of 2TB drives
is quite a reasonable investment indeed.

But if you're going for performance, spinning rust raid simply doesn't
cut it at the consumer level any longer. SSD at least the commonly used
data, leaving say the media data on spinning rust for the time being if
the budget doesn't work otherwise, as I've actually done here with my
(much smaller than yours) media collection, figuring it not worth the
cost to put /it/ on SSD just yet.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman