Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Fri, 21 Jun 2013 14:28:08
Message-Id: pan$b657d$d0d9cd1a$724ca571$a36894cc@cox.net
In Reply to: Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? by Rich Freeman
Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted:

> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@×××.net> wrote:
>> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
>> in data across three devices, and 8k of parity across the other two
>> devices.
>
> With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> stripe, not 20k. If you modify one block it needs to read all 1.5M, or
> it needs to read at least the old chunk on the single drive to be
> modified and both old parity chunks (which on such a small array is 3
> disks either way).

I'll admit to not fully understanding chunks/stripes/strides in terms of
actual size, tho I believe you're correct: it's well over the filesystem
block size, and a half-meg is probably right. However, the original post
went with a 4k blocksize, which is pretty standard since that's the usual
memory page size and thus makes for a convenient filesystem blocksize
too, so that's what I was using as a base for my numbers. If it's a 4k
blocksize, then a 5-device raid6 stripe would be 3*4k=12k of data, plus
2*4k=8k of parity.
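
To put quick numbers on that, here's a rough sketch in python, with the
device count and chunk size as the only inputs (just an illustration of
the arithmetic, nothing mdadm-specific):

    # An N-device raid6 stripe: N-2 chunks of data, 2 chunks of parity.
    def raid6_stripe(devices, chunk_bytes):
        data = (devices - 2) * chunk_bytes
        parity = 2 * chunk_bytes
        return data, parity

    print(raid6_stripe(5, 4 * 1024))    # (12288, 8192): 12k data + 8k parity
    print(raid6_stripe(5, 512 * 1024))  # (1572864, 1048576): 1.5M data + 1M parity

Same layout either way; the difference above is just whether the "chunk"
is taken to be a 4k filesystem block or mdadm's 512K chunk.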

>> Fourth, back to the parity. Remember, raid5/6 has all that parity that
>> it writes out (but basically never reads in normal mode, only when
>> degraded, in order to reconstruct the data from the missing device(s)),
>> but doesn't actually use it for integrity checking.
>
> I wasn't aware of this - I can't believe it isn't even an option either.
> Note to self - start doing weekly scrubs...

Indeed. That's one of the things that frustrated me with mdraid -- all
that parity data sitting there, but just going to waste in normal
operation, only used for device recovery.
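
(For anyone following along who wants to do those scrubs, the kernel's md
sysfs interface is one way to kick one off. A minimal python sketch,
assuming the array is md0 and you're running as root:

    # Start a "check" pass: read every member and compare data against parity.
    with open("/sys/block/md0/md/sync_action", "w") as action:
        action.write("check\n")

    # Once sync_action reports "idle" again, mismatch_cnt shows what was found.
    with open("/sys/block/md0/md/mismatch_cnt") as counts:
        print("mismatch_cnt:", counts.read().strip())

Some distros ship a cron job that does essentially this on a schedule.)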

Which can itself be a problem, because if there *IS* an undetected
cosmic-ray error or whatever and a device then goes out, you'll lose
integrity on the rebuilt device as well (if it was a data device that
dropped out and not parity, anyway), because the parity's screwed by the
undetected error and will thus rebuild a bad copy of the data onto the
replacement device.

And it's one of the things which so attracted me to btrfs, too, and why I
was so frustrated to see it could only do single redundancy (two-way
mirroring), with no way to do more. The btrfs sales pitch talks about how
great the data integrity is and the ability to go find a good copy when
the data's bad, but what if the only allowed second copy is bad as well?
OOPS!

But as I said, N-way mirroring is on the btrfs roadmap; it's simply not
there yet.

>> The single down side to raid1 as opposed to raid5/6 is the loss of the
>> extra space made available by the data striping, 3*single-device-space
>> in the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space
>> in the case of raid1. Otherwise, no contest, hands down, raid1 over
>> raid6.
>
> This is a HUGE downside. The only downside to raid1 over not having
> raid at all is that your disk space cost doubles. raid5/6 is
> considerably cheaper in that regard. In a 5-disk raid5 the cost of
> redundancy is only 25% more, vs a 100% additional cost for raid1. To
> accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure,
> read performance would be vastly superior, but if you're going to spend
> $300 more on hard drives and whatever it takes to get so many SATA ports
> on your system you could instead add an extra 32GB of RAM or put your OS
> on a mirrored SSD. I suspect that both of those options on a typical
> workload are going to make a far bigger improvement in performance.

I'd suggest that with the exception of large database servers where the
object is to be able to cache the entire db in RAM, the SSDs are likely a
better option.
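
(The space math behind Rich's comparison, as a quick python sketch,
counting raid1 as two-way mirrored pairs the way he is, and measuring in
whole drives' worth of space:

    # Usable capacity in drives' worth of space, ignoring fs overhead.
    def usable(drives, level):
        return {"raid1": drives // 2,   # two-way mirror pairs
                "raid5": drives - 1,    # one drive's worth of parity
                "raid6": drives - 2}[level]

    print(usable(5, "raid5"))   # 4: one extra drive buys redundancy for four
    print(usable(8, "raid1"))   # 4: matching that space with raid1 takes eight

So yes, the raid1 space penalty is real.)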

FWIW, my general "gentoo/amd64 user" rule of thumb is 1-2 gig base, plus
1-2 gig per core. Certainly that scale can slide either way, and it'd
probably slide down for folks not doing system rebuilds in tmpfs as
gentooers often do, but up or down, unless you put that RAM in a battery-
backed ramdisk, 32 gig is a LOT of RAM, even for an 8-core.
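
(Or, the same rule of thumb as a few lines of python, if that's clearer:

    # 1-2 gig base plus 1-2 gig per core, returned as a (low, high) range.
    def ram_rule(cores):
        return 1 + cores, 2 + 2 * cores

    print(ram_rule(6))   # (7, 14): ~12 gig fits the range, 16 is the next power of two
    print(ram_rule(8))   # (9, 18): still well short of 32 gig

Which is why 32 gig strikes me as overkill outside of special cases.)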

FWIW, with my old dual-dual-core (so four cores), 8 gig RAM was nicely
roomy, tho I /did/ sometimes bump the top and thus end up either dumping
cache or swapping. When I upgraded to the 6-core, I used that rule of
thumb and figured ~12 gig, but due to the power-of-two efficiency rule, I
ended up with 16 gig, figuring that was more than I'd use in practice,
but better than limiting it to 8 gig.

I was right. The 16 gig is certainly nice, but in reality I'm typically
wasting several gigs of it entirely, with not even cache filling it up. I
typically run ~1 gig in application memory and several gigs in cache,
with only a few tens of MB in buffers. But while I'll often exceed my old
capacity of 8 gig, it's seldom by much, and 12 gig would handle
everything including cache without dumping at well over the 90th
percentile, probably 97% or thereabouts. Even with parallel make at both
the ebuild and global portage level and with PORTAGE_TMPDIR in tmpfs, I
hit 100% on the cores well before I run out of RAM and start dumping
cache or swapping. The only time that has NOT been the case is when I
deliberately saturate it, say with a kernel build given an open-ended -j
so it stacks up several hundred jobs at once.

Meanwhile, the paired SSDs in btrfs raid1 make a HUGE practical
difference, especially in things like the (cold-cache) portage tree (and
overlays) sync, kernel git pull, etc. (In my case actual booting didn't
get a huge boost, as I run ntp-client and ntpd at boot, and the
ntp-client time sync takes ~12 seconds, more than the rest of the boot
put together. But cold-cache loading of kde happens faster now -- I
actually uninstalled ksplash and just go text-console login to
X-black-screen to kde/plasma desktop now. But the tree sync and kernel
pull are still the places I appreciate the SSDs most.)

And notably, because the cold-cache system is so much faster with the
SSDs, I tend to actually shut down instead of suspending now, so I tend
to cache even less and thus use less memory with the SSDs than before.
I /could/ probably do 8 gig of RAM now instead of 16, and not miss it.
Even a gig per core, 6 gig, wouldn't be terrible, tho below that it would
start to bottleneck and pinch a bit again, I suspect.

> Which is better really depends on your workload. In my case much of my
> raid space is used by mythtv, or for storage of stuff I only
> occasionally use. In these use cases the performance of the raid5 is
> more than adequate, and I'd rather be able to keep shows around for an
> extra 6 months in HD than have the DVR respond a millisecond faster when
> I hit play. If you really have sustained random access of the bulk of
> your data then a raid1 would make much more sense.

Definitely. For mythTV or similar massive media needs, raid5 will be fast
enough. And I suspect just the single-device-loss tolerance is a
reasonable risk tradeoff for you too, since after all it /is/ just media,
so tolerating the loss of a single device is good, while the risk of
losing two before a full rebuild onto a replacement completes is
acceptable, given the cost vs. size tradeoff with the massive size
requirements of video.

But again, the OP seemed to find his speed benchmarks disappointing, to
say the least, and I believe pointing to raid6 as the culprit is
accurate. Which, given his use of production-reliability stock-trading
VMs, tells me raid5/6 really isn't the ideal match. Massive media, yes,
definitely. Massive VMs, not so much.

>> So several points on btrfs:
>>
>> 1) It's still in heavy development.
>
> That is what is keeping me away. I won't touch it until I can use it
> with raid5, and the first commit containing that hit the kernel weeks
> ago I think (and it has known gaps). Until it is stable I'm sticking
> with my current setup.

Question: Would you use it for raid1 yet, as I'm doing? What about as a
single-device filesystem? Do you consider my estimates of reliability in
those cases (almost but not quite stable for single-device, kind of in
the middle for raid1/raid0/raid10, say a year behind single-device, and
raid5/6/50/60 about a year behind that) reasonably accurate?

Because if you're waiting until btrfs raid5 is fully stable, that's
likely to be some wait yet -- I'd say a year, likely more, given that
everything btrfs has seemed to take longer than people expected. But if
you're simply waiting until it matures to the point that, say, btrfs
raid1 is at now, or maybe even a bit less, but certainly to where it's
feature-complete plus, say, a kernel release to work out a few more
wrinkles, then that's quite possible by year-end.

>> 2) RAID levels work QUITE a bit differently on btrfs. In particular,
>> what btrfs calls raid1 mode (with the same applying to raid10) is
>> simply two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no
>> multi-way mirroring yet available.
>
> Odd, for some reason I thought it let you specify arbitrary numbers of
> copies, but looking around I think you're right. It does store two
> copies of metadata regardless of the number of drives unless you
> override this.

The default is single-copy data, dual-copy metadata, regardless of the
number of devices (a single device does DUP metadata, two copies on the
same device, by default), with the exception of SSDs, where the metadata
default is single, since many SSD firmwares dedup identical data anyway.
(Sandforce firmware, with its compression features, is known to do this,
tho mine doesn't -- I forget the firmware brand ATM, but they're Corsair
Neutron SSDs aimed at the server/workstation market, where
unpredictability isn't considered a feature, and one of their selling
points is stable performance regardless of the data they're fed.) At
least that's the explanation given for the SSD exception.
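
(If you'd rather spell the profiles out than trust the defaults, you can
set them explicitly at mkfs time and check what you actually got
afterwards. A sketch using python's subprocess, with made-up device names
and mountpoint:

    import subprocess

    # Mirror both data and metadata across two (hypothetical) devices.
    subprocess.check_call(["mkfs.btrfs", "-d", "raid1", "-m", "raid1",
                           "/dev/sdb", "/dev/sdc"])
    subprocess.check_call(["mount", "/dev/sdb", "/mnt/test"])

    # Prints the allocation profiles actually in use, e.g. "Data, RAID1".
    subprocess.check_call(["btrfs", "filesystem", "df", "/mnt/test"])

Leave the -d/-m options off and you get the defaults described above.)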

But the real gotcha is that there's no way to set up N-way (N>2)
redundancy on btrfs raid1/10, and I know for a fact that catches some
admins by nasty surprise, as I've seen it come up on the btrfs list as
well as had my own personal disappointment with it, tho luckily I did my
research and figured that out before I actually installed on btrfs.

I just wish they'd called it 2-way-mirroring instead of raid1, as that
wouldn't be the deceptive labeling I consider the btrfs raid1 moniker to
be at this point, and admins would be far less likely to be caught
unaware when a second device goes haywire, a failure they /thought/
they'd be covered for. Of course at this point it's all still development
anyway, so no sane admin is going to be lacking backups in any case, but
there are a lot of people flying by the seat of their pants out there who
have NOT done the research, and they show up frequently on the btrfs
list, after it's too late. (Tho certainly there are fewer of them showing
up now than a year ago, when I first investigated btrfs, I think both due
to btrfs maturing quite a bit since then and to a lot of the original
btrfs hype dying down, which is a good thing considering the number of
folks that were installing it, only to find out once they lost data that
it was still development.)

> However, if one considered raid1 expensive, having multiple layers of
> redundancy is REALLY expensive if you aren't using Reed Solomon and many
> data disks.

Well, depending on the use case. In your media case, certainly. However,
that's one of the few cases that still gobbles storage space as fast as
the manufacturers up their capacities, and it is likely to continue to do
so for at least a few more years, given that HD is still coming in, so a
lot of the media is still SD, and with quad-HD in the wings as well now.
But once we hit half-petabyte drives, I suppose even quad-HD won't be
gobbling the space as fast as they can upgrade it any more. So a
half-decade or so, maybe?

Plus of course the sheer bandwidth requirements for quad-HD are
astounding, so at that point either some serious raid0/x0 striping or
SSDs for the speed will be pretty much mandatory anyway, SSD size limits
or no SSD size limits.

> From my standpoint I don't think raid1 is the best use of money in most
> cases, either for performance OR for data security. If you want
> performance the money is probably better spent on other components. If
> you want data security the money is probably better spent on offline
> backups. However, this very much depends on how the disks will be used
> - there are certainly cases where raid1 is your best option.

I agree when the use is primarily video media. Other than that, a pair of
2 TB spinning rust drives tends to still go quite a long way, and tends
to be a pretty good cost/risk tradeoff IMO. Throwing in a third 2 TB
drive for three-way raid1 mirroring is often a good idea as well, where
the additional data security is needed, but beyond that, the cost/benefit
balance probably doesn't make a whole lot of sense, agreed.

And offline backups are important too, but with dual 2 TB drives, many
people can live with a TB of data and do multiple raid1s, giving
themselves both a logically offline backup and physical device
redundancy. And if that means they do backups to the second raid set on
the same physical devices more reliably than they would with an external
drive that they have to physically look for and/or attach each time (as
turned out to be the case for me), then the pair of 2 TB drives is quite
a reasonable investment indeed.
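
(The "multiple raid1s on the same pair of drives" idea is simple enough
to sketch. Hypothetical layout, each 2 TB drive split in half, shown here
in btrfs form via python's subprocess, tho mdraid pairs would work just
as well:

    import subprocess

    pairs = {
        "working": ("/dev/sda1", "/dev/sdb1"),  # live data, mirrored across both drives
        "backup":  ("/dev/sda2", "/dev/sdb2"),  # second mirror, only mounted for backups
    }

    for label, devices in pairs.items():
        subprocess.check_call(["mkfs.btrfs", "-L", label,
                               "-d", "raid1", "-m", "raid1"] + list(devices))

Either drive can die and both filesystems survive, while the backup set
stays unmounted, and thus logically offline, between backup runs.)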

But if you're going for performance, spinning rust raid simply doesn't
cut it at the consumer level any longer. SSD at least the commonly used
data, leaving, say, the media data on spinning rust for the time being if
the budget doesn't stretch otherwise, as I've actually done here with my
(much smaller than yours) media collection, figuring it's not worth the
cost to put /it/ on SSD just yet.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
