Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Sat, 22 Jun 2013 14:24:11
Message-Id: pan$22d39$31ef3544$39d12fd7$3d33aa30@cox.net
In Reply to: Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? by Mark Knecht
1 Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted:
2
3 > On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@×××.net> wrote:
4 > <SNIP>
5 >
6 > Wonderful post but much too long to carry on a conversation
7 > in-line.
8
9 FWIW... I'd have a hard time doing much of anything else, these days, no
10 matter the size. Otherwise, I'd be likely to forget a point. But I do
11 try to snip or summarize when possible. And I do understand your choice
12 and agree with it for you. It's just not one I'd find workable for me...
13 which is why I'm back to inline, here.
14
15 > As you sound pretty sure of your understanding/history I'll
16 > assume you're right 100% of the time, but only maybe 80% of the post
17 > feels right to me at this time so let's assume I have much to learn and
18 > go from there.
19
20 That's a very nice way of saying "I'll have to verify that before I can
21 fully agree, but we'll go with it for now." I'll have to remember it!
22 =:^)
23
24 > In thinking about this issue this morning I think it's important to
25 > me to get down to basics and verify as much as possible, step-by-step,
26 > so that I don't layer good work on top of bad assumptions.
27
28 Extremely reasonable approach. =:^)
29
30 > Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB
31 > DDR3 + Core i7-980x Extreme 12 core processor
32
33 That's a very impressive base. But as you point out elsewhere, you use
34 it. Multiple VMs running MS should make good use of both the dozen cores and
35 the 24 gig RAM.
36
37 As an aside, it's interesting how well your dozen cores and 24 gigs of RAM fit
38 my basic two-gigs-per-core rule of thumb. Obviously I'd consider that
39 reasonably well balanced RAM/cores-wise.
40
41 > 1 SDD - 120GB SATA3 on it's own controller
42 > 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives
43 > using Intel integrated controllers
44 >
45 > (NOTE: I can possibly go to a 6-drive RAID if I made some changes in the
46 > box but that's for later)
47 >
48 > According to the WD spec
49 > (http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives
50
51 OK, single 120 gig main drive (SSD), 5 half-TB drives for the raid.
52
53 > [...] sustain 113MB/S to the drive. Using hdparm I measure 107MB/S
54 > or higher for all 5 drives [...]
55 > The SDD on it's own PCI Express controller clocks in at about 250MB/S
56 > for reads.
57
58 OK.
59
60 But there's a caveat on the measured "spinning rust" speeds. You're
61 effectively getting "near best case".
62
63 I suppose you're familiar with absolute velocity vs rotational velocity
64 vs distance from center. Think merry-go-round as a kid or crack-the-whip
65 as a teen (or insert your own experience here). The closer to the center
66 you are the slower you go at the same rotational speed (RPM).
67 Conversely, the farther from the center you are, the faster you're
68 actually moving at the same RPM.
69
70 Rotational disk data I/O rates have a similar effect -- data toward the
71 outside edge of the platter (beginning of the disk) is faster to read/
72 write, while data toward the inside edge (center) is slower.
73
74 Based on my own hdparm tests on partitioned drives where I knew the
75 location of the partition, vs. the results for the drive as a whole, the
76 speed reported for rotational drives as a whole, is the speed near the
77 outside edge (beginning of the disk).
78
79 Thus, it'd be rather interesting to partition up one of those drives with
80 a small partition at the beginning and another at the end, and do an
81 hdparm -t of each, as well as of the whole disk. I bet you'd find the
82 one at the end reports rather lower numbers, while the report for the
83 drive as a whole is similar to that of the partition near the beginning
84 of the drive, much faster.
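
Something like the following, as an untested sketch -- I'm assuming here
that one of the WD drives shows up as /dev/sdb and that it's free to
repartition, so adjust to your setup -- after creating a small partition
at the very start of the disk and another at the very end with
fdisk/parted/whatever:

  hdparm -t /dev/sdb     # whole disk: effectively the fast outer edge
  hdparm -t /dev/sdb1    # small partition at the start (outer edge)
  hdparm -t /dev/sdb2    # small partition at the end (inner edge)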
85
86 A good SSD won't have this same sort of variance, since it's SSD and the
87 latency to any of its flash, at least as presented by the firmware which
88 should deal with any variance as it distributes wear, should be similar.
89 (Cheap SSDs and standard USB thumbdrive flash storage work differently,
90 however. Often they assume FAT and have a small amount of fast and
91 resilient but expensive SLC flash at the beginning, where the FAT would
92 be, with the rest of the device much slower and less resilient to rewrite
93 but far cheaper MLC. I was just reading about this recently as I
94 researched my own SSDs.)
95
96 > TESTING: I'm using dd to test. It gives an easy to read anyway result
97 > and seems to be used a lot. I can use bonnie++ or IOzone later but I
98 > don't think that's necessary quite yet.
99
100 Agreed.
101
102 > Being that I have 24GB and don't
103 > want cached data to effect the test speeds I do the following:
104 >
105 > 1) Using dd I created a 50GB file for copying using the following
106 > commands:
107 >
108 > cd /mnt/fastVM
109 > dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]
110
111 It'd be interesting to see what the reported speed is here... See below
112 for more.
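
For instance, something like this (sizes and the file name simply mirror
your example, and urandom per the note further down) would let dd print
an actual rate for creating it, tho if urandom itself can't keep up, the
number will reflect that limit rather than the SSD:

  dd if=/dev/urandom of=/mnt/fastVM/random1 bs=1M count=50000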
113
114 > 2) To ensure that nothing is cached and the copies are (hopefully)
115 > completely fair as root I do the following between each test:
116 >
117 > sync
> free -h
118 > echo 3 > /proc/sys/vm/drop_caches
119 > free -h
120
121 Good job. =:^)
122
123 > 3) As a first test I copy using dd the 50GB file from the SDD to the
124 > RAID6.
125
126 OK, that answered the question I had about where that file you created
127 actually was -- on the SSD.
128
129 > As long as reading the SDD is much faster than writing the RAID6
130 > then it should be a test of primarily the RAID6 write speed:
131 >
132 > dd if=/mnt/fastVM/random1 of=SDDCopy
133 > 97656250+0 records in 97656250+0 records out
134 > 50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s
135
136 > If I clear cache as above and rerun the test it's always 145-155MB/S
137
138 ... Assuming $PWD is now on the raid. You had the path shown too, which
139 I snipped, but that doesn't tell /me/ (as opposed to you, who should know
140 based on your mounts) anything about whether it's on the raid or not.
141 However, the above including the drop-caches demonstrates enough care
142 that I'm quite confident you'd not make /that/ mistake.
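
(FWIW, a quick way to see which filesystem a file or the current
directory actually lives on is plain df, for instance:

  df -h .
  df -h SDDCopy

which prints the backing device and mountpoint for each argument.)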
143
144 > 4) As a second test I read from the RAID6 and write back to the RAID6.
145 > I see MUCH lower speeds, again repeatable:
146 >
147 > dd if=SDDCopy of=HDDWrite
148 > 97656250+0 records in 97656250+0 records out
149 > 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s
150
151 > 5) As a final test, and just looking for problems if any, I do an SDD to
152 > SDD copy which clocked in at close to 200MB/S
153 >
154 > dd if=random1 of=SDDCopy
155 > 97656250+0 records in 97656250+0 records out
156 > 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s
157
158 > So, being that this RAID6 was grown yesterday from something that
159 > has existed for a year or two I'm not sure of it's fragmentation, or
160 > even how to determine that at this time. However it seems my problem are
161 > RAID6 reads, not RAID6 writes, at least to new an probably never used
162 > disk space.
163
164 Reading all that, one question occurs to me. If you want to test read
165 and write separately, why the intermediate step of dd-ing from /dev/
166 random to ssd, then from ssd to raid or ssd?
167
168 Why not do direct dd if=/dev/random (or urandom, see note below)
169 of=/desired/target ... for write tests, and then (after dropping caches),
170 if=/desired/target of=/dev/null ... for read tests? That way there's
171 just the one block device involved, not both.
172
173 /dev/random note: I presume with that hardware you have one of the newer
174 CPUs with the new Intel hardware random instruction, with the appropriate
175 kernel config hooking it into /dev/random, and/or otherwise have
176 /dev/random hooked up to a hardware random number generator. Otherwise,
177 using that much random data could block until more suitably random data
178 is generated from approved kernel sources. Thus, the following probably
179 doesn't apply to you, but it may well apply to others, and is good
180 practice in any case, unless you KNOW your random isn't going to block
181 due to hardware generation, and even then it's worth noting when
182 you're posting examples like the above.
183
184 In general, for tests such as this where a LOT of random data is needed,
185 but cryptographic-quality random isn't necessarily required, use
186 /dev/urandom. In the event that real-random data gets too low,
187 /dev/urandom will switch to pseudo-random generation, which should be
188 "good enough" for this sort of usage. /dev/random, OTOH, will block
189 until it gets more random data from sources the kernel trusts to be truly
190 random. On some machines with relatively limited sources of randomness
191 the kernel considers truly random, therefore, just grabbing 50 GB of data
192 from /dev/random could take QUITE some time (days maybe? I don't know).
193
194 Obviously you don't have /too/ big a problem with it as you got the data
195 from /dev/random, but it's worth noting. If your machine has a hardware
196 random generator hooked into /dev/random, then /dev/urandom will never
197 switch to pseudo-random in any case, so for tests of anything above
198 /kilobytes/ of random data (and even at that...), just use urandom and
199 you won't have to worry about it either way. OTOH, if you're generating
200 an SSH key or something, always use /dev/random as that needs
201 cryptographic security level randomness, but that'll take just a few
202 bytes of randomness, not kilobytes let alone gigabytes, and if your
203 hardware doesn't have good randomness and it does block, wiggling your
204 mouse around a bit (this obviously assumes a local command; remote could
205 require something other than a mouse) should give it enough
206 randomness to continue.
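
(If you want to check where a box stands before pulling gigabytes from
/dev/random, the kernel exposes its current entropy estimate, and, if a
hardware RNG driver is actually bound, which one it is:

  cat /proc/sys/kernel/random/entropy_avail
  cat /sys/class/misc/hw_random/rng_current

The second path only exists when a hardware RNG driver is actually
bound, so don't be surprised if it's missing.)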
207
208
209 Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as
210 sink, with only the test-target block device as a real block device,
211 should give you "purer" read-only and write-only tests. In theory it
212 shouldn't matter much given your method of testing, but as we all know,
213 theory and reality aren't always well aligned.
214
215
216 Of course the next question follows on from the above. I see a write to
217 the raid, and a copy from the raid to the raid, so read/write on the
218 raid, and a copy from the ssd to the ssd, read/write on it, but no test
219 of reading from the raid by itself.
220
221 So
222
223 if=/dev/urandom of=/mnt/raid/target ... should give you raid write.
224
225 drop-caches
226
227 if=/mnt/raid/target of=/dev/null ... should give you raid read.
228
229 *THEN* we have good numbers on both to compare the raid read/write to.
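
Spelled out a bit more concretely -- the file name and sizes here are
just for illustration, and I'm assuming the raid is mounted at
/mnt/raid:

  dd if=/dev/urandom of=/mnt/raid/ddtest bs=1M count=50000   # raid write
  sync
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/raid/ddtest of=/dev/null bs=1M                  # raid read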
230
231 What I suspect you'll find, unless fragmentation IS your problem, is that
232 both read (from the raid) alone and write (to the raid) alone should be
233 much faster than read/write (from/to the raid).
234
235 The problem with read/write is that you're on "rotating rust" hardware
236 and there's some latency as it repositions the heads from the read
237 location to the write location and back.
238
239 If I'm correct and that's what you find, a workaround specific to dd
240 would be to specify a much larger block size, so it reads in far more
241 data at once, then writes it out at once, with far fewer switches between
242 modes. In the above you didn't specify bs (or the separate input/output
243 equivalents, ibs/obs respectively) at all, so it's using 512-byte
244 blocksize defaults.
245
246 From what I know of hardware, 64KB is a standard read-ahead, so in theory
247 you should see improvements using larger block sizes up to at LEAST that
248 size, and on a 5-disk raid6, probably 3X that, 192KB, which should in
249 theory do a full 64KB buffer on each of the three data drives of the 5-
250 way raid6 (the other two being parity).
251
252 I'm guessing you'll see a "knee" at the 192 KB (that's 1024-based, 2^10,
253 not 10^3, BTW) block size, and above that you might see improvement, but
254 not near as much, since the hardware should be doing full 64KB blocks
255 which it's optimized to. There's likely to be another knee at the 16MB
256 point (again, power of two, not 10), or more accurately, the 48MB point
257 (3*16MB), since that's the size of the device hardware buffers (again,
258 three devices worth of data-stripe, since the other two are parity,
259 3*16MB=48MB). Above that, theory says you'll see even less improvement,
260 since the caches will be full and any improvement still seen should be
261 purely that of fewer switches between read/write mode and thus fewer seeks.
262
263 But it'd be interesting to see how closely theory matches reality,
264 there's very possibly a fly in that theoretical ointment somewhere. =:^\
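
An easy way to check would be a simple sweep over block sizes, reusing
the file from your raid-to-raid test. Untested sketch, assuming the
working directory is still on the raid as in your step 4 (and you could
add a count= to keep each run down to a manageable size):

  for bs in 512 64K 192K 1M 16M 48M; do
      sync
      echo 3 > /proc/sys/vm/drop_caches
      echo "bs=$bs"
      dd if=SDDCopy of=bstest bs=$bs   # raid-to-raid copy, as in step 4
  done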
265
266 Of course configurable block size is specific to dd. Real life file
267 transfers may well be quite a different story. That's where the chunk
268 size, stripe size, etc, stuff comes in, setting the defaults for the
269 kernel for that device, and again, I'll freely admit to not knowing as
270 much as I could in that area.
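
(For reference, the chunk size the array was created with, and the
readahead the kernel is currently using for it, can be checked with
something like the following, assuming the array is /dev/md0:

  mdadm --detail /dev/md0 | grep -i chunk
  blockdev --getra /dev/md0

blockdev --getra reports the readahead in 512-byte sectors.)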
271
272 > I will also report more later but I can state that just using top
273 > there's never much CPU usage doing this but a LOT of WAIT time when
274 > reading the RAID6. It really appears the system is spinning it's wheels
275 > waiting for the RAID to get data from the disk.
276
277 When you're dealing with spinning rust, any time you have a transfer of
278 any size (certainly GB), you WILL see high wait times. Disks are simply
279 SLOW. Even SSDs are no match for system memory, tho they're enough closer
280 to help a lot and can be close enough that the bottleneck is elsewhere.
281 (Modern SSDs saturate the SATA-600 links with thruput above 500 MByte/
282 sec, making the SATA-600 bus the bottleneck, or the 1x PCI-E 2.x link if
283 that's what it's running on, since they saturate at 485MByte/sec or so,
284 tho PCI-E 3.x is double that so nearly a GByte/sec and a single SATA-600
285 won't saturate that. Modern DDR3 SDRAM by comparison runs 10+ GByte/sec
286 LOW end, two orders of magnitude faster. Numbers fresh from wikipedia,
287 BTW.)
288
289 > One place where I wanted to double check your thinking. My thought
290 > is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as it
291 > has to read from three drives and make sure they are all good before
292 > returning data to the user. I don't see how that could ever be faster
293 > than what a single drive file system could do which for these drives
294 > would be the 113MB/S WD spec number, correct? As I'm currently getting
295 > 145MB/S it appears on the surface that the RAID6 is providing some
296 > value, at least in these early days of use. Maybe it will degrade over
297 > time though.
298
299 As someone else already posted, that's NOT correct. Neither raid1 nor
300 raid6, at least the mdraid implementations, verify the data. Raid1
301 doesn't have parity at all, just many copies, and raid6 has parity but
302 only uses it for rebuilds, NOT to check data integrity under normal usage
303 -- it too simply reads the data and returns it.
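
(The parity can still be verified explicitly, it's just never done on
the normal read path. mdraid calls that a "check", triggered through
sysfs -- assuming the array is md0:

  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt

Many distros schedule exactly that from cron periodically.)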
304
305 What raid1 does (when it's getting short reads only one at a time) is
306 send the request to every spindle. The first one that returns the data
307 wins; the others simply get their returns thrown away.
308
309 So under small-one-at-a-time reading conditions, the speed of raid1 reads
310 should be the speed of the fastest disk in the bunch.
311
312 The raid1 read advantage is in the fact that there's often more than one
313 read going on at once, or that the read is big enough to split up, so
314 different spindles can be seeking to and reading different parts of the
315 request in parallel. (This also helps in fragmented file conditions as
316 long as fragmentation isn't overwhelming, since a raid1 can then send
317 different spindle heads to read the different segments in parallel,
318 instead of reading one at a time serially, as it would have to do in a
319 single spindle case.)
320
321 In theory, the stripes of raid6 /can/ lead to better thruput for reads.
322 In fact, my experience both with raid6 and with raid0 demonstrates that
323 not to be the case as often as one might expect, due either to small
324 reads or due to fragmentation breaking up the big reads thus negating the
325 theoretical thruput advantage of multiple stripes.
326
327 To be fair, my raid0 experience was as I mentioned earlier, with files I
328 could easily redownload from the net, mostly the portage tree and
329 overlays, along with the kernel git tree. Due to the frequency of update
330 and the fast rate of change as well as the small files, fragmentation was
331 quite a problem, and the files were small enough I likely wouldn't have
332 seen the full benefit of the 4-way raid0 stripes in any case, so that
333 wasn't a best-case test scenario. But it's what one practically puts on
334 raid0, because it IS easily redownloaded from the net, so it DOESN'T
335 matter that a loss of any of the raid0 component devices will kill the
336 entire thing.
337
338 If I'd have been using the raid0 for much bigger media files, mp3s or
339 video of megabytes in size minimum, that get saved and never changed so
340 there's little fragmentation, I expect my raid0 experience would have
341 been *FAR* better. But at the same time, that's not the type of data
342 that it generally makes SENSE to store on a raid0 without backups or
343 redundancy of any sort, unless it's simply VDR files you don't
344 particularly care about losing if a device drops from the raid (which
345 would make a GREAT raid0 candidate), so...
346
347 Raid6 is the stripes of raid0, plus two-way-parity. So since the parity
348 is ignored for reads, for them it's effectively raid0 with two fewer
349 stripes than the number of devices. Thus your 5-device raid6 is
350 effectively a 3-device raid0 in terms of reads. In theory, thruput for
351 large reads done by themselves should be pretty good -- three times that
352 of a single device. In fact... due either to multiple jobs happening at
353 once, or to a mix of read/write happening at once, or to fragmentation, I
354 was disappointed, and far happier with raid1.
355
356 But your situation is indeed rather different than mine, and depending on
357 how much writing happens in those big VM files and how the filesystem you
358 choose handles fragmentation, you could be rather happier with raid6 than
359 I was.
360
361 But I'd still suggest you try raid1 if the amount of data you're handling
362 will let you. Honestly, it surprised me how well raid1 did for me. I
363 wasn't prepared for that at all, and I believe the comparison to what
364 I was getting on raid6 is what colored my opinion of raid6 so badly. I
365 had NO IDEA there would be that much difference! But your experience may
366 indeed be different. The only way to know is to try it.
367
368 However, one thing I either overlooked or that hasn't been posted yet is
369 just how much data you're talking about. You're running five 500-gig
370 drives in raid6 now, which should give you 3*500=1500 gigs (10-power)
371 capacity.
372
373 If it's under a third full, 500 GB (10-power), you can go raid1 with as
374 many mirrors as you like of the five, and keep the rest of them for hot-
375 spares or whatever.
376
377 If you're running (or plan to be running) near capacity, over 2/3 full, 1
378 TB (10-power), you really don't have much option but raid6.
379
380 If you're in between, 1/3 to 2/3 full, 500-1000 GB (10-power), then a
381 raid10 is possible, perhaps 4-spindle with the 5th as a hot-spare.
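
For illustration only, mdadm can set that sort of thing up in one shot,
hot-spare included. Device names here are pure assumption and --create
is destructive, so this is strictly for after everything's backed up and
the old array retired:

  mdadm --create /dev/md1 --level=10 --raid-devices=4 \
        --spare-devices=1 /dev/sd[b-f]1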
382
383 (A spindle configured as a hot-spare is kept unused but ready for use by
384 mdadm and the kernel. If a spindle should drop out, the hot-spare is
385 automatically inserted in its place and a rebuild immediately started.
386 This narrows the danger zone during which you're degraded and at risk if
387 further spindles drop out, because handling is automatic so you're back
388 to full un-degraded as soon as possible. However, it doesn't eliminate
389 that danger zone should another one drop out during the rebuild, which is
390 after all quite stressful on the remaining drives due to all that
391 reading going on, so the risk is greater during a rebuild than under
392 normal operation.)
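
Adding a spare to an already-healthy array, by the way, is just an
--add (device name again purely an assumption); mdadm keeps it as a
spare until it's needed:

  mdadm /dev/md0 --add /dev/sdf1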
393
394 So if you're over 2/3 full, or expect to be in short order, there's
395 little sense in further debate on at least /your/ raid6, as that's pretty
396 much what you're stuck with. (Unless you can categorize some data as
397 more important than other, and raid it, while the other can be considered
398 worth the risk of loss if the device goes, in which case we're back in
399 play with other options once again.)
400
401 --
402 Duncan - List replies preferred. No HTML msgs.
403 "Every nonfree program has a lord, a master --
404 and if you use the program, he is your master." Richard Stallman
