Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted:

> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@×××.net> wrote:
> <SNIP>
>
> Wonderful post but much too long to carry on a conversation
> in-line.

FWIW... I'd have a hard time doing much of anything else, these days, no
matter the size. Otherwise, I'd be likely to forget a point. But I do
try to snip or summarize when possible. And I do understand your choice
and agree with it for you. It's just not one I'd find workable for me...
which is why I'm back to inline, here.

> As you sound pretty sure of your understanding/history I'll
> assume you're right 100% of the time, but only maybe 80% of the post
> feels right to me at this time so let's assume I have much to learn and
> go from there.

That's a very nice way of saying "I'll have to verify that before I can
fully agree, but we'll go with it for now." I'll have to remember it!
=:^)

> In thinking about this issue this morning I think it's important to
> me to get down to basics and verify as much as possible, step-by-step,
> so that I don't layer good work on top of bad assumptions.

Extremely reasonable approach. =:^)

> Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB
> DDR3 + Core i7-980x Extreme 12 core processor

That's a very impressive base. But as you point out elsewhere, you use
it. Multiple VMs running MS should well use both the dozen cores and
the 24 gig RAM.

As an aside, it's interesting how well your dozen cores, 24 gig RAM, fits
my basic two gigs a core rule of thumb. Obviously I'd consider that
reasonably well balanced RAM/cores-wise.

> 1 SDD - 120GB SATA3 on it's own controller
> 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives
> using Intel integrated controllers
>
> (NOTE: I can possibly go to a 6-drive RAID if I made some changes in the
> box but that's for later)
>
> According to the WD spec
> (http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives

OK, single 120 gig main drive (SSD), 5 half-TB drives for the raid.

> [...] sustain 113MB/S to the drive. Using hdparm I measure 107MB/S
> or higher for all 5 drives [...]
> The SDD on it's own PCI Express controller clocks in at about 250MB/S
> for reads.

OK.

But there's a caveat on the measured "spinning rust" speeds. You're
effectively getting "near best case".

I suppose you're familiar with absolute velocity vs rotational velocity
vs distance from center. Think merry-go-round as a kid or crack-the-whip
as a teen (or insert your own experience here). The closer to the center
you are the slower you go at the same rotational speed (RPM).
Conversely, the farther from the center you are, the faster you're
actually moving at the same RPM.

Rotational disk data I/O rates have a similar effect -- data toward the
outside edge of the platter (beginning of the disk) is faster to read/
write, while data toward the inside edge (center) is slower.

Based on my own hdparm tests on partitioned drives where I knew the
location of the partition, vs. the results for the drive as a whole, the
speed reported for rotational drives as a whole is the speed near the
outside edge (beginning of the disk).

Thus, it'd be rather interesting to partition up one of those drives with
a small partition at the beginning and another at the end, and do an
hdparm -t of each, as well as of the whole disk. I bet you'd find the
one at the end reports rather lower numbers, while the report for the
drive as a whole is similar to that of the partition near the beginning
of the drive, much faster.
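
Something like this, say (a rough sketch only -- sdX and the partition
numbers are hypothetical, and it assumes you've carved a small partition
at each end of a scratch drive first):

hdparm -t /dev/sdX    # whole drive, typically reports near-outer-edge speed
hdparm -t /dev/sdX1   # small partition at the beginning (outer edge)
hdparm -t /dev/sdX2   # small partition at the end (inner edge)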

A good SSD won't have this same sort of variance, since it's SSD and the
latency to any of its flash, at least as presented by the firmware (which
should deal with any variance as it distributes wear), should be similar.
(Cheap SSDs and standard USB thumbdrive flash storage work differently,
however. Often they assume FAT and have a small amount of fast and
resilient but expensive SLC flash at the beginning, where the FAT would
be, with the rest of the device much slower and less resilient to rewrite
but far cheaper MLC. I was just reading about this recently as I
researched my own SSDs.)

> TESTING: I'm using dd to test. It gives an easy to read anyway result
> and seems to be used a lot. I can use bonnie++ or IOzone later but I
> don't think that's necessary quite yet.

Agreed.

> Being that I have 24GB and don't
> want cached data to effect the test speeds I do the following:
>
> 1) Using dd I created a 50GB file for copying using the following
> commands:
>
> cd /mnt/fastVM
> dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]

It'd be interesting to see what the reported speed is here... See below
for more.

> 2) To ensure that nothing is cached and the copies are (hopefully)
> completely fair as root I do the following between each test:
>
> sync
> free -h
> echo 3 > /proc/sys/vm/drop_caches
> free -h

Good job. =:^)

> 3) As a first test I copy using dd the 50GB file from the SDD to the
> RAID6.

OK, that answered the question I had about where that file you created
actually was -- on the SSD.

> As long as reading the SDD is much faster than writing the RAID6
> then it should be a test of primarily the RAID6 write speed:
>
> dd if=/mnt/fastVM/random1 of=SDDCopy
> 97656250+0 records in 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s

> If I clear cache as above and rerun the test it's always 145-155MB/S

... Assuming $PWD is now on the raid. You had the path shown too, which
I snipped, but that doesn't tell /me/ (as opposed to you, who should know
based on your mounts) anything about whether it's on the raid or not.
However, the above including the drop-caches demonstrates enough care
that I'm quite confident you'd not make /that/ mistake.

> 4) As a second test I read from the RAID6 and write back to the RAID6.
> I see MUCH lower speeds, again repeatable:
>
> dd if=SDDCopy of=HDDWrite
> 97656250+0 records in 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s

> 5) As a final test, and just looking for problems if any, I do an SDD to
> SDD copy which clocked in at close to 200MB/S
>
> dd if=random1 of=SDDCopy
> 97656250+0 records in 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s

> So, being that this RAID6 was grown yesterday from something that
> has existed for a year or two I'm not sure of it's fragmentation, or
> even how to determine that at this time. However it seems my problem are
> RAID6 reads, not RAID6 writes, at least to new an probably never used
> disk space.

Reading all that, one question occurs to me. If you want to test read
and write separately, why the intermediate step of dd-ing from
/dev/random to ssd, then from ssd to raid or ssd?

Why not do direct dd if=/dev/random (or urandom, see note below)
of=/desired/target ... for write tests, and then (after dropping caches),
if=/desired/target of=/dev/null ... for read tests? That way there's
just the one block device involved, not both.

/dev/random note: I presume with that hardware you have one of the newer
CPUs with the new Intel hardware random instruction, with the appropriate
kernel config hooking it into /dev/random, and/or otherwise have
/dev/random hooked up to a hardware random number generator. Otherwise,
requesting that much random data could block until more suitably random
data is generated from approved kernel sources. Thus, the following
probably doesn't apply to you, but it may well apply to others, and is
good practice in any case, unless you KNOW your random isn't going to
block due to hardware generation, and even then it's worth noting that
when you're posting examples like the above.

In general, for tests such as this where a LOT of random data is needed,
but cryptographic-quality random isn't necessarily required, use
/dev/urandom. In the event that the pool of real random data runs too
low, /dev/urandom will switch to pseudo-random generation, which should
be "good enough" for this sort of usage. /dev/random, OTOH, will block
until it gets more random data from sources the kernel trusts to be truly
random. So on machines with relatively limited sources of randomness the
kernel considers truly random, just grabbing 50 GB of data from
/dev/random could take QUITE some time (days maybe? I don't know).

Obviously you don't have /too/ big a problem with it as you got the data
from /dev/random, but it's worth noting. If your machine has a hardware
random generator hooked into /dev/random, then /dev/urandom will never
switch to pseudo-random in any case, so for tests of anything above
/kilobytes/ of random data (and even at that...), just use urandom and
you won't have to worry about it either way. OTOH, if you're generating
an SSH key or something, always use /dev/random as that needs
cryptographic-security-level randomness, but that'll take just a few
bytes of randomness, not kilobytes let alone gigabytes, and if your
hardware doesn't have good randomness and it does block, wiggling the
mouse around a bit (this obviously assumes a local session; a remote one
would need some other source of input) should give it enough randomness
to continue.
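
(If you're ever unsure whether /dev/random is about to block, the kernel
exposes its current entropy estimate -- just a quick check, nothing
specific to your setup:

cat /proc/sys/kernel/random/entropy_avail

A few thousand bits and small reads are fine; down in the double digits
and you can expect blocking.)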
207 |
|
208 |
|
209 |
Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as |
210 |
sink, with only the test-target block device as a real block device, |
211 |
should give you "purer" read-only and write-only tests. In theory it |
212 |
shouldn't matter much given your method of testing, but as we all know, |
213 |
theory and reality aren't always well aligned. |
214 |
|
215 |
|
216 |
Of course the next question follows on from the above. I see a write to |
217 |
the raid, and a copy from the raid to the raid, so read/write on the |
218 |
raid, and a copy from the ssd to the ssd, read/write on it, but no test |
219 |
of from the raid read. |
220 |
|
221 |
So |
222 |
|
223 |
if=/dev/urandom of=/mnt/raid/target ... should give you raid write. |
224 |
|
225 |
drop-caches |
226 |
|
227 |
if=/mnt/raid/target of=/dev/null ... should give you raid read. |
228 |
|
229 |
*THEN* we have good numbers on both to compare the raid read/write to. |
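
Spelled out, something like this (a sketch only -- /mnt/raid and the
10GB test size are placeholder assumptions; use your actual mount point
and whatever size comfortably exceeds what can be cached):

dd if=/dev/urandom of=/mnt/raid/writetest bs=1M count=10000  # write-only
sync
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/raid/writetest of=/dev/null bs=1M                 # read-only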

What I suspect you'll find, unless fragmentation IS your problem, is that
both read (from the raid) alone and write (to the raid) alone should be
much faster than read/write (from/to the raid).

The problem with read/write is that you're on "rotating rust" hardware
and there's some latency as it repositions the heads from the read
location to the write location and back.

If I'm correct and that's what you find, a workaround specific to dd
would be to specify a much larger block size, so it reads in far more
data at once, then writes it out at once, with far fewer switches between
modes. In the above you didn't specify bs (or the separate input/output
equivalents, ibs/obs respectively) at all, so it's using the 512-byte
blocksize default.

From what I know of hardware, 64KB is a standard read-ahead, so in theory
you should see improvements using larger block sizes up to at LEAST that
size, and on a 5-disk raid6, probably 3X that, 192KB, which should in
theory do a full 64KB buffer on each of the three data drives of the 5-
way raid6 (the other two being parity).

I'm guessing you'll see a "knee" at the 192 KB (that's 1024-based, not
1000-based, BTW) block size, and above that you might see improvement,
but not near as much, since the hardware should be doing the full 64KB
blocks it's optimized for. There's likely to be another knee at the 16MB
point (again, power of two, not 10), or more accurately, the 48MB point
(3*16MB), since that's the size of the device hardware buffers (again,
three devices' worth of data-stripe, since the other two are parity,
3*16MB=48MB). Above that, theory says you'll see even less improvement,
since the caches will be full and any improvement still seen should be
purely that of fewer switches between read/write mode and thus fewer
seeks.

But it'd be interesting to see how closely theory matches reality;
there's very possibly a fly in that theoretical ointment somewhere. =:^\
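
A sweep like the following would show it (again just a sketch -- the
/mnt/raid path and filenames are placeholders, reusing your existing
raid copy of the 50GB file as the source, and run as root for the
drop-caches):

for bs in 4K 64K 192K 1M 16M 48M; do
    sync
    echo 3 > /proc/sys/vm/drop_caches
    echo "bs=$bs"
    dd if=/mnt/raid/SDDCopy of=/mnt/raid/bstest bs=$bs
    rm /mnt/raid/bstest
done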

Of course configurable block size is specific to dd. Real life file
transfers may well be quite a different story. That's where the chunk
size, stripe size, etc, stuff comes in, setting the defaults for the
kernel for that device, and again, I'll freely admit to not knowing as
much as I could in that area.
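
(FWIW, if you want to line your test block sizes up with what the array
is actually using, mdadm will report the chunk size -- device name
assumed:

mdadm --detail /dev/md0 | grep -i chunk

/proc/mdstat shows it too.)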

> I will also report more later but I can state that just using top
> there's never much CPU usage doing this but a LOT of WAIT time when
> reading the RAID6. It really appears the system is spinning it's wheels
> waiting for the RAID to get data from the disk.

When you're dealing with spinning rust, any time you have a transfer of
any size (certainly GB), you WILL see high wait times. Disks are simply
SLOW. Even SSDs are no match for system memory, tho they're enough
closer to help a lot and can be close enough that the bottleneck is
elsewhere. (Modern SSDs saturate the SATA-600 links with thruput above
500 MByte/sec, making the SATA-600 bus the bottleneck, or the 1x PCI-E
2.x link if that's what it's running on, since that saturates at
485MByte/sec or so, tho PCI-E 3.x is double that so nearly a GByte/sec
and a single SATA-600 won't saturate that. Modern DDR3 SDRAM by
comparison runs 10+ GByte/sec LOW end, two orders of magnitude faster.
Numbers fresh from wikipedia, BTW.)

> One place where I wanted to double check your thinking. My thought
> is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as it
> has to read from three drives and make sure they are all good before
> returning data to the user. I don't see how that could ever be faster
> than what a single drive file system could do which for these drives
> would be the 113MB/S WD spec number, correct? As I'm currently getting
> 145MB/S it appears on the surface that the RAID6 is providing some
> value, at least in these early days of use. Maybe it will degrade over
> time though.

As someone else already posted, that's NOT correct. Neither raid1 nor
raid6, at least the mdraid implementations, verify the data. Raid1
doesn't have parity at all, just many copies, and raid6 has parity but
only uses it for rebuilds, NOT to check data integrity under normal usage
-- it too simply reads the data and returns it.
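
(If you ever do want the redundancy actually verified, mdraid can do it
on demand -- this just kicks off a background consistency check, array
name assumed:

echo check > /sys/block/md0/md/sync_action

Progress shows up in /proc/mdstat, and any inconsistencies found are
counted in /sys/block/md0/md/mismatch_cnt.)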

What raid1 does (when it's getting short reads only one at a time) is
send the request to every spindle. The first one that returns the data
wins; the others simply get their returns thrown away.

So under small-one-at-a-time reading conditions, the speed of raid1 reads
should be the speed of the fastest disk in the bunch.

The raid1 read advantage is in the fact that there's often more than one
read going on at once, or that the read is big enough to split up, so
different spindles can be seeking to and reading different parts of the
request in parallel. (This also helps in fragmented file conditions as
long as fragmentation isn't overwhelming, since a raid1 can then send
different spindle heads to read the different segments in parallel,
instead of reading one at a time serially, as it would have to do in a
single spindle case.)

In theory, the stripes of raid6 /can/ lead to better thruput for reads.
In fact, my experience both with raid6 and with raid0 demonstrates that
not to be the case as often as one might expect, due either to small
reads or to fragmentation breaking up the big reads, thus negating the
theoretical thruput advantage of multiple stripes.

To be fair, my raid0 experience was, as I mentioned earlier, with files I
could easily redownload from the net, mostly the portage tree and
overlays, along with the kernel git tree. Due to the frequency of update
and the fast rate of change as well as the small files, fragmentation was
quite a problem, and the files were small enough I likely wouldn't have
seen the full benefit of the 4-way raid0 stripes in any case, so that
wasn't a best-case test scenario. But it's what one practically puts on
raid0, because it IS easily redownloaded from the net, so it DOESN'T
matter that a loss of any of the raid0 component devices will kill the
entire thing.

If I'd been using the raid0 for much bigger media files, mp3s or video of
megabytes in size minimum, that get saved and never changed so there's
little fragmentation, I expect my raid0 experience would have been *FAR*
better. But at the same time, that's not the type of data that it
generally makes SENSE to store on a raid0 without backups or redundancy
of any sort, unless it's simply VDR files where, if a device drops from
the raid and you lose them, you don't particularly care (which would make
a GREAT raid0 candidate), so...
346 |
|
347 |
Raid6 is the stripes of raid0, plus two-way-parity. So since the parity |
348 |
is ignored for reads, for them it's effectively raid0 with two less |
349 |
stripes then the number of devices. Thus your 5-device raid6 is |
350 |
effectively a 3-device raid0 in terms of reads. In theory, thruput for |
351 |
large reads done by themselves should be pretty good -- three times that |
352 |
of a single device. In fact... due either to multiple jobs happening at |
353 |
once, or to a mix of read/write happening at once, or to fragmentation, I |
354 |
was disappointed, and far happier with raid1. |
355 |
|
356 |
But your situation is indeed rather different than mine, and depending on |
357 |
how much writing happens in those big VM files and how the filesystem you |
358 |
choose handles fragmentation, you could be rather happier with raid6 than |
359 |
I was. |
360 |
|
361 |
But I'd still suggest you try raid1 if the amount of data you're handling |
362 |
will let you. Honestly, it surprised me how well raid1 did for me. I |
363 |
wasn't prepared for that at all, and I believe that in comparison to what |
364 |
I was getting on raid6 is what colored my opinion of raid6 so badly. I |
365 |
had NO IDEA there would be that much difference! But your experience may |
366 |
indeed be different. The only way to know is to try it. |
367 |
|
368 |
However, one thing I either overlooked or that hasn't been posted yet is |
369 |
just how much data you're talking about. You're running five 500-gig |
370 |
drives in raid6 now, which should give you 3*500=1500 gigs (10-power) |
371 |
capacity. |
372 |
|
373 |
If it's under a third full, 500 MB (10-power), you can go raid1 with as |
374 |
many mirrors as you like of the five, and keep the rest of them for hot- |
375 |
spares or whatever. |
376 |
|
377 |
If you're running (or plan to be running) near capacity, over 2/3 full, 1 |
378 |
TB (10-power), you really don't have much option but raid6. |
379 |
|
380 |
If you're in between, 1/3 to 2/3 full, 500-1000 GB (10-power), then a |
381 |
raid10 is possible, perhaps 4-spindle with the 5th as a hot-spare. |
382 |
|
383 |
(A spindle configured as a hot-spare is kept unused but ready for use by |
384 |
mdadm and the kernel. If a spindle should drop out, the hot-spare is |
385 |
automatically inserted in its place and a rebuild immediately started. |
386 |
This narrows the danger zone during which you're degraded and at risk if |
387 |
further spindles drop out, because handling is automatic so you're back |
388 |
to full un-degraded as soon as possible. However, it doesn't eliminate |
389 |
that danger zone should another one drop out during the rebuild, which is |
390 |
after all quite stressful on the remaining drives since due to all that |
391 |
reading going on, so the risk is greater during a rebuild than under |
392 |
normal operation.) |
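
(Purely to illustrate the shape of such a setup -- device names are
hypothetical and this is NOT something to run against a live array:

mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      --spare-devices=1 /dev/sd[b-f]1

That's a 4-spindle raid10 with the fifth device registered as the
hot-spare from the start.)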
393 |
|
394 |
So if you're over 2/3 full, or expect to be in short order, there's |
395 |
little sense in further debate on at least /your/ raid6, as that's pretty |
396 |
much what you're stuck with. (Unless you can categorize some data as |
397 |
more important than other, and raid it, while the other can be considered |
398 |
worth the risk of loss if the device goes, in which case we're back in |
399 |
play with other options once again.) |
400 |
|
401 |
-- |
402 |
Duncan - List replies preferred. No HTML msgs. |
403 |
"Every nonfree program has a lord, a master -- |
404 |
and if you use the program, he is your master." Richard Stallman |