Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:

> Does anyone know of info on how the starting sector number might
> impact RAID performance under Gentoo? The drives are WD-500G RE3
> drives shown here:
>
> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top
>
> These are NOT 4k sector sized drives.
>
> Specifically I'm running a 5-drive RAID6 for about 1.45TB of storage.
> My benchmarking seems abysmal at around 40MB/s using dd copying large
> files. It's higher, around 80MB/s if the file being transferred is
> coming from an SSD, but even 80MB/s seems slow to me. I see a LOT of
> wait time in top. And my 'large file' copies might not be large
> enough as the machine has 24GB of DRAM and I've only been copying
> 21GB, so it's possible some of that is cached.

I /suspect/ that the problem isn't striping, tho that can be a factor,
but rather your choice of raid6. Note that I personally ran md/raid6
here for a while, so I know a bit of what I'm talking about. I didn't
realize the full implications of what I was setting up originally, or
I'd not have chosen raid6 in the first place, but live and learn as
they say, and that I did.

General rule: raid6 is abysmal for writing and gets dramatically worse
as fragmentation sets in, tho reading is reasonable. The reason is
that in order to properly parity-check and write out
less-than-full-stripe writes, the system must effectively read in the
existing data and merge it with the new data, then recalculate the
parity, before writing the new data AND 100% of the (two-way in raid6)
parity. Further, because raid sits below the filesystem level, it
knows nothing about what parts of the filesystem are actually used,
and must read and write the FULL data stripe (perhaps minus the new
data bit, I'm not sure), including parts that will be empty on a
freshly formatted filesystem.

So with 4k chunk sizes on a 5-device raid6, you'd have 20k stripes:
12k of data across three devices, and 8k of parity across the other
two devices. Now you go to write a 1k file, but in order to do so the
full 12k of existing data must be read in, even on an empty
filesystem, because the RAID doesn't know it's empty! Then the new
data must be merged in and new parity calculated, then the full 20k
must be written back out: certainly the 8k of parity, but also likely
the full 12k of data even if most of it is simply a rewrite, and
almost certainly at least the 4k strip on the device the new data
lands on.

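Just to make the write amplification concrete, here's a rough
back-of-the-envelope sketch (Python, purely illustrative; the chunk
size and device count are the numbers from the example above, and real
md has shortcuts this worst-case model ignores):

  # Worst-case I/O for a sub-stripe write on a 5-device raid6,
  # vs the same small write on raid1.  Illustrative only.

  CHUNK_KB = 4    # per-device chunk size from the example
  DEVICES = 5     # 5-drive array
  PARITY = 2      # raid6 keeps two parity chunks per stripe

  def raid6_small_write(new_data_kb):
      """Worst-case I/O to commit a sub-stripe write of new_data_kb."""
      stripe_data_kb = (DEVICES - PARITY) * CHUNK_KB   # 12k of data
      assert new_data_kb <= stripe_data_kb             # sub-stripe only
      read_kb = stripe_data_kb          # existing data read back in
      write_kb = DEVICES * CHUNK_KB     # 20k: data plus both parities
      return read_kb, write_kb

  def raid1_small_write(new_data_kb, mirrors=5):
      """raid1 just writes the block to every mirror; nothing to read."""
      return 0, new_data_kb * mirrors   # 1k hits each spindle once

  print(raid6_small_write(1))   # -> (12, 20): 32k of I/O for a 1k write
  print(raid1_small_write(1))   # -> (0, 5):   1k written per mirror
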
As I said, that gets much worse as a filesystem ages, because
fragmentation means a write more often lands on, say, 3 partial
stripes instead of a single whole stripe. That's what proper stride
size, etc, can help with, if the filesystem's reasonably fragmentation
resistant, but even then filesystem aging certainly won't /help/.

Reads, meanwhile, are of reasonable speed (in normal non-degraded
mode), because on a raid6 the data is at least two-way striped (that's
on a 4-device raid; your 5-device array would have three-way striped
data, the other two devices being parity of course), so you do get a
moderate striped-read bonus.

Then there's all that parity information, available and written out at
every write, but it's not actually used to check the reliability of
the data in normal operation, only to reconstruct if a device or two
goes missing.

On a well laid out system, I/O to the separate drives at least
shouldn't interfere with each other, assuming SATA and a chipset and
bus layout that can handle them in parallel. That's not /that/ big a
feat on today's hardware, at least as long as you're still doing
"spinning rust", since the mechanical drive latency is almost
certainly the bottleneck there, and at least that can be parallelized
to a reasonable degree across the individual drives.

What I ultimately came to realize here is that unless the job at hand
is nearly 100% read on the raid, and with the caveat that you have
enough space, raid1 is almost certainly at least as good if not a
better choice. If you have the devices to support it, you can go for
raid10/50/60, and a raid10 across 5 devices is certainly possible with
mdraid, but a straight raid6... you're generally better off with an
N-way raid1, for a couple of reasons.

First, md/raid1 is surprisingly, even astoundingly, good at scheduling
reads across multiple tasks. Any time there are multiple read tasks
going on (like during boot), raid1 works really well, with the
scheduler distributing tasks among the available devices, thus
minimizing seek latency. So take a 5-device raid1: you can very likely
accomplish at least 5, and possibly 6 or even 7, read jobs in, say,
110-120% of the time it would take to do just the longest one on a
single device, and almost certainly well before a single device could
have done the two longest read jobs. This also works if there's a
single task alternating reads of N different files/directories, since
the scheduler will again distribute jobs among the devices. Say one
device's head stays over the directory information, while another goes
to read the first file, a second reads another file, etc. The heads
stay where they are until they're needed elsewhere, so the more
devices you have in the raid1, the more likely it is that another read
from the same location still has a head right over it and can simply
read the data as the correct portion of the disk spins underneath,
instead of first seeking to the correct spot.

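If it helps, here's a toy model of that claim. The job durations are
made up and the real md read balancer is much smarter than a greedy
least-loaded assignment, but it shows why several jobs can finish in
little more than the time of the longest one:

  # Toy illustration: K read jobs spread across an N-way raid1 vs
  # queued one after another on a single spindle.  Numbers invented.

  job_seconds = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # 7 read jobs
  mirrors = 5

  single_device = sum(job_seconds)          # serial: 4.9s

  loads = [0.0] * mirrors                   # greedy least-loaded mirror
  for job in sorted(job_seconds, reverse=True):
      loads[loads.index(min(loads))] += job
  raid1_wall = max(loads)                   # 1.1s here

  print(f"single device: {single_device:.1f}s")
  print(f"5-way raid1  : {raid1_wall:.1f}s "
        f"(longest single job {max(job_seconds):.1f}s)")
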
It's worth pointing out that in the case of parallel-job read access,
due to this parallel read-scheduling, md/raid1 can often best raid0
performance, despite raid0's technically better single-job thruput
numbers. This was something I learned by experience as well; it makes
sense, but I had TOTALLY not realized or calculated for it in my
original setup, as I was running raid0 for things like the gentoo
ebuild tree and the kernel sources, since I didn't need redundancy for
them. My raid0 performance there was rather disappointing, because
portage tree updates, dep calculation, and the kernel build process
don't benefit much from thruput, which is what raid0 optimizes for,
but rather from parallel I/O, where raid1 shines, especially for
reads.

Second, md/raid1 writes, because they happen in parallel with the
bottleneck being the spinning rust, basically occur at the speed of
the slowest disk. So you don't get N-way parallel write speed, just
single-disk speed, but it's still *WAY* *WAY* better than raid6, which
has to read in the existing data and do the merge before it can write
back out. **THAT'S THE RAID6 PERFORMANCE KILLER**, or at least it was
for me, effectively giving you half-device-speed writes because in too
many cases the data must be read in before it can be written. Raid1
doesn't have that problem -- it doesn't get a write-performance
multiplier from the N devices, but at least it doesn't get device
performance cut in half like raid5/6 does.

Third, the read-scheduling benefits of #1 help, to a lesser extent,
with large same-raid1 copies as well. Consider: the first block must
be read by one device, then written to all of them at the new
location. The second similarly, then the third, etc. But with proper
scheduling, an N-way raid1 doing an N-block copy has done only N+1
operations on each device at the end of that copy. IOW, given the
memory to use as a buffer, the reads can be done in parallel, reading
N blocks in at once, one from each device, then the writes, one block
at a time to all devices. So a 5-way raid1 will have done 6 jobs on
each of the 5 devices at the end, 1 read and 5 writes, to copy 5
blocks. (In actuality, due to read-ahead I think it's optimally 64k
per device, 16 4k blocks on each, 320k total, but that's well within
the usual minimal 2MB drive buffer size, and the drive will probably
do that on its own if both read- and write-caching are on, given
scheduling that forces a cache-flush only at the end, not multiple
times in the middle. So all the kernel has to do is be sure it's not
interfering by forcing untimely flushes, and the drives should
optimize on their own.)

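Spelled out as arithmetic (illustrative only; the kernel doesn't
literally schedule it this way):

  # Per-device operation count for copying N blocks within an N-way
  # raid1, under the idealized scheduling described above.

  def raid1_copy_ops_per_device(mirrors):
      reads = 1          # each device reads one distinct block in parallel
      writes = mirrors   # every block then gets written to every mirror
      return reads + writes

  print(raid1_copy_ops_per_device(5))   # -> 6 ops per device for 5 blocks
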
Fourth, back to the parity. Remember, raid5/6 has all that parity that
it writes out (but basically never reads in normal mode, only when
degraded, in order to reconstruct the data from the missing
device(s)), yet it doesn't actually use it for integrity checking. So
while raid1 doesn't have the benefit of that parity data, it's not
like raid5/6 used it anyway, and an N-way raid1 means even MORE
missing-device protection, since you can lose all but one device and
keep right on going as if nothing happened. So a 5-way raid1 can lose
4 devices, not just the two devices of a 5-way raid6 or the single
device of a raid5. Yes, there's the loss of parity/integrity data with
raid1, BUT RAID5/6 DOESN'T USE THAT DATA FOR INTEGRITY CHECKING
ANYWAY, ONLY FOR RECONSTRUCTION IN THE CASE OF DEVICE LOSS! So the
N-way raid1 is far more redundant, since you have N copies of the
data, not one copy plus
two-way-parity-that's-never-used-except-for-reconstruction.

Fifth, in the event of device loss, a raid1 continues to function at
normal speed, because it's simply an N-way copy with a bit of extra
metadata to keep track of how many ways it's mirrored. Of course
you'll lose the read-parallelization benefit that the missing device
provided, and you'll lose its redundancy, but in general, performance
remains pretty much the same no matter how many ways it's raid1
mirrored. Contrast that with raid5/6, whose read performance is
SEVERELY impacted by device loss, since it must then reconstruct the
data using the parity data, not simply read it from somewhere else,
which is what raid1 does.

The single downside to raid1 as opposed to raid5/6 is the loss of the
extra space made available by the data striping: 3*single-device-space
in the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space
in the case of raid1. Otherwise, no contest, hands down, raid1 over
raid6.

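In concrete numbers, assuming five 500GB drives like the ones in the
original post (and ignoring superblock/metadata overhead):

  # Rough usable capacity of five 500GB devices under the layouts
  # discussed; overhead ignored.

  DEVICE_GB, N = 500, 5

  layouts = {
      "raid6 (2 parity)":     (N - 2) * DEVICE_GB,  # ~1500GB, any 2 can die
      "raid5 (1 parity)":     (N - 1) * DEVICE_GB,  # ~2000GB, any 1 can die
      "raid1 (N-way mirror)": DEVICE_GB,            # ~500GB, N-1 can die
  }

  for name, usable in layouts.items():
      print(f"{name:22s} ~{usable}GB usable")
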
IOW, you're seeing now exactly why raid6, and to a lesser extent
raid5, have such terrible performance (as opposed to reliability)
reputations. Really, unless you simply don't have the space to make it
raid1, I **STRONGLY** urge you to try that instead. I know I was very
happily surprised by the results I got, and only then realized what
all the negativity I'd seen around raid5/6 had been about, as I really
hadn't understood it at all when I was doing my original research.

Meanwhile, Rich0 already brought up btrfs, which really does promise a
better solution to many of these issues than md/raid, in part due to
what is arguably a "layering violation", but which really DOES allow
for some serious optimizations in the multiple-drive case: as a
filesystem, it DOES know what's real data and what's empty space that
isn't worth worrying about, and unlike raid5/6 parity, it really DOES
care about data integrity, not just rebuilding in case of device
failure.

So several points on btrfs:

1) It's still in heavy development. The base single-device filesystem
case works reasonably well now and is /almost/ stable, tho I'd still
urge people to keep good backups as it's simply not time-tested and
settled, and won't be for at least a few more kernels as they're still
busy with the other features. Second-level raid0/raid1/raid10 support
is at an intermediate stage: primary development and initial bug
testing and fixing is done, but they're still working on bugs that
people doing only traditional single-device filesystems simply don't
have to worry about. Third-round raid5/6 is still very new, introduced
as VERY experimental only with 3.9 IIRC, and is currently EXPECTED to
eat data in power-loss or crash events, so it's ONLY good for
preliminary testing at this point.

Thus, if you're using btrfs at all, keep good backups, and keep
current, even -rc (if not live-git), on the kernel, because there
really are fixes in every single kernel for very real corner-case
problems they are still coming across. But single-device is
/relatively/ stable now, so provided you keep good *TESTED* backups
and are willing and able to use them if it comes to it, and keep
current on the kernel, go for that. I'm personally running dual-device
raid1 mode across two SSDs, at that second-stage deployment level. I
tried that (but still on spinning rust) a year ago and decided btrfs
simply wasn't ready for me yet, so it has come quite a way in the last
year. But raid5/6 mode is still fresh third-tier development, which
I'd not consider usable until at LEAST 3.11 and probably 3.12 or later
(maybe a year from now, since it's less mature than raid1 was at this
point last year, but should mature a bit faster).

Takeaway: If you don't have a backup you're prepared to use, you
shouldn't even be THINKING about btrfs at this point, no matter WHAT
type of deployment you're considering. If you do, you're probably
reasonably safe with traditional single-device btrfs, and
intermediately risky/safe with raid0/1/10, but don't even think about
raid5/6 for real deployment yet, period.

2) RAID levels work QUITE a bit differently on btrfs. In particular,
what btrfs calls raid1 mode (with the same applying to raid10) is
simply two-way mirroring, NO MATTER THE NUMBER OF DEVICES. There's no
multi-way mirroring available yet, unless you're willing to apply
not-yet-mainstreamed patches. It's planned, but not yet applied. The
roadmap says it'll happen after raid5/6 are introduced (they have
been, but aren't yet really finished, including power-loss recovery,
etc), so I'm guessing 3.12 at the earliest, as I think 3.11 is still
focused on raid5/6 completion.

3) Btrfs raid1 mode is used to provide a second source for its data
integrity feature as well, such that if one copy's checksum doesn't
verify, it'll try the other one. Unfortunately #2 means there's only
the single fallback to try, but that's better than most filesystems,
which have no data integrity checking at all, or if they do have it,
no fallback when it fails.

The combination of #2 and #3 was a bitter pill for me a year ago, when
I was still running on aging spinning rust and thus didn't trust
two-copy-only redundancy. I really like the data integrity feature,
but just a single backup copy was a great disappointment since I
didn't trust my old hardware, and unfortunately two-copy-max remains
the case for so-called raid1. (Raid5/6 mode apparently introduces
N-way copies or some such, but as I said, it's not complete yet and is
EXPECTED to eat data. N-way mirroring will build on that and is on the
horizon, but it has been on the horizon, not seeming to get much
closer, for over a year now...) Fortunately for me, my budget is in
far better shape this year, and with the dual new SSDs I purchased,
and spinning rust still for backup, I trust my hardware enough now to
run the two-way-only mirroring that btrfs calls raid1 mode.

4) As mentioned above in the btrfs intro paragraph, btrfs, being a
filesystem, actually knows what data is actual data, and what is
safely left untracked and thus unsynced. Thus, the
read-data-in-before-writing-it problem will be rather less of an
issue, certainly on freshly formatted disks where most existing data
WILL be garbage/zeros (trimmed if on SSD, as mkfs.btrfs issues a trim
command for the entire filesystem range before it creates the
superblocks, etc, so empty space really /is/ zeroed). Similarly with
"slack space" that's not currently used but was used previously, as
the filesystem ages -- btrfs knows that it can ignore that data too,
and thus won't have to read it in to update the parity when writing to
a raid5/6 mode btrfs.

5) There are various other nice btrfs features, and a few caveats as
well, but with the exception of anything btrfs-raid-related that I've
totally forgotten about, they're out of scope for this thread, which
is, after all, about raid, so I'll skip discussing them here.

So bottom line, I really recommend md/raid1 for now, unless you want
to go md/raid10 with three-way mirroring on the raid1 side. AFAIK
that's doable with 5 devices, but it's simpler, certainly conceptually
simpler (which can make a difference to an admin trying to work with
it), with 6.

If the data simply won't fit on the 5-way raid1 and you want to keep
at least 2-device-loss protection, consider splitting it up: raid1
with three devices for the first half, then either get a sixth device
to do the same with the second half, or go raid1 with two devices and
put your less critical data on the second set. Or do the
raid10-with-5-devices thing, but I'll admit that while I've read that
it's possible, I don't really conceptually understand it myself, and
haven't tried it, so I have no personal opinion or experience to offer
on that. But in that case I really would try to scrape up the money
for a sixth device if possible, and do raid10 with 3-way redundancy
and 2-way striping across the six, simply because it's easier to
conceptualize and thus to properly administer.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman