Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Fri, 21 Jun 2013 07:31:58
Message-Id: pan$ecf3f$9af69a78$1508667e$d81347b7@cox.net
In Reply to: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? by Mark Knecht
Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:

> Does anyone know of info on how the starting sector number might
> impact RAID performance under Gentoo? The drives are WD-500G RE3 drives
> shown here:
>
> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top
>
> These are NOT 4k sector sized drives.
>
> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My
> benchmarking seems abysmal at around 40MB/S using dd copying large
> files.
> It's higher, around 80MB/S if the file being transferred is coming from
> an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in
> top.
> And my 'large file' copies might not be large enough as the machine has
> 24GB of DRAM and I've only been copying 21GB so it's possible some of
> that is cached.

I /suspect/ that the problem isn't striping, tho that can be a factor,
but rather your choice of raid6.  Note that I personally ran md/raid6
here for a while, so I know a bit of what I'm talking about.  I didn't
realize the full implications of what I was setting up originally, or
I'd not have chosen raid6 in the first place, but live and learn as they
say, and that I did.

General rule: raid6 is abysmal for writing and gets dramatically worse
as fragmentation sets in, tho reading is reasonable.  The reason is that
in order to properly parity-check and write out less-than-full-stripe
writes, the system must effectively read in the existing data and merge
it with the new data, then recalculate the parity, before writing the
new data AND 100% of the (two-way in raid6) parity.  Further, because
raid sits below the filesystem level, it knows nothing about what parts
of the filesystem are actually used, and must read and write the FULL
data stripe (perhaps minus the new data bit, I'm not sure), including
parts that will be empty on a freshly formatted filesystem.

So with 4k chunk sizes on a 5-device raid6, you'd have 20k stripes: 12k
of data across three devices, and 8k of parity across the other two
devices.  Now you go to write a 1k file, but in order to do so the full
12k of existing data must be read in, even on an empty filesystem,
because the RAID doesn't know it's empty!  Then the new data must be
merged in and new parity calculated, then the full 20k must be written
back out -- certainly the 8k of parity, but also likely the full 12k of
data even if most of it is simply a rewrite, and almost certainly at
least the 4k strip on the device the new data is written to.
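
To put rough numbers on that, here's a quick back-of-the-envelope sketch
(python, just arithmetic; the 4k chunk, 5 devices, and the naive
read-the-whole-data-stripe-then-rewrite behavior are the assumptions
from the example above, not measurements of the actual md code, which
has optimizations I'm glossing over):

# raid6 partial-stripe write penalty, per the example above
CHUNK_KIB   = 4                  # per-device chunk ("strip") size
DEVICES     = 5
PARITY      = 2                  # raid6 = two parity chunks per stripe
DATA_CHUNKS = DEVICES - PARITY

data_kib   = CHUNK_KIB * DATA_CHUNKS   # 12k of data per stripe
parity_kib = CHUNK_KIB * PARITY        #  8k of parity per stripe

new_kib   = 1                          # the 1k file being written
read_kib  = data_kib                   # existing data read back in first
write_kib = data_kib + parity_kib      # data plus recalculated parity out

print(f"to store {new_kib}k: read {read_kib}k, write {write_kib}k "
      f"(~{(read_kib + write_kib) // new_kib}x the payload)")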

As I said, that gets much worse as a filesystem ages, since fragmentation
means a write more often touches, say, three partial stripes instead of a
single whole one.  That's what a proper stride size, etc, can help with
(see the sketch below), if the filesystem's reasonably fragmentation
resistant, but even then filesystem aging certainly won't /help/.
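
(For reference, the stride/stripe-width figures a filesystem wants are
just simple arithmetic from the md chunk size and the number of
data-bearing devices.  A sketch, with example numbers of my own choosing
-- I don't know Mark's actual chunk size:

# derive ext4 tuning values for an md array (illustrative numbers only)
md_chunk_kib = 64        # md chunk size per device (example value)
fs_block_kib = 4         # ext4 block size
data_disks   = 3         # 5-device raid6 -> 3 data chunks per stripe

stride       = md_chunk_kib // fs_block_kib   # fs blocks per md chunk
stripe_width = stride * data_disks            # fs blocks per full data stripe

print(stride, stripe_width)   # 16 48

Those would then go to mkfs.ext4 as -E stride=...,stripe_width=..., if
memory serves.)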

Reads, meanwhile, are reasonable speed (in normal non-degraded mode),
because on a raid6 the data is at least two-way striped (that's a
4-device raid6; your 5-device array stripes data three ways, the other
two chunks being parity, of course), so you do get a moderate
striped-read bonus.

Then there's all that parity information, written out at every write,
but it's not actually used to check the integrity of the data in normal
operation, only to reconstruct it if a device or two goes missing.

On a well laid out system, I/O to the separate drives at least shouldn't
interfere with each other, assuming SATA and a chipset and bus layout
that can handle them in parallel -- not /that/ big a feat on today's
hardware, at least as long as you're still doing "spinning rust", since
the mechanical drive latency is almost certainly the bottleneck there,
and that at least can be parallelized to a reasonable degree across the
individual drives.

What I ultimately came to realize here is that unless the job at hand is
nearly 100% read, and with the caveat that you have enough space, raid1
is almost certainly at least as good a choice if not better.  If you
have the devices to support it, you can go for raid10/50/60, and a
raid10 across 5 devices is certainly possible with mdraid, but a
straight raid6... you're generally better off with an N-way raid1, for
several reasons.

First, md/raid1 is surprisingly, even astoundingly, good at scheduling
reads across multiple tasks.  Any time there are multiple read tasks
going on (like during boot), raid1 works really well, with the scheduler
distributing them among the available devices, thus minimizing seek
latency.  Take a 5-device raid1: you can very likely complete at least
5, and possibly 6 or even 7, read jobs in say 110-120% of the time it
would take to do just the longest one on a single device -- almost
certainly well before a single device could have done the two longest
jobs.  This also works if a single task alternates reads of N different
files/directories, since the scheduler will again spread the work among
the devices: say one device's head stays over the directory information
while another reads the first file, a third reads another file, etc.
The heads stay where they are until they're needed elsewhere, so the
more devices in the raid1, the more likely it is that data read from the
same location still has a head sitting right over it and can be read as
the correct portion of the disk spins underneath, instead of seeking
there first.
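
If you want to put toy numbers on that claim, here's a little model
(greedy assignment of made-up job lengths -- nothing to do with the
actual md read-balancing code, which as I understand it uses
nearest-head/least-busy heuristics; this just illustrates the
parallelism):

import heapq

def makespan(jobs, devices):
    """Longest-job-first greedy assignment; returns total elapsed time."""
    loads = [0.0] * devices
    heapq.heapify(loads)
    for job in sorted(jobs, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + job)
    return max(loads)

jobs = [10, 9, 7, 5, 4, 3, 2]                 # arbitrary read-job durations
print("single device:", makespan(jobs, 1))    # 40.0 -- strictly serial
print("5-way raid1  :", makespan(jobs, 5))    # 10.0 -- about the longest job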

It's worth pointing out that for parallel read access, this
read-scheduling means md/raid1 can often beat raid0, despite raid0's
technically better single-job thruput numbers.  This was something I
learned by experience as well; it makes sense, but I had TOTALLY not
accounted for it in my original setup, where I was running raid0 for
things like the gentoo ebuild tree and the kernel sources, since I
didn't need redundancy for them.  My raid0 performance there was rather
disappointing, because portage tree updates, dep calculation and the
kernel build process don't benefit much from raw thruput, which is what
raid0 provides, but benefit rather more from parallel I/O, where raid1
shines, especially for reads.

Second, md/raid1 writes, because they happen in parallel with the
spinning rust as the bottleneck, basically occur at the speed of the
slowest disk.  So you don't get N-way parallel write speed, just
single-disk speed, but it's still *WAY* *WAY* better than raid6, which
has to read in the existing data and do the merge before it can write
back out.  **THAT'S THE RAID6 PERFORMANCE KILLER**, or at least it was
for me: it effectively gives you half-device-speed writes, because in
too many cases the data must be read in before it can be written.  Raid1
doesn't have that problem -- it doesn't get a write performance
multiplier from the N devices, but at least it doesn't get device
performance cut in half like raid5/6 does.
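
(In crude numbers, under the assumption that a partial-stripe write costs
roughly one read pass plus one write pass on the busy devices, and
picking an arbitrary 100 MB/s spinning-rust streaming rate rather than
anything measured:

device_mb_s = 100
raid1_write = device_mb_s        # every mirror writes the same data in parallel
raid6_rmw   = device_mb_s / 2    # read-modify-write: ~two passes per write
print(raid1_write, raid6_rmw)    # 100 50.0

That halving is exactly the "performance killer" above.)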

Third, the read-scheduling benefits of #1 help, to a lesser extent, with
large same-raid1 copies as well.  Consider: the first block must be read
by one device, then written to all of them at the new location.  The
second similarly, then the third, etc.  But with proper scheduling, an
N-way raid1 doing an N-block copy has done only N+1 operations on each
device by the end of that copy.  IOW, given the memory to use as a
buffer, the reads can be done in parallel, pulling N blocks in at once,
one from each device, then the writes go out one block at a time to all
devices.  So a 5-way raid1 will have done 6 jobs on each of the 5
devices at the end -- 1 read and 5 writes -- to copy 5 blocks.  (In
actuality, due to read-ahead, I think it's optimally 64k per device, 16
4k blocks on each, 320k total, but that's well within the usual minimum
2MB drive buffer size, and the drive will probably do that on its own if
both read- and write-caching are on, given scheduling that forces a
cache flush only at the end, not multiple times in the middle.  So all
the kernel has to do is be sure it's not interfering by forcing untimely
flushes, and the drives should optimize on their own.)
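
The bookkeeping from that paragraph, spelled out (per-device operation
counts only; block size, read-ahead and caching details ignored):

def ops_per_device(n_way, blocks):
    """Copy `blocks` blocks within an n_way raid1; count ops per device."""
    reads  = -(-blocks // n_way)   # ceil: reads are spread across the mirrors
    writes = blocks                # every mirror writes every block
    return reads + writes

print(ops_per_device(n_way=5, blocks=5))    # 6 -- the "6 jobs" above
print(ops_per_device(n_way=1, blocks=5))    # 10 -- same copy on a lone disk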

Fourth, back to the parity.  Remember, raid5/6 has all that parity that
it writes out (but basically never reads in normal mode, only when
degraded, in order to reconstruct the data from the missing device(s)),
yet doesn't actually use it for integrity checking.  So while raid1
doesn't have the benefit of that parity data, it's not like raid5/6 uses
it anyway, and an N-way raid1 means even MORE missing-device protection,
since you can lose all but one device and keep right on going as if
nothing happened.  So a 5-way raid1 can lose 4 devices, not just the two
devices of a 5-way raid6 or the single device of a raid5.  Yes, there's
the loss of parity/integrity data with raid1, BUT RAID5/6 DOESN'T USE
THAT DATA FOR INTEGRITY CHECKING ANYWAY, ONLY FOR RECONSTRUCTION IN THE
CASE OF DEVICE LOSS!  So the N-way raid1 is far more redundant, since
you have N copies of the data, not one copy plus
two-way-parity-that's-never-used-except-for-reconstruction.

Fifth, in the event of device loss, a raid1 continues to function at
normal speed, because it's simply an N-way copy with a bit of extra
metadata to keep track of the mirrors.  Of course you'll lose the
read-parallelization benefit the missing device provided, and you'll
lose its redundancy, but in general, performance remains pretty much the
same no matter how many ways it's raid1 mirrored.  Contrast that with
raid5/6, whose read performance is SEVERELY impacted by device loss,
since it must then reconstruct the data using the parity, not simply
read it from somewhere else, which is what raid1 does.

The single downside to raid1 as opposed to raid5/6 is the loss of the
extra space made available by the data striping: 3*single-device-space
in the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space in
the case of raid1.  Otherwise, no contest, hands down, raid1 over raid6.
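
(The space trade-off in single-device units, matching the figures just
above:

def usable(devices, level):
    return {"raid1": 1,                  # N identical copies
            "raid5": devices - 1,        # one device's worth of parity
            "raid6": devices - 2}[level] # two devices' worth of parity

print(usable(5, "raid6"), usable(4, "raid5"), usable(5, "raid1"))   # 3 3 1

Whether trading 3x the space for the write behavior described above is
worth it is of course the whole question.)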


IOW, you're seeing now exactly why raid6 and, to a lesser extent, raid5
have such terrible performance (as opposed to reliability) reputations.
Really, unless you simply don't have the space to make it raid1, I
**STRONGLY** urge you to try that instead.  I was very happily surprised
by the results I got, and only then realized what all the negativity I'd
seen around raid5/6 had been about, as I really hadn't understood it at
all when I was doing my original research.


Meanwhile, Rich0 already brought up btrfs, which really does promise a
better solution to many of these issues than md/raid, in part due to
what is arguably a "layering violation", but one that really DOES allow
for some serious optimizations in the multiple-drive case: as a
filesystem, it DOES know what's real data and what's empty space that
isn't worth worrying about, and unlike raid5/6 parity, it really DOES
care about data integrity, not just rebuilding in case of device failure.

So several points on btrfs:

1) It's still in heavy development.  The base single-device filesystem
case works reasonably well now and is /almost/ stable, tho I'd still
urge people to keep good backups as it's simply not time-tested and
settled, and won't be for at least a few more kernels as they're still
busy with the other features.  Second-stage raid0/raid1/raid10 support
is at an intermediate state: primary development and initial bug testing
and fixing are done, but they're still working on bugs that people
running only traditional single-device filesystems simply don't have to
worry about.  Third-round raid5/6 support is still very new, introduced
as VERY experimental only in 3.9 IIRC, and is currently EXPECTED to eat
data in power-loss or crash events, so it's ONLY good for preliminary
testing at this point.

Thus, if you're using btrfs at all, keep good backups, and keep current,
even -rc (if not live-git), on the kernel, because there really are
fixes in every single kernel for very real corner-case problems they are
still coming across.  But single-device is /relatively/ stable now, so
provided you keep good *TESTED* backups and are willing and able to use
them if it comes to it, and keep current on the kernel, go for that.
I'm personally running dual-device raid1 mode across two SSDs -- the
second-stage deployment level.  I tried that (but still on spinning
rust) a year ago and decided btrfs simply wasn't ready for me yet, so it
has come quite a way in the last year.  But raid5/6 mode is still fresh
third-tier development, which I'd not consider usable until at LEAST
3.11 and probably 3.12 or later (maybe a year from now, since it's less
mature than raid1 was at this point last year, but it should mature a
bit faster).

Takeaway: If you don't have a backup you're prepared to use, you
shouldn't be even THINKING about btrfs at this point, no matter WHAT
type of deployment you're considering.  If you do, you're probably
reasonably safe with traditional single-device btrfs, intermediately
risky/safe with raid0/1/10, and shouldn't even think about raid5/6 for
real deployment yet, period.

2) RAID levels work QUITE a bit differently on btrfs.  In particular,
what btrfs calls raid1 mode (with the same applying to raid10) is simply
two-way mirroring, NO MATTER THE NUMBER OF DEVICES.  There's no
multi-way mirroring yet available, unless you're willing to apply
not-yet-mainstreamed patches.  It's planned, but not there yet.  The
roadmap says it'll happen after raid5/6 are introduced (they have been,
but aren't really finished yet, including power-loss recovery, etc), so
I'm guessing 3.12 at the earliest, as I think 3.11 is still focused on
raid5/6 completion.

3) Btrfs raid1 mode also provides a second source for its data integrity
feature, such that if one copy's checksum doesn't verify, it'll try the
other one.  Unfortunately #2 means there's only that single fallback to
try, but that's still better than most filesystems, which have no data
integrity checking at all, or, if they do, no fallback when it fails.

The combination of #2 and #3 was a bitter pill for me a year ago, when I
was still running on aging spinning rust, and thus didn't trust
two-copy-only redundancy.  I really like the data integrity feature, but
just a single backup copy was a great disappointment since I didn't
trust my old hardware, and unfortunately two-copy-max remains the case
for so-called raid1.  (Raid5/6 mode apparently introduces N-way copies
or some such, but as I said, it's not complete yet and is EXPECTED to
eat data.  N-way mirroring will build on that and is on the horizon, but
it has been on the horizon, without seeming to get much closer, for over
a year now...)  Fortunately for me, my budget is in far better shape
this year, and with the two new SSDs I purchased, plus spinning rust
still for backup, I trust my hardware enough now to run the two-way-only
mirroring that btrfs calls raid1 mode.

4) As mentioned above in the btrfs intro paragraph, btrfs, being a
filesystem, actually knows what data is real data, and what is safely
left untracked and thus unsynced.  Thus, the
read-data-in-before-writing-it problem will be rather smaller, certainly
on freshly formatted disks where most existing data WILL be garbage/zeros
(trimmed if on SSD, as mkfs.btrfs issues a trim command for the entire
filesystem range before it creates the superblocks, etc, so empty space
really /is/ zeroed).  Similarly with "slack space" that's not currently
used but was used previously, as the filesystem ages -- btrfs knows that
it can ignore that data too, and thus won't have to read it in to update
the parity when writing to a raid5/6-mode btrfs.

5) There are various other nice btrfs features, and a few caveats as
well, but barring anything btrfs-raid-related I've forgotten, they're
out of scope for this thread, which is, after all, about raid, so I'll
skip discussing them here.


So bottom line, I really recommend md/raid1 for now -- unless you want
to go md/raid10 with three-way mirroring on the raid1 side.  AFAIK
that's doable with 5 devices, but it's simpler, certainly conceptually
simpler (which can make a real difference to the admin trying to work
with it), with 6.

If the data simply won't fit on the 5-way raid1 and you want to keep at
least 2-device-loss protection, consider splitting it up: raid1 with
three devices for the first half, then either get a sixth device to do
the same with the second half, or go raid1 with two devices and put your
less critical data on the second set.  Or do the raid10-with-5-devices
thing, but I'll admit that while I've read that it's possible, I don't
really conceptually understand it myself, and haven't tried it, so I
have no personal opinion or experience to offer on that.  In that case I
really would try to scrape up the money for a sixth device if possible,
and do raid10 with 3-way redundancy and 2-way striping across the six,
simply because it's easier to conceptualize and thus to properly
administer.
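
(Rough sizing for that six-device suggestion, three copies of every
block striped across six devices -- the n3 layout, if I recall mdadm's
raid10 layout syntax correctly.  Back-of-the-envelope only:

devices, copies = 6, 3
usable_device_units = devices / copies    # 2.0 devices' worth of space
guaranteed_loss_tolerance = copies - 1    # any 2 devices can die

print(usable_device_units, guaranteed_loss_tolerance)   # 2.0 2

So you'd get about two devices' worth of space and survive any two
failures, potentially more if the failed devices don't all hold the same
block's copies.)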

--
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
