Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:

> Does anyone know of info on how the starting sector number might
> impact RAID performance under Gentoo? The drives are WD-500G RE3
> drives shown here:
>
> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top
>
> These are NOT 4k sector sized drives.
>
> Specifically I'm running a 5-drive RAID6 for about 1.45TB of storage.
> My benchmarking seems abysmal at around 40MB/s using dd copying large
> files. It's higher, around 80MB/s if the file being transferred is
> coming from an SSD, but even 80MB/s seems slow to me. I see a LOT of
> wait time in top. And my 'large file' copies might not be large
> enough as the machine has 24GB of DRAM and I've only been copying
> 21GB, so it's possible some of that is cached.

I /suspect/ that the problem isn't striping, tho that can be a factor,
but rather your choice of raid6. Note that I personally ran md/raid6
here for a while, so I know a bit of what I'm talking about. I didn't
realize the full implications of what I was setting up originally, or
I'd not have chosen raid6 in the first place, but live and learn as
they say, and that I did.

General rule: raid6 is abysmal for writing and gets dramatically worse
as fragmentation sets in, tho reading is reasonable. The reason is
that in order to properly parity-check and write out
less-than-full-stripe writes, the system must effectively read in the
existing data and merge it with the new data, then recalculate the
parity, before writing the new data AND 100% of the (two-way in raid6)
parity. Further, because raid sits below the filesystem level, it
knows nothing about what parts of the filesystem are actually used,
and must read and write the FULL data stripe (perhaps minus the new
data bit, I'm not sure), including parts that will be empty on a
freshly formatted filesystem.

So with 4k chunk sizes on a 5-device raid6, you'd have 20k stripes:
12k of data across three devices, and 8k of parity across the other
two devices. Now you go to write a 1k file, but in order to do so the
full 12k of existing data must be read in, even on an empty
filesystem, because the RAID doesn't know it's empty! Then the new
data must be merged in and new parity calculated, then the full 20k
must be written back out: certainly the 8k of parity, but also likely
the full 12k of data even if most of it is simply a rewrite, and
almost certainly at least the 4k strip on the device the new data
lands on.

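Just to make the write amplification concrete, here's a rough
back-of-the-envelope sketch (Python, purely illustrative; the chunk
size and device count are the numbers from the example above, and real
md has shortcuts this worst-case model ignores):

  # Worst-case I/O for a sub-stripe write on a 5-device raid6,
  # vs the same small write on raid1.  Illustrative only.

  CHUNK_KB = 4    # per-device chunk size from the example
  DEVICES = 5     # 5-drive array
  PARITY = 2      # raid6 keeps two parity chunks per stripe

  def raid6_small_write(new_data_kb):
      """Worst-case I/O to commit a sub-stripe write of new_data_kb."""
      stripe_data_kb = (DEVICES - PARITY) * CHUNK_KB   # 12k of data
      assert new_data_kb <= stripe_data_kb             # sub-stripe only
      read_kb = stripe_data_kb          # existing data read back in
      write_kb = DEVICES * CHUNK_KB     # 20k: data plus both parities
      return read_kb, write_kb

  def raid1_small_write(new_data_kb, mirrors=5):
      """raid1 just writes the block to every mirror; nothing to read."""
      return 0, new_data_kb * mirrors   # 1k hits each spindle once

  print(raid6_small_write(1))   # -> (12, 20): 32k of I/O for a 1k write
  print(raid1_small_write(1))   # -> (0, 5):   1k written per mirror
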
As I said, that gets much worse as a filesystem ages, because
fragmentation means a write more often lands on, say, 3 partial
stripes instead of a single whole stripe. That's what proper stride
size, etc, can help with, if the filesystem's reasonably fragmentation
resistant, but even then filesystem aging certainly won't /help/.

Reads, meanwhile, are of reasonable speed (in normal non-degraded
mode), because on a raid6 the data is at least two-way striped (that's
on a 4-device raid; your 5-device array would have three-way striped
data, the other two devices being parity of course), so you do get a
moderate striped-read bonus.

Then there's all that parity information, available and written out at
every write, but it's not actually used to check the reliability of
the data in normal operation, only to reconstruct if a device or two
goes missing.

On a well laid out system, I/O to the separate drives at least
shouldn't interfere with each other, assuming SATA and a chipset and
bus layout that can handle them in parallel. That's not /that/ big a
feat on today's hardware, at least as long as you're still doing
"spinning rust", since the mechanical drive latency is almost
certainly the bottleneck there, and at least that can be parallelized
to a reasonable degree across the individual drives.

What I ultimately came to realize here is that unless the job at hand
is nearly 100% read on the raid, and with the caveat that you have
enough space, raid1 is almost certainly at least as good if not a
better choice. If you have the devices to support it, you can go for
raid10/50/60, and a raid10 across 5 devices is certainly possible with
mdraid, but a straight raid6... you're generally better off with an
N-way raid1, for a couple of reasons.

First, md/raid1 is surprisingly, even astoundingly, good at scheduling
reads across multiple tasks. Any time there are multiple read tasks
going on (like during boot), raid1 works really well, with the
scheduler distributing tasks among the available devices, thus
minimizing seek latency. So take a 5-device raid1: you can very likely
accomplish at least 5, and possibly 6 or even 7, read jobs in, say,
110-120% of the time it would take to do just the longest one on a
single device, and almost certainly well before a single device could
have done the two longest read jobs. This also works if there's a
single task alternating reads of N different files/directories, since
the scheduler will again distribute jobs among the devices. Say one
device's head stays over the directory information, while another goes
to read the first file, a second reads another file, etc. The heads
stay where they are until they're needed elsewhere, so the more
devices you have in the raid1, the more likely it is that another read
from the same location still has a head right over it and can simply
read the data as the correct portion of the disk spins underneath,
instead of first seeking to the correct spot.

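If it helps, here's a toy model of that claim. The job durations are
made up and the real md read balancer is much smarter than a greedy
least-loaded assignment, but it shows why several jobs can finish in
little more than the time of the longest one:

  # Toy illustration: K read jobs spread across an N-way raid1 vs
  # queued one after another on a single spindle.  Numbers invented.

  job_seconds = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # 7 read jobs
  mirrors = 5

  single_device = sum(job_seconds)          # serial: 4.9s

  loads = [0.0] * mirrors                   # greedy least-loaded mirror
  for job in sorted(job_seconds, reverse=True):
      loads[loads.index(min(loads))] += job
  raid1_wall = max(loads)                   # 1.1s here

  print(f"single device: {single_device:.1f}s")
  print(f"5-way raid1  : {raid1_wall:.1f}s "
        f"(longest single job {max(job_seconds):.1f}s)")
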
It's worth pointing out that in the case of parallel-job read access,
due to this parallel read-scheduling, md/raid1 can often best raid0
performance, despite raid0's technically better single-job thruput
numbers. This was something I learned by experience as well; it makes
sense, but I had TOTALLY not realized or calculated for it in my
original setup, as I was running raid0 for things like the gentoo
ebuild tree and the kernel sources, since I didn't need redundancy for
them. My raid0 performance there was rather disappointing, because
portage tree updates, dep calculation, and the kernel build process
don't benefit much from thruput, which is what raid0 optimizes for,
but rather from parallel I/O, where raid1 shines, especially for
reads.

Second, md/raid1 writes, because they happen in parallel with the
bottleneck being the spinning rust, basically occur at the speed of
the slowest disk. So you don't get N-way parallel write speed, just
single-disk speed, but it's still *WAY* *WAY* better than raid6, which
has to read in the existing data and do the merge before it can write
back out. **THAT'S THE RAID6 PERFORMANCE KILLER**, or at least it was
for me, effectively giving you half-device-speed writes because in too
many cases the data must be read in before it can be written. Raid1
doesn't have that problem -- it doesn't get a write-performance
multiplier from the N devices, but at least it doesn't get device
performance cut in half like raid5/6 does.

Third, the read-scheduling benefits of #1 help, to a lesser extent,
with large same-raid1 copies as well. Consider: the first block must
be read by one device, then written to all of them at the new
location. The second similarly, then the third, etc. But with proper
scheduling, an N-way raid1 doing an N-block copy has done only N+1
operations on each device at the end of that copy. IOW, given the
memory to use as a buffer, the reads can be done in parallel, reading
N blocks in at once, one from each device, then the writes, one block
at a time to all devices. So a 5-way raid1 will have done 6 jobs on
each of the 5 devices at the end, 1 read and 5 writes, to copy 5
blocks. (In actuality, due to read-ahead I think it's optimally 64k
per device, 16 4k blocks on each, 320k total, but that's well within
the usual minimal 2MB drive buffer size, and the drive will probably
do that on its own if both read- and write-caching are on, given
scheduling that forces a cache-flush only at the end, not multiple
times in the middle. So all the kernel has to do is be sure it's not
interfering by forcing untimely flushes, and the drives should
optimize on their own.)

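Spelled out as arithmetic (illustrative only; the kernel doesn't
literally schedule it this way):

  # Per-device operation count for copying N blocks within an N-way
  # raid1, under the idealized scheduling described above.

  def raid1_copy_ops_per_device(mirrors):
      reads = 1          # each device reads one distinct block in parallel
      writes = mirrors   # every block then gets written to every mirror
      return reads + writes

  print(raid1_copy_ops_per_device(5))   # -> 6 ops per device for 5 blocks
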
Fourth, back to the parity. Remember, raid5/6 has all that parity that
it writes out (but basically never reads in normal mode, only when
degraded, in order to reconstruct the data from the missing
device(s)), yet it doesn't actually use it for integrity checking. So
while raid1 doesn't have the benefit of that parity data, it's not
like raid5/6 used it anyway, and an N-way raid1 means even MORE
missing-device protection, since you can lose all but one device and
keep right on going as if nothing happened. So a 5-way raid1 can lose
4 devices, not just the two devices of a 5-way raid6 or the single
device of a raid5. Yes, there's the loss of parity/integrity data with
raid1, BUT RAID5/6 DOESN'T USE THAT DATA FOR INTEGRITY CHECKING
ANYWAY, ONLY FOR RECONSTRUCTION IN THE CASE OF DEVICE LOSS! So the
N-way raid1 is far more redundant, since you have N copies of the
data, not one copy plus
two-way-parity-that's-never-used-except-for-reconstruction.

Fifth, in the event of device loss, a raid1 continues to function at
normal speed, because it's simply an N-way copy with a bit of extra
metadata to keep track of how many ways it's mirrored. Of course
you'll lose the read-parallelization benefit that the missing device
provided, and you'll lose its redundancy, but in general, performance
remains pretty much the same no matter how many ways it's raid1
mirrored. Contrast that with raid5/6, whose read performance is
SEVERELY impacted by device loss, since it must then reconstruct the
data using the parity data, not simply read it from somewhere else,
which is what raid1 does.

The single downside to raid1 as opposed to raid5/6 is the loss of the
extra space made available by the data striping: 3*single-device-space
in the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space
in the case of raid1. Otherwise, no contest, hands down, raid1 over
raid6.

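In concrete numbers, assuming five 500GB drives like the ones in the
original post (and ignoring superblock/metadata overhead):

  # Rough usable capacity of five 500GB devices under the layouts
  # discussed; overhead ignored.

  DEVICE_GB, N = 500, 5

  layouts = {
      "raid6 (2 parity)":     (N - 2) * DEVICE_GB,  # ~1500GB, any 2 can die
      "raid5 (1 parity)":     (N - 1) * DEVICE_GB,  # ~2000GB, any 1 can die
      "raid1 (N-way mirror)": DEVICE_GB,            # ~500GB, N-1 can die
  }

  for name, usable in layouts.items():
      print(f"{name:22s} ~{usable}GB usable")
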
IOW, you're seeing now exactly why raid6, and to a lesser extent
raid5, have such terrible performance (as opposed to reliability)
reputations. Really, unless you simply don't have the space to make it
raid1, I **STRONGLY** urge you to try that instead. I know I was very
happily surprised by the results I got, and only then realized what
all the negativity I'd seen around raid5/6 had been about, as I really
hadn't understood it at all when I was doing my original research.

Meanwhile, Rich0 already brought up btrfs, which really does promise a
better solution to many of these issues than md/raid, in part due to
what is arguably a "layering violation", but which really DOES allow
for some serious optimizations in the multiple-drive case: as a
filesystem, it DOES know what's real data and what's empty space that
isn't worth worrying about, and unlike raid5/6 parity, it really DOES
care about data integrity, not just rebuilding in case of device
failure.

So several points on btrfs:

1) It's still in heavy development. The base single-device filesystem
case works reasonably well now and is /almost/ stable, tho I'd still
urge people to keep good backups as it's simply not time-tested and
settled, and won't be for at least a few more kernels as they're still
busy with the other features. Second-level raid0/raid1/raid10 support
is at an intermediate stage: primary development and initial bug
testing and fixing is done, but they're still working on bugs that
people doing only traditional single-device filesystems simply don't
have to worry about. Third-round raid5/6 is still very new, introduced
as VERY experimental only with 3.9 IIRC, and is currently EXPECTED to
eat data in power-loss or crash events, so it's ONLY good for
preliminary testing at this point.

Thus, if you're using btrfs at all, keep good backups, and keep
current, even -rc (if not live-git), on the kernel, because there
really are fixes in every single kernel for very real corner-case
problems they are still coming across. But single-device is
/relatively/ stable now, so provided you keep good *TESTED* backups
and are willing and able to use them if it comes to it, and keep
current on the kernel, go for that. I'm personally running dual-device
raid1 mode across two SSDs, at that second-stage deployment level. I
tried that (but still on spinning rust) a year ago and decided btrfs
simply wasn't ready for me yet, so it has come quite a way in the last
year. But raid5/6 mode is still fresh third-tier development, which
I'd not consider usable until at LEAST 3.11 and probably 3.12 or later
(maybe a year from now, since it's less mature than raid1 was at this
point last year, but should mature a bit faster).

Takeaway: If you don't have a backup you're prepared to use, you
shouldn't even be THINKING about btrfs at this point, no matter WHAT
type of deployment you're considering. If you do, you're probably
reasonably safe with traditional single-device btrfs, and
intermediately risky/safe with raid0/1/10, but don't even think about
raid5/6 for real deployment yet, period.

2) RAID levels work QUITE a bit differently on btrfs. In particular,
what btrfs calls raid1 mode (with the same applying to raid10) is
simply two-way mirroring, NO MATTER THE NUMBER OF DEVICES. There's no
multi-way mirroring available yet, unless you're willing to apply
not-yet-mainstreamed patches. It's planned, but not yet applied. The
roadmap says it'll happen after raid5/6 are introduced (they have
been, but aren't yet really finished, including power-loss recovery,
etc), so I'm guessing 3.12 at the earliest, as I think 3.11 is still
focused on raid5/6 completion.

3) Btrfs raid1 mode is used to provide a second source for its data
integrity feature as well, such that if one copy's checksum doesn't
verify, it'll try the other one. Unfortunately #2 means there's only
the single fallback to try, but that's better than most filesystems,
which have no data integrity checking at all, or if they do have it,
no fallback when it fails.

The combination of #2 and #3 was a bitter pill for me a year ago, when
I was still running on aging spinning rust and thus didn't trust
two-copy-only redundancy. I really like the data integrity feature,
but just a single backup copy was a great disappointment since I
didn't trust my old hardware, and unfortunately two-copy-max remains
the case for so-called raid1. (Raid5/6 mode apparently introduces
N-way copies or some such, but as I said, it's not complete yet and is
EXPECTED to eat data. N-way mirroring will build on that and is on the
horizon, but it has been on the horizon, not seeming to get much
closer, for over a year now...) Fortunately for me, my budget is in
far better shape this year, and with the dual new SSDs I purchased,
and spinning rust still for backup, I trust my hardware enough now to
run the two-way-only mirroring that btrfs calls raid1 mode.

4) As mentioned above in the btrfs intro paragraph, btrfs, being a
filesystem, actually knows what data is actual data, and what is
safely left untracked and thus unsynced. Thus, the
read-data-in-before-writing-it problem will be rather less of an
issue, certainly on freshly formatted disks where most existing data
WILL be garbage/zeros (trimmed if on SSD, as mkfs.btrfs issues a trim
command for the entire filesystem range before it creates the
superblocks, etc, so empty space really /is/ zeroed). Similarly with
"slack space" that's not currently used but was used previously, as
the filesystem ages -- btrfs knows that it can ignore that data too,
and thus won't have to read it in to update the parity when writing to
a raid5/6 mode btrfs.

5) There are various other nice btrfs features, and a few caveats as
well, but with the exception of anything btrfs-raid-related that I've
totally forgotten about, they're out of scope for this thread, which
is, after all, about raid, so I'll skip discussing them here.

So bottom line, I really recommend md/raid1 for now, unless you want
to go md/raid10 with three-way mirroring on the raid1 side. AFAIK
that's doable with 5 devices, but it's simpler, certainly conceptually
simpler (which can make a difference to an admin trying to work with
it), with 6.

If the data simply won't fit on the 5-way raid1 and you want to keep
at least 2-device-loss protection, consider splitting it up: raid1
with three devices for the first half, then either get a sixth device
to do the same with the second half, or go raid1 with two devices and
put your less critical data on the second set. Or do the
raid10-with-5-devices thing, but I'll admit that while I've read that
it's possible, I don't really conceptually understand it myself, and
haven't tried it, so I have no personal opinion or experience to offer
on that. But in that case I really would try to scrape up the money
for a sixth device if possible, and do raid10 with 3-way redundancy
and 2-way striping across the six, simply because it's easier to
conceptualize and thus to properly administer.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman