Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted:

> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@×××.net> wrote:
> <SNIP>
>
> Wonderful post but much too long to carry on a conversation
> in-line.

FWIW... I'd have a hard time doing much of anything else, these days, no
matter the size. Otherwise, I'd be likely to forget a point. But I do
try to snip or summarize when possible. And I do understand your choice
and agree with it for you. It's just not one I'd find workable for me...
which is why I'm back to inline, here.

> As you sound pretty sure of your understanding/history I'll
> assume you're right 100% of the time, but only maybe 80% of the post
> feels right to me at this time so let's assume I have much to learn and
> go from there.

That's a very nice way of saying "I'll have to verify that before I can
fully agree, but we'll go with it for now." I'll have to remember it!
=:^)

> In thinking about this issue this morning I think it's important to
> me to get down to basics and verify as much as possible, step-by-step,
> so that I don't layer good work on top of bad assumptions.

Extremely reasonable approach. =:^)

> Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB
> DDR3 + Core i7-980x Extreme 12 core processor

That's a very impressive base. But as you point out elsewhere, you use
it. Multiple VMs running MS should well use both the dozen cores and
the 24 gig RAM.

As an aside, it's interesting how well your dozen cores, 24 gig RAM, fits
my basic two gigs a core rule of thumb. Obviously I'd consider that
reasonably well balanced RAM/cores-wise.

> 1 SDD - 120GB SATA3 on it's own controller
> 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives
> using Intel integrated controllers
>
> (NOTE: I can possibly go to a 6-drive RAID if I made some changes in the
> box but that's for later)
>
> According to the WD spec
> (http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives

OK, single 120 gig main drive (SSD), 5 half-TB drives for the raid.

> [...] sustain 113MB/S to the drive. Using hdparm I measure 107MB/S
> or higher for all 5 drives [...]
> The SDD on it's own PCI Express controller clocks in at about 250MB/S
> for reads.

OK.

But there's a caveat on the measured "spinning rust" speeds. You're
effectively getting "near best case".

I suppose you're familiar with absolute velocity vs rotational velocity
vs distance from center. Think merry-go-round as a kid or crack-the-whip
as a teen (or insert your own experience here). The closer to the center
you are the slower you go at the same rotational speed (RPM).
Conversely, the farther from the center you are, the faster you're
actually moving at the same RPM.

Rotational disk data I/O rates have a similar effect -- data toward the
outside edge of the platter (beginning of the disk) is faster to read/
write, while data toward the inside edge (center) is slower.

Based on my own hdparm tests on partitioned drives where I knew the
location of the partition, vs. the results for the drive as a whole, the
speed reported for rotational drives as a whole is the speed near the
outside edge (beginning of the disk).

Thus, it'd be rather interesting to partition up one of those drives with
a small partition at the beginning and another at the end, and do an
hdparm -t of each, as well as of the whole disk. I bet you'd find the
one at the end reports rather lower numbers, while the report for the
drive as a whole is similar to that of the partition near the beginning
of the drive, much faster.
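
Something like this, say (a rough sketch only -- sdX and the partition
numbers are hypothetical, and it assumes you've carved a small partition
at each end of a scratch drive first):

hdparm -t /dev/sdX    # whole drive, typically reports near-outer-edge speed
hdparm -t /dev/sdX1   # small partition at the beginning (outer edge)
hdparm -t /dev/sdX2   # small partition at the end (inner edge)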

A good SSD won't have this same sort of variance, since it's SSD and the
latency to any of its flash, at least as presented by the firmware (which
should deal with any variance as it distributes wear), should be similar.
(Cheap SSDs and standard USB thumbdrive flash storage work differently,
however. Often they assume FAT and have a small amount of fast and
resilient but expensive SLC flash at the beginning, where the FAT would
be, with the rest of the device much slower and less resilient to rewrite
but far cheaper MLC. I was just reading about this recently as I
researched my own SSDs.)

> TESTING: I'm using dd to test. It gives an easy to read anyway result
> and seems to be used a lot. I can use bonnie++ or IOzone later but I
> don't think that's necessary quite yet.

Agreed.

> Being that I have 24GB and don't
> want cached data to effect the test speeds I do the following:
>
> 1) Using dd I created a 50GB file for copying using the following
> commands:
>
> cd /mnt/fastVM
> dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]

It'd be interesting to see what the reported speed is here... See below
for more.

> 2) To ensure that nothing is cached and the copies are (hopefully)
> completely fair as root I do the following between each test:
>
> sync
> free -h
> echo 3 > /proc/sys/vm/drop_caches
> free -h

Good job. =:^)

> 3) As a first test I copy using dd the 50GB file from the SDD to the
> RAID6.

OK, that answered the question I had about where that file you created
actually was -- on the SSD.

> As long as reading the SDD is much faster than writing the RAID6
> then it should be a test of primarily the RAID6 write speed:
>
> dd if=/mnt/fastVM/random1 of=SDDCopy
> 97656250+0 records in 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s

> If I clear cache as above and rerun the test it's always 145-155MB/S

... Assuming $PWD is now on the raid. You had the path shown too, which
I snipped, but that doesn't tell /me/ (as opposed to you, who should know
based on your mounts) anything about whether it's on the raid or not.
However, the above including the drop-caches demonstrates enough care
that I'm quite confident you'd not make /that/ mistake.

> 4) As a second test I read from the RAID6 and write back to the RAID6.
> I see MUCH lower speeds, again repeatable:
>
> dd if=SDDCopy of=HDDWrite
> 97656250+0 records in 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s

> 5) As a final test, and just looking for problems if any, I do an SDD to
> SDD copy which clocked in at close to 200MB/S
>
> dd if=random1 of=SDDCopy
> 97656250+0 records in 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s

> So, being that this RAID6 was grown yesterday from something that
> has existed for a year or two I'm not sure of it's fragmentation, or
> even how to determine that at this time. However it seems my problem are
> RAID6 reads, not RAID6 writes, at least to new an probably never used
> disk space.

Reading all that, one question occurs to me. If you want to test read
and write separately, why the intermediate step of dd-ing from
/dev/random to ssd, then from ssd to raid or ssd?

Why not do direct dd if=/dev/random (or urandom, see note below)
of=/desired/target ... for write tests, and then (after dropping caches),
if=/desired/target of=/dev/null ... for read tests? That way there's
just the one block device involved, not both.

/dev/random note: I presume with that hardware you have one of the newer
CPUs with the new Intel hardware random instruction, with the appropriate
kernel config hooking it into /dev/random, and/or otherwise have
/dev/random hooked up to a hardware random number generator. Otherwise,
requesting that much random data could block until more suitably random
data is generated from approved kernel sources. Thus, the following
probably doesn't apply to you, but it may well apply to others, and is
good practice in any case, unless you KNOW your random isn't going to
block due to hardware generation, and even then it's worth noting that
when you're posting examples like the above.

In general, for tests such as this where a LOT of random data is needed,
but cryptographic-quality random isn't necessarily required, use
/dev/urandom. In the event that the pool of real random data runs too
low, /dev/urandom will switch to pseudo-random generation, which should
be "good enough" for this sort of usage. /dev/random, OTOH, will block
until it gets more random data from sources the kernel trusts to be truly
random. So on machines with relatively limited sources of randomness the
kernel considers truly random, just grabbing 50 GB of data from
/dev/random could take QUITE some time (days maybe? I don't know).

Obviously you don't have /too/ big a problem with it as you got the data
from /dev/random, but it's worth noting. If your machine has a hardware
random generator hooked into /dev/random, then /dev/urandom will never
switch to pseudo-random in any case, so for tests of anything above
/kilobytes/ of random data (and even at that...), just use urandom and
you won't have to worry about it either way. OTOH, if you're generating
an SSH key or something, always use /dev/random as that needs
cryptographic-security-level randomness, but that'll take just a few
bytes of randomness, not kilobytes let alone gigabytes, and if your
hardware doesn't have good randomness and it does block, wiggling the
mouse around a bit (this obviously assumes a local session; a remote one
would need some other source of input) should give it enough randomness
to continue.
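
(If you're ever unsure whether /dev/random is about to block, the kernel
exposes its current entropy estimate -- just a quick check, nothing
specific to your setup:

cat /proc/sys/kernel/random/entropy_avail

A few thousand bits and small reads are fine; down in the double digits
and you can expect blocking.)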
207 |
|
208 |
|
209 |
Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as |
210 |
sink, with only the test-target block device as a real block device, |
211 |
should give you "purer" read-only and write-only tests. In theory it |
212 |
shouldn't matter much given your method of testing, but as we all know, |
213 |
theory and reality aren't always well aligned. |
214 |
|
215 |
|
216 |
Of course the next question follows on from the above. I see a write to |
217 |
the raid, and a copy from the raid to the raid, so read/write on the |
218 |
raid, and a copy from the ssd to the ssd, read/write on it, but no test |
219 |
of from the raid read. |
220 |
|
221 |
So |
222 |
|
223 |
if=/dev/urandom of=/mnt/raid/target ... should give you raid write. |
224 |
|
225 |
drop-caches |
226 |
|
227 |
if=/mnt/raid/target of=/dev/null ... should give you raid read. |
228 |
|
229 |
*THEN* we have good numbers on both to compare the raid read/write to. |
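
Spelled out, something like this (a sketch only -- /mnt/raid and the
10GB test size are placeholder assumptions; use your actual mount point
and whatever size comfortably exceeds what can be cached):

dd if=/dev/urandom of=/mnt/raid/writetest bs=1M count=10000  # write-only
sync
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/raid/writetest of=/dev/null bs=1M                 # read-only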

What I suspect you'll find, unless fragmentation IS your problem, is that
both read (from the raid) alone and write (to the raid) alone should be
much faster than read/write (from/to the raid).

The problem with read/write is that you're on "rotating rust" hardware
and there's some latency as it repositions the heads from the read
location to the write location and back.

If I'm correct and that's what you find, a workaround specific to dd
would be to specify a much larger block size, so it reads in far more
data at once, then writes it out at once, with far fewer switches between
modes. In the above you didn't specify bs (or the separate input/output
equivalents, ibs/obs respectively) at all, so it's using the 512-byte
blocksize default.

From what I know of hardware, 64KB is a standard read-ahead, so in theory
you should see improvements using larger block sizes up to at LEAST that
size, and on a 5-disk raid6, probably 3X that, 192KB, which should in
theory do a full 64KB buffer on each of the three data drives of the 5-
way raid6 (the other two being parity).

I'm guessing you'll see a "knee" at the 192 KB (that's 1024-based, not
1000-based, BTW) block size, and above that you might see improvement,
but not near as much, since the hardware should be doing the full 64KB
blocks it's optimized for. There's likely to be another knee at the 16MB
point (again, power of two, not 10), or more accurately, the 48MB point
(3*16MB), since that's the size of the device hardware buffers (again,
three devices' worth of data-stripe, since the other two are parity,
3*16MB=48MB). Above that, theory says you'll see even less improvement,
since the caches will be full and any improvement still seen should be
purely that of fewer switches between read/write mode and thus fewer
seeks.

But it'd be interesting to see how closely theory matches reality;
there's very possibly a fly in that theoretical ointment somewhere. =:^\
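
A sweep like the following would show it (again just a sketch -- the
/mnt/raid path and filenames are placeholders, reusing your existing
raid copy of the 50GB file as the source, and run as root for the
drop-caches):

for bs in 4K 64K 192K 1M 16M 48M; do
    sync
    echo 3 > /proc/sys/vm/drop_caches
    echo "bs=$bs"
    dd if=/mnt/raid/SDDCopy of=/mnt/raid/bstest bs=$bs
    rm /mnt/raid/bstest
done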

Of course configurable block size is specific to dd. Real life file
transfers may well be quite a different story. That's where the chunk
size, stripe size, etc, stuff comes in, setting the defaults for the
kernel for that device, and again, I'll freely admit to not knowing as
much as I could in that area.
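
(FWIW, if you want to line your test block sizes up with what the array
is actually using, mdadm will report the chunk size -- device name
assumed:

mdadm --detail /dev/md0 | grep -i chunk

/proc/mdstat shows it too.)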

> I will also report more later but I can state that just using top
> there's never much CPU usage doing this but a LOT of WAIT time when
> reading the RAID6. It really appears the system is spinning it's wheels
> waiting for the RAID to get data from the disk.

When you're dealing with spinning rust, any time you have a transfer of
any size (certainly GB), you WILL see high wait times. Disks are simply
SLOW. Even SSDs are no match for system memory, tho they're enough
closer to help a lot and can be close enough that the bottleneck is
elsewhere. (Modern SSDs saturate the SATA-600 links with thruput above
500 MByte/sec, making the SATA-600 bus the bottleneck, or the 1x PCI-E
2.x link if that's what it's running on, since that saturates at
485MByte/sec or so, tho PCI-E 3.x is double that so nearly a GByte/sec
and a single SATA-600 won't saturate that. Modern DDR3 SDRAM by
comparison runs 10+ GByte/sec LOW end, two orders of magnitude faster.
Numbers fresh from wikipedia, BTW.)

> One place where I wanted to double check your thinking. My thought
> is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as it
> has to read from three drives and make sure they are all good before
> returning data to the user. I don't see how that could ever be faster
> than what a single drive file system could do which for these drives
> would be the 113MB/S WD spec number, correct? As I'm currently getting
> 145MB/S it appears on the surface that the RAID6 is providing some
> value, at least in these early days of use. Maybe it will degrade over
> time though.

As someone else already posted, that's NOT correct. Neither raid1 nor
raid6, at least the mdraid implementations, verify the data. Raid1
doesn't have parity at all, just many copies, and raid6 has parity but
only uses it for rebuilds, NOT to check data integrity under normal usage
-- it too simply reads the data and returns it.
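
(If you ever do want the redundancy actually verified, mdraid can do it
on demand -- this just kicks off a background consistency check, array
name assumed:

echo check > /sys/block/md0/md/sync_action

Progress shows up in /proc/mdstat, and any inconsistencies found are
counted in /sys/block/md0/md/mismatch_cnt.)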

What raid1 does (when it's getting short reads only one at a time) is
send the request to every spindle. The first one that returns the data
wins; the others simply get their returns thrown away.

So under small-one-at-a-time reading conditions, the speed of raid1 reads
should be the speed of the fastest disk in the bunch.

The raid1 read advantage is in the fact that there's often more than one
read going on at once, or that the read is big enough to split up, so
different spindles can be seeking to and reading different parts of the
request in parallel. (This also helps in fragmented file conditions as
long as fragmentation isn't overwhelming, since a raid1 can then send
different spindle heads to read the different segments in parallel,
instead of reading one at a time serially, as it would have to do in a
single spindle case.)

In theory, the stripes of raid6 /can/ lead to better thruput for reads.
In fact, my experience both with raid6 and with raid0 demonstrates that
not to be the case as often as one might expect, due either to small
reads or to fragmentation breaking up the big reads, thus negating the
theoretical thruput advantage of multiple stripes.

To be fair, my raid0 experience was, as I mentioned earlier, with files I
could easily redownload from the net, mostly the portage tree and
overlays, along with the kernel git tree. Due to the frequency of update
and the fast rate of change as well as the small files, fragmentation was
quite a problem, and the files were small enough I likely wouldn't have
seen the full benefit of the 4-way raid0 stripes in any case, so that
wasn't a best-case test scenario. But it's what one practically puts on
raid0, because it IS easily redownloaded from the net, so it DOESN'T
matter that a loss of any of the raid0 component devices will kill the
entire thing.

If I'd been using the raid0 for much bigger media files, mp3s or video of
megabytes in size minimum, that get saved and never changed so there's
little fragmentation, I expect my raid0 experience would have been *FAR*
better. But at the same time, that's not the type of data that it
generally makes SENSE to store on a raid0 without backups or redundancy
of any sort, unless it's simply VDR files where, if a device drops from
the raid and you lose them, you don't particularly care (which would make
a GREAT raid0 candidate), so...
346 |
|
347 |
Raid6 is the stripes of raid0, plus two-way-parity. So since the parity |
348 |
is ignored for reads, for them it's effectively raid0 with two less |
349 |
stripes then the number of devices. Thus your 5-device raid6 is |
350 |
effectively a 3-device raid0 in terms of reads. In theory, thruput for |
351 |
large reads done by themselves should be pretty good -- three times that |
352 |
of a single device. In fact... due either to multiple jobs happening at |
353 |
once, or to a mix of read/write happening at once, or to fragmentation, I |
354 |
was disappointed, and far happier with raid1. |
355 |
|
356 |
But your situation is indeed rather different than mine, and depending on |
357 |
how much writing happens in those big VM files and how the filesystem you |
358 |
choose handles fragmentation, you could be rather happier with raid6 than |
359 |
I was. |
360 |
|
361 |
But I'd still suggest you try raid1 if the amount of data you're handling |
362 |
will let you. Honestly, it surprised me how well raid1 did for me. I |
363 |
wasn't prepared for that at all, and I believe that in comparison to what |
364 |
I was getting on raid6 is what colored my opinion of raid6 so badly. I |
365 |
had NO IDEA there would be that much difference! But your experience may |
366 |
indeed be different. The only way to know is to try it. |
367 |
|
368 |
However, one thing I either overlooked or that hasn't been posted yet is |
369 |
just how much data you're talking about. You're running five 500-gig |
370 |
drives in raid6 now, which should give you 3*500=1500 gigs (10-power) |
371 |
capacity. |
372 |
|
373 |
If it's under a third full, 500 MB (10-power), you can go raid1 with as |
374 |
many mirrors as you like of the five, and keep the rest of them for hot- |
375 |
spares or whatever. |
376 |
|
377 |
If you're running (or plan to be running) near capacity, over 2/3 full, 1 |
378 |
TB (10-power), you really don't have much option but raid6. |
379 |
|
380 |
If you're in between, 1/3 to 2/3 full, 500-1000 GB (10-power), then a |
381 |
raid10 is possible, perhaps 4-spindle with the 5th as a hot-spare. |
382 |
|
383 |
(A spindle configured as a hot-spare is kept unused but ready for use by |
384 |
mdadm and the kernel. If a spindle should drop out, the hot-spare is |
385 |
automatically inserted in its place and a rebuild immediately started. |
386 |
This narrows the danger zone during which you're degraded and at risk if |
387 |
further spindles drop out, because handling is automatic so you're back |
388 |
to full un-degraded as soon as possible. However, it doesn't eliminate |
389 |
that danger zone should another one drop out during the rebuild, which is |
390 |
after all quite stressful on the remaining drives since due to all that |
391 |
reading going on, so the risk is greater during a rebuild than under |
392 |
normal operation.) |
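
(Purely to illustrate the shape of such a setup -- device names are
hypothetical and this is NOT something to run against a live array:

mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      --spare-devices=1 /dev/sd[b-f]1

That's a 4-spindle raid10 with the fifth device registered as the
hot-spare from the start.)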
393 |
|
394 |
So if you're over 2/3 full, or expect to be in short order, there's |
395 |
little sense in further debate on at least /your/ raid6, as that's pretty |
396 |
much what you're stuck with. (Unless you can categorize some data as |
397 |
more important than other, and raid it, while the other can be considered |
398 |
worth the risk of loss if the device goes, in which case we're back in |
399 |
play with other options once again.) |
400 |
|
401 |
-- |
402 |
Duncan - List replies preferred. No HTML msgs. |
403 |
"Every nonfree program has a lord, a master -- |
404 |
and if you use the program, he is your master." Richard Stallman |