Gentoo Archives: gentoo-amd64

From: Mark Knecht <markknecht@×××××.com>
To: Gentoo AMD64 <gentoo-amd64@l.g.o>
Subject: Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Fri, 21 Jun 2013 17:40:55
Message-Id: CAK2H+ecNGaQ7BfAaWtLkQhg3T-pC56tejDKL6kK+qydLa8YWyg@mail.gmail.com
In Reply to: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? by Duncan <1i5t5.duncan@cox.net>
On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@×××.net> wrote:
> Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:
>
>> Does anyone know of info on how the starting sector number might
>> impact RAID performance under Gentoo? The drives are WD-500G RE3 drives
>> shown here:
>>
>> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/
> B001EMZPD0/ref=cm_cr_pr_product_top
>>
>> These are NOT 4k sector sized drives.
>>
>> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My
>> benchmarking seems abysmal at around 40MB/S using dd copying large
>> files.
>> It's higher, around 80MB/S if the file being transferred is coming from
>> an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in
>> top.
>> And my 'large file' copies might not be large enough as the machine has
>> 24GB of DRAM and I've only been copying 21GB so it's possible some of
>> that is cached.
>
> I /suspect/ that the problem isn't striping, tho that can be a factor,
> but rather, your choice of raid6. Note that I personally ran md/raid-6
> here for awhile, so I know a bit of what I'm talking about. I didn't
> realize the full implications of what I was setting up originally, or I'd
> have not chosen raid6 in the first place, but live and learn as they say,
> and that I did.
>
> General rule, raid6 is abysmal for writing and gets dramatically worse as
> fragmentation sets in, tho reading is reasonable. The reason is that in
> ordered to properly parity-check and write out less-than-full-stripe
> writes, the system must effectively read-in the existing data and merge
> it with the new data, then recalculate the parity, before writing the new
> data AND 100% of the (two-way in raid-6) parity. Further, because raid
> sits below the filesystem level, it knows nothing about what parts of the
> filesystem are actually used, and must read and write the FULL data
> stripe (perhaps minus the new data bit, I'm not sure), including parts
> that will be empty on a freshly formatted filesystem.
>
> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
> in data across three devices, and 8k of parity across the other two
> devices. Now you go to write a 1k file, but in ordered to do so the full
> 12k of existing data must be read in, even on an empty filesystem,
> because the RAID doesn't know it's empty! Then the new data must be
> merged in and new checksums created, then the full 20k must be written
> back out, certainly the 8k of parity, but also likely the full 12k of
> data even if most of it is simply rewrite, but almost certainly at least
> the 4k strip on the device the new data is written to.
>
<SNIP>


Hi Duncan,
   Wonderful post, but much too long to carry on a conversation
in-line. As you sound pretty sure of your understanding and history
I'll assume you're right 100% of the time, but only about 80% of the
post feels right to me at this point, so let's assume I have much to
learn and go from there. I expect that others here are in a similar
situation to mine - they use RAID but have little hard data on what
the different parts of the system are doing or how to optimize them.
That's certainly true in my case. I hope this thread, over the near
or longer term, will help me and potentially others.

   In thinking about this issue this morning, it seems important to
get down to basics and verify as much as possible, step by step, so
that I don't layer good work on top of bad assumptions. To that end,
before I move much farther forward, let me document a few things
about my system and the hardware available to work with, and see if
you, Rich, Bob, Volker or anyone else wants to chime in about what is
correct, what isn't, or a better way to use it.

Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB
DDR3 + Core i7-980X Extreme processor (6 cores / 12 threads)
1 SSD - 120GB SATA3 on its own controller
5+ HDD - WD5002ABYS RAID Edition 3 (SATA 3Gb/s) drives on the Intel
integrated controllers

(NOTE: I could possibly go to a 6-drive RAID if I made some changes in
the box, but that's for later.)
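
   Since the thread subject is the starting sector value, one basic
thing I can verify directly is where each RAID member partition
actually starts. A quick sketch of how I'd check it (assuming the
members are partitions like sdb1 through sdf1, which I haven't
confirmed here):

fdisk -l /dev/sdb                  # the Start column is the first sector
cat /sys/block/sdb/sdb1/start      # same number straight from sysfs

Since these are not 4k-sector drives the alignment may not matter
much, but at least it would be verified rather than assumed.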

According to the WD spec
(http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives
sustain 113 MB/s to the drive. Using hdparm I measure 107 MB/s or
higher for all 5 drives:

c2RAID6 ~ # hdparm -tT /dev/sdb

/dev/sdb:
 Timing cached reads:   17374 MB in 2.00 seconds = 8696.12 MB/sec
 Timing buffered disk reads: 322 MB in 3.00 seconds = 107.20 MB/sec
c2RAID6 ~ #
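
   One more check I haven't done yet: running hdparm on all five
drives at the same time, to see whether the Intel controller tops out
below roughly 5 x 107 MB/s. A rough sketch, assuming the RAID members
are sdb through sdf:

for d in sdb sdc sdd sde sdf; do hdparm -t /dev/$d & done; wait

If the per-drive numbers drop well below 107 MB/s when run together,
the controller is part of the story rather than the RAID level.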

The SSD, on its own PCI Express controller, clocks in at about
250 MB/s for reads:

c2RAID6 ~ # hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   17492 MB in 2.00 seconds = 8754.42 MB/sec
 Timing buffered disk reads: 760 MB in 3.00 seconds = 253.28 MB/sec
c2RAID6 ~ #


TESTING: I'm using dd to test. It gives an easy-to-read result anyway
and seems to be used a lot. I can use bonnie++ or IOzone later, but I
don't think that's necessary quite yet. Since I have 24GB of RAM and
don't want cached data to affect the test speeds, I do the following:

1) Using dd I created a 50GB file for copying using the following
commands:

cd /mnt/fastVM
dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]

mark@c2RAID6 /VirtualMachines/bonnie $ ls -alh /mnt/fastVM/ran*
-rw-r--r-- 1 mark mark 47G Jun 21 07:10 /mnt/fastVM/random1
mark@c2RAID6 /VirtualMachines/bonnie $
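
   One caveat about that dd command: with count=0 and only a seek=,
dd doesn't actually write any data, it just sets the file's length,
so as far as I understand it random1 is a sparse 47G file that reads
back as zeros despite the name. Whether that skews the copy numbers
I'm not sure, but it's easy to check by comparing apparent size to
the space actually allocated:

ls -lh /mnt/fastVM/random1     # apparent size (47G)
du -h /mnt/fastVM/random1      # allocated size; much smaller means sparse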

2) To ensure that nothing is cached and the copies are (hopefully)
completely fair, as root I do the following between each test:

sync
free -h
echo 3 > /proc/sys/vm/drop_caches
free -h

An example:

c2RAID6 ~ # sync
c2RAID6 ~ # free -h
             total       used       free     shared    buffers     cached
Mem:           23G        23G       129M         0B       8.5M        21G
-/+ buffers/cache:        1.6G       21G
Swap:          12G         0B        12G
c2RAID6 ~ # echo 3 > /proc/sys/vm/drop_caches
c2RAID6 ~ # free -h
             total       used       free     shared    buffers     cached
Mem:           23G       2.6G        20G         0B       884K       1.3G
-/+ buffers/cache:        1.3G        22G
Swap:          12G         0B        12G
c2RAID6 ~ #
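
   To avoid fat-fingering that between runs I may wrap it in a tiny
script; just a sketch (the name dropcaches.sh is mine):

#!/bin/bash
# run as root: flush dirty pages, then drop page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches
free -h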

3) As a first test I copy the 50GB file, using dd, from the SSD to
the RAID6. As long as reading the SSD is much faster than writing to
the RAID6, this should primarily be a test of the RAID6 write speed:

mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/mnt/fastVM/random1 of=SDDCopy
97656250+0 records in
97656250+0 records out
50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s
mark@c2RAID6 /VirtualMachines/bonnie $

If I clear the cache as above and rerun the test it's always 145-155 MB/s.
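
   One thing the record counts tell me: 97656250 records for 50 GB
works out to dd's default 512-byte block size. I may redo these with
a larger block size, and possibly direct I/O so the page cache can't
help at all; a sketch of what I have in mind (not what was run above):

dd if=/mnt/fastVM/random1 of=SDDCopy bs=1M oflag=direct
dd if=/mnt/fastVM/random1 of=SDDCopy bs=1M conv=fdatasync   # or: keep the cache but time the final flush too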

4) As a second test I read from the RAID6 and write back to the
RAID6. I see MUCH lower speeds, again repeatable:

mark@c2RAID6 /VirtualMachines/bonnie $ dd if=SDDCopy of=HDDWrite
97656250+0 records in
97656250+0 records out
50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s
mark@c2RAID6 /VirtualMachines/bonnie $

5) As a final test, and just looking for problems if any, I do an SSD
to SSD copy, which clocked in at close to 200 MB/s:

mark@c2RAID6 /mnt/fastVM $ dd if=random1 of=SDDCopy
97656250+0 records in
97656250+0 records out
50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s
mark@c2RAID6 /mnt/fastVM $

   So, given that this RAID6 was grown yesterday from an array that
had existed for a year or two, I'm not sure of its fragmentation, or
even how to determine that at this time. However, it seems my problem
is RAID6 reads, not RAID6 writes, at least to new and probably
never-used disk space.
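
   If anyone knows a better way I'm all ears, but what I was planning
to look at for the array geometry and per-file fragmentation is
roughly this (assuming the array is /dev/md0, which I haven't
confirmed here, and that the filesystem is one filefrag understands):

cat /proc/mdstat
mdadm --detail /dev/md0                    # chunk size, layout, member order
filefrag /VirtualMachines/bonnie/SDDCopy   # extent count for one of the test files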

   I will report more later, but I can already state that, just using
top, there's never much CPU usage during these copies but a LOT of
wait time when reading the RAID6. It really appears the system is
spinning its wheels waiting for the RAID to get data from the disks.
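
   Next time I run the RAID6-to-RAID6 copy I'll try to watch the
individual drives while it runs; something like this (iostat comes
from sys-apps/sysstat on Gentoo, if I remember the package right):

iostat -x 5     # per-device await, %util and throughput, every 5 seconds

If one member shows much higher await or %util than the others, that
would point at a single slow drive rather than the RAID level itself.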

   One place where I wanted to double-check your thinking: my thought
is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds,
as it has to read from three drives and make sure they are all good
before returning data to the user. I don't see how that could ever be
faster than what a single-drive file system could do, which for these
drives would be the 113 MB/s WD spec number, correct? As I'm currently
getting 145 MB/s, it appears on the surface that the RAID6 is
providing some value, at least in these early days of use. Maybe it
will degrade over time, though.

Comments?

Cheers,
Mark
