Gentoo Archives: gentoo-user

From: Mark Knecht <markknecht@×××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] 1-Terabyte drives - 4K sector sizes? -> bar performance so far
Date: Sun, 07 Feb 2010 22:08:48
Message-Id: 5bdc1c8b1002071342v6c81cf13gde7bcef72be5017b@mail.gmail.com
In Reply to: Re: [gentoo-user] 1-Terabyte drives - 4K sector sizes? -> bar performance so far by Willie Wong
1 On Sun, Feb 7, 2010 at 11:39 AM, Willie Wong <wwong@××××××××××××××.edu> wrote:
2 > On Sun, Feb 07, 2010 at 08:27:46AM -0800, Mark Knecht wrote:
3 >> <QUOTE>
4 >> 4KB physical sectors: KNOW WHAT YOU'RE DOING!
5 >>
6 >> Pros: Quiet, cool-running, big cache
7 >>
8 >> Cons: The 4KB physical sectors are a problem waiting to happen. If you
9 >> misalign your partitions, disk performance can suffer. I ran
10 >> benchmarks in Linux using a number of filesystems, and I found that
11 >> with most filesystems, read performance and write performance with
12 >> large files didn't suffer with misaligned partitions, but writes of
13 >> many small files (unpacking a Linux kernel archive) could take several
14 >> times as long with misaligned partitions as with aligned partitions.
15 >> WD's advice about who needs to be concerned is overly simplistic,
16 >> IMHO, and it's flat-out wrong for Linux, although it's probably
17 >> accurate for 90% of buyers (those who run Windows or Mac OS and use
18 >> their standard partitioning tools). If you're not part of that 90%,
19 >> though, and if you don't fully understand this new technology and how
20 >> to handle it, buy a drive with conventional 512-byte sectors!
21 >> </QUOTE>
22 >>
23 >>    Now, I don't mind getting a bit dirty learning to use this
24 >> correctly but I'm wondering what that means in a practical sense.
25 >> Reading the mke2fs man page the word 'sector' doesn't come up. It's my
26 >> understanding the Linux 'blocks' are groups of sectors. True? If the
27 >> disk must use 4K sectors then what - the smallest block has to be 4K
28 >> and I'm using 1 sector per block? It seems that ext3 doesn't support
29 >> anything larger than 4K?
30 >
31 > The problem is not when you are making the filesystem with mke2fs, but
32 > when you partitioned the disk using fdisk. I'm sure I am making some
33 > small mistakes in the explanation below, but it goes something like
34 > this:
35 >
36 > a) The harddrive with 4K sectors allows the head to efficiently
37 > read/write 4K sized blocks at a time.
38 > b) However, to be compatible in hardware, the harddrive allows 512B
39 > sized blocks to be addressed. In reality, this means that you can
40 > individually address the 8 512B-sized chunks of the 4K sized blocks,
41 > but each will count as a separate operation. To illustrate: say the
42 > hardware has some sector X of size 4K. It has 8 addressable slots
43 > inside X1 ... X8 each of size 512B. If your OS clusters read/writes on
44 > the 512B level, it will send 8 commands to read the info in those 8
45 > blocks separately. If your OS clusters in 4K, it will send one
46 > command. So in the stupid analysis I give here, it will take 8 times
47 > as long for the 512B addressing to read the same data, since it will
48 > take 8 passes, and each time inefficiently reading only 1/8 of the
49 > data required. Now in reality, drives are smarter than that: if all 8
50 > of those are sent in sequence, sometimes the drives will cluster them
51 > together in one read.
52 > c) A problem occurs, however, when your OS deals with 4K clusters but
53 > when you make the partition, the partition is offset! Imagine the
54 > physical read sectors of your disk looking like
55 >
56 > AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD
57 >
58 > but when you make your partitions, somehow you partitioned it
59 >
60 > ....YYYYYYYYZZZZZZZZWWWWWWWW....
61 >
62 > This is possible because the drive allows addressing by 512B chunks.
63 > So for some reason one of your partitions starts halfway inside a
64 > physical sector. What is the problem with this? Now suppose your OS
65 > sends data to be written to the ZZZZZZZZ block. If it were completely
66 > aligned, the drive would just move the head to the block and
67 > overwrite it with this information. But since half of the block is
68 > over the BBBB physical sector, and half over CCCC, what the disk now
69 > needs to do is to
70 >
71 > pass 1) read BBBBBBBB
72 > pass 2) modify the second half of BBBB to match the first half of ZZZZ
73 > pass 3) write BBBBBBBB
74 > pass 4) read CCCCCCCC
75 > pass 5) modify the first half of CCCC to match the second half of ZZZZ
76 > pass 6) write CCCCCCCC
77 >
78 > Or what is known as a read-modify-write operation. Thus the disk
79 > becomes a lot less efficient.
80 >
81 > ----------
82 >
83 > Now, I don't know if this is the actual problem causing your
84 > performance problems. But this may be it. When you use fdisk, it
85 > defaults to aligning the partition to cylinder boundaries, and uses the
86 > default (from ancient times) value of 63 x (512B sized) sectors per
87 > track. Since 63 is not evenly divisible by 8, you see that quite
88 > likely some of your partitions are not aligned to the physical sector
89 > boundaries.
90 >
91 > If you use cfdisk, you can try to change the geometry with the command
92 > g. Or you can use the command u to change the units used in the
93 > partitioning to either sectors or megabytes, and make sure your
94 > partition start offsets and sizes are multiples of 8 in the former, or
95 > integers in the latter.
96 >
97 > Again, take what I wrote with a grain of salt: this information came
98 > from the research I did a little while back after reading the slashdot
99 > article on this 4K switch. So being my own understanding, it may not
100 > completely be correct.
101 >
102 > HTH,
103 >
104 > W
105 > --
106 > Willie W. Wong                                     wwong@××××××××××××××.edu
107 > Data aequatione quotcunque fluentes quantitae involvente fluxiones invenire
108 >         et vice versa   ~~~  I. Newton
109 >
110
111 Hi Willie,
112 OK - it turns out if I start fdisk using the -u option it shows me
113 sector numbers. The original partition, created using just the
114 default values, started at sector 63 - probably about the
115 worst value it could be. As a test I blew away that partition and
116 created a new one starting at 64 instead and the untar results are
117 vastly improved - down to roughly 20 seconds from 8-10 minutes. That's
118 roughly twice as fast as the old 120GB SATA2 drive I was using to test
119 the system out while I debugged this issue.
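
A rough sketch of why sector 63 is such a bad start and 64 a good one, assuming 512-byte addressable sectors, 4096-byte physical sectors and a 4096-byte filesystem block size. The function name and constants are illustrative only, not taken from any tool:

```python
LOGICAL = 512                 # bytes per addressable (logical) sector
PHYSICAL = 4096               # bytes per physical sector (assumed)
RATIO = PHYSICAL // LOGICAL   # 8 logical sectors per physical sector


def physical_sectors_touched(partition_start, block_index, block_size=4096):
    """Return the physical sector numbers covered by one filesystem block."""
    sectors_per_block = block_size // LOGICAL
    first_lba = partition_start + block_index * sectors_per_block
    last_lba = first_lba + sectors_per_block - 1
    return list(range(first_lba // RATIO, last_lba // RATIO + 1))


for start in (63, 64):
    touched = physical_sectors_touched(start, block_index=0)
    kind = "read-modify-write" if len(touched) > 1 else "single write"
    print(f"start={start}: physical sectors {touched} -> {kind}")
```

With a start of 63 every 4K block straddles two physical sectors, so each write turns into two partial updates; with a start of 64 each block maps onto exactly one physical sector.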
120
121 There's still some variability but there's probably other things
122 running on the box - screen savers and stuff - that account for some
123 of that.
124
125 I'm still a little fuzzy about what happens to the extra sectors at
126 the end of a track. Are they used and I pay for a little bit of
127 overhead reading data off of them or are they ignored and I lose
128 capacity? I think it must be the former as my partition isn't all that
129 much less than 1TB.
130
131 Again, many thanks to you and Volker for pointing this issue out.
132
133 Cheers,
134 Mark
135
136 gandalf TestMount # fdisk -u /dev/sdb
137
138 The number of cylinders for this disk is set to 121601.
139 There is nothing wrong with that, but this is larger than 1024,
140 and could in certain setups cause problems with:
141 1) software that runs at boot time (e.g., old versions of LILO)
142 2) booting and partitioning software from other OSs
143 (e.g., DOS FDISK, OS/2 FDISK)
144
145 Command (m for help): p
146
147 Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
148 255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
149 Units = sectors of 1 * 512 = 512 bytes
150 Disk identifier: 0x67929f10
151
152 Device Boot Start End Blocks Id System
153 /dev/sdb1 64 1953525167 976762552 83 Linux
154
155 Command (m for help): q
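
The numbers in the fdisk listing above can be cross-checked with a little arithmetic. This is only a sketch: it assumes 8 logical sectors per physical sector (4096-byte physical sectors), and the helper name is made up for illustration:

```python
SECTOR_BYTES = 512
SECTORS_PER_PHYSICAL = 8  # assumption: 4096-byte physical sectors


def partition_summary(start, end):
    """Summarize a partition given its first and last 512-byte sector."""
    count = end - start + 1
    return {
        "aligned": start % SECTORS_PER_PHYSICAL == 0,
        "sectors": count,
        "bytes": count * SECTOR_BYTES,
        "blocks_1k": count * SECTOR_BYTES // 1024,  # fdisk "Blocks" column
    }


# Start and End values copied from the /dev/sdb1 line of the listing.
summary = partition_summary(64, 1953525167)
print(summary)
```

The computed 1K block count agrees with the Blocks column (976762552), and the start sector is a multiple of 8.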
156
157 gandalf TestMount # df -H
158 Filesystem Size Used Avail Use% Mounted on
159 /dev/sda3 110G 8.6G 96G 9% /
160 udev 11M 177k 11M 2% /dev
161 shm 2.0G 0 2.0G 0% /dev/shm
162 /dev/sdb1 985G 210M 935G 1% /mnt/TestMount
163 gandalf TestMount #
164
165
166
167 gandalf TestMount # mkdir usr
168 gandalf TestMount # time tar xjf /portage-latest.tar.bz2 -C /mnt/TestMount/usr
169
170 real 0m23.275s
171 user 0m8.614s
172 sys 0m2.644s
173 gandalf TestMount # time rm -rf /mnt/TestMount/usr/
174
175 real 0m3.720s
176 user 0m0.118s
177 sys 0m1.822s
178 gandalf TestMount # mkdir usr
179 gandalf TestMount # time tar xjf /portage-latest.tar.bz2 -C /mnt/TestMount/usr
180
181 real 0m13.828s
182 user 0m8.911s
183 sys 0m2.653s
184 gandalf TestMount # time rm -rf /mnt/TestMount/usr/
185
186 real 0m19.718s
187 user 0m0.128s
188 sys 0m2.025s
189 gandalf TestMount # mkdir usr
190 gandalf TestMount # time tar xjf /portage-latest.tar.bz2 -C /mnt/TestMount/usr
191
192 real 0m25.777s
193 user 0m8.579s
194 sys 0m2.660s
195 gandalf TestMount # time rm -rf /mnt/TestMount/usr/
196
197 real 0m2.564s
198 user 0m0.112s
199 sys 0m1.805s
200 gandalf TestMount #
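
For reference, the three untar and rm runs above can be summarized as follows. The timings are the "real" values from the transcript; the 8-10 minute figure for the misaligned partition is the rough estimate given earlier in this mail, not a measured value from this session:

```python
from statistics import mean

# "real" times in seconds, copied from the transcript above
untar_seconds = [23.275, 13.828, 25.777]
rm_seconds = [3.720, 19.718, 2.564]

untar_mean = mean(untar_seconds)
rm_mean = mean(rm_seconds)

# Rough speedup against the earlier 8-10 minute untar on the misaligned partition
speedup_low = 8 * 60 / untar_mean
speedup_high = 10 * 60 / untar_mean

print(f"untar: min {min(untar_seconds):.1f}s, mean {untar_mean:.1f}s, max {max(untar_seconds):.1f}s")
print(f"rm:    min {min(rm_seconds):.1f}s, mean {rm_mean:.1f}s, max {max(rm_seconds):.1f}s")
print(f"approximate speedup: {speedup_low:.0f}x to {speedup_high:.0f}x")
```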
