On Sun, Feb 7, 2010 at 11:39 AM, Willie Wong <wwong@××××××××××××××.edu> wrote:
> On Sun, Feb 07, 2010 at 08:27:46AM -0800, Mark Knecht wrote:
>> <QUOTE>
>> 4KB physical sectors: KNOW WHAT YOU'RE DOING!
>>
>> Pros: Quiet, cool-running, big cache
>>
>> Cons: The 4KB physical sectors are a problem waiting to happen. If you
>> misalign your partitions, disk performance can suffer. I ran
>> benchmarks in Linux using a number of filesystems, and I found that
>> with most filesystems, read performance and write performance with
>> large files didn't suffer with misaligned partitions, but writes of
>> many small files (unpacking a Linux kernel archive) could take several
>> times as long with misaligned partitions as with aligned partitions.
>> WD's advice about who needs to be concerned is overly simplistic,
>> IMHO, and it's flat-out wrong for Linux, although it's probably
>> accurate for 90% of buyers (those who run Windows or Mac OS and use
>> their standard partitioning tools). If you're not part of that 90%,
>> though, and if you don't fully understand this new technology and how
>> to handle it, buy a drive with conventional 512-byte sectors!
>> </QUOTE>
>>
>> Now, I don't mind getting a bit dirty learning to use this
>> correctly, but I'm wondering what that means in a practical sense.
>> Reading the mke2fs man page, the word 'sector' doesn't come up. It's my
>> understanding that Linux 'blocks' are groups of sectors. True? If the
>> disk must use 4K sectors then what - the smallest block has to be 4K
>> and I'm using 1 sector per block? It seems that ext3 doesn't support
>> anything larger than 4K?
>
> The problem is not when you are making the filesystem with mke2fs, but
> when you partition the disk using fdisk. I'm sure I am making some
> small mistakes in the explanation below, but it goes something like
> this:
>
> a) The hard drive with 4K sectors allows the head to efficiently
> read/write 4K-sized blocks at a time.
> b) However, to be compatible in hardware, the hard drive allows
> 512B-sized blocks to be addressed. In reality, this means that you can
> individually address the 8 512B-sized chunks of a 4K-sized block,
> but each will count as a separate operation. To illustrate: say the
> hardware has some sector X of size 4K. It has 8 addressable slots
> inside, X1 ... X8, each of size 512B. If your OS clusters reads/writes
> at the 512B level, it will send 8 commands to read the info in those 8
> chunks separately. If your OS clusters at 4K, it will send one
> command. So in the naive analysis I give here, it will take 8 times
> as long for the 512B addressing to read the same data, since it will
> take 8 passes, each time inefficiently reading only 1/8 of the
> data required. Now in reality, drives are smarter than that: if all 8
> of those commands are sent in sequence, sometimes the drive will
> cluster them together into one read.
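The 8-to-1 command-count arithmetic above can be sketched as a toy model (illustrative Python only, not anything a real OS or drive firmware does - real systems batch and cache far more cleverly):

```python
PHYSICAL_SECTOR = 4096  # bytes the head reads/writes in one pass
LOGICAL_SECTOR = 512    # bytes per addressable chunk (legacy compatibility)

def commands_needed(total_bytes, cluster_size):
    """Commands issued if the OS clusters its I/O at cluster_size bytes."""
    return total_bytes // cluster_size

# Reading one 4K physical sector:
print(commands_needed(PHYSICAL_SECTOR, LOGICAL_SECTOR))   # 8 commands at 512B
print(commands_needed(PHYSICAL_SECTOR, PHYSICAL_SECTOR))  # 1 command at 4K
```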
> c) A problem occurs, however, when your OS deals with 4K clusters
> but, when you made the partition, the partition was offset! Imagine
> the physical sectors of your disk looking like
>
> AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD
>
> but when you made your partitions, somehow you partitioned it as
>
> ....YYYYYYYYZZZZZZZZWWWWWWWW....
>
> This is possible because the drive allows addressing by 512B chunks.
> So for some reason one of your partitions starts halfway inside a
> physical sector. What is the problem with this? Now suppose your OS
> sends data to be written to the ZZZZZZZZ block. If it were completely
> aligned, the drive would just move the head to the block and
> overwrite it with this information. But since half of the block is
> over the BBBB physical sector, and half over CCCC, what the disk now
> needs to do is to
>
> pass 1) read BBBBBBBB
> pass 2) modify the second half of BBBB to match the first half of ZZZZ
> pass 3) write BBBBBBBB
> pass 4) read CCCCCCCC
> pass 5) modify the first half of CCCC to match the second half of ZZZZ
> pass 6) write CCCCCCCC
>
> This is what is known as a read-modify-write operation. Thus the disk
> becomes a lot less efficient.
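The alignment condition itself is simple arithmetic: a partition whose start, counted in 512B logical sectors, is a multiple of 8 begins on a 4K physical boundary. A minimal sketch, using the two sector numbers that come up later in this thread:

```python
SECTORS_PER_PHYSICAL = 4096 // 512  # 8 logical sectors per 4K physical sector

def is_aligned(start_sector):
    """True if a partition starting at this 512B sector sits on a 4K boundary."""
    return start_sector % SECTORS_PER_PHYSICAL == 0

print(is_aligned(63))  # False - the classic fdisk default start, misaligned
print(is_aligned(64))  # True  - shifted by one logical sector, aligned
```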
>
> ----------
>
> Now, I don't know if this is the actual problem causing your
> performance problems. But it may be. When you use fdisk, it
> defaults to aligning partitions to cylinder boundaries, and uses the
> default (from ancient times) value of 63 x (512B-sized) sectors per
> track. Since 63 is not evenly divisible by 8, you can see that quite
> likely some of your partitions are not aligned to the physical sector
> boundaries.
>
> If you use cfdisk, you can try to change the geometry with the
> command g. Or you can use the command u to change the units used in
> the partitioning to either sectors or megabytes, and make sure your
> partition boundaries fall on a multiple of 8 in the former, or an
> integer in the latter.
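To see why the old CHS defaults almost guarantee misalignment: with the legacy fake geometry of 255 heads and 63 sectors/track (the same geometry fdisk reports further down), a cylinder is 255 x 63 = 16065 logical sectors, which is not a multiple of 8, so cylinder-aligned partition starts keep drifting off the 4K boundaries. A quick check:

```python
HEADS, SECTORS_PER_TRACK = 255, 63           # legacy fake geometry fdisk assumes
SECTORS_PER_CYL = HEADS * SECTORS_PER_TRACK  # 16065 logical sectors per cylinder

# Starts of the first partition (sector 63) and the next few
# cylinder-boundary starts, in 512B logical sectors:
starts = [63] + [n * SECTORS_PER_CYL for n in range(1, 4)]
for s in starts:
    print(s, "aligned" if s % 8 == 0 else "misaligned")  # all misaligned
```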
>
> Again, take what I wrote with a grain of salt: this information came
> from the research I did a little while back after reading the slashdot
> article on this 4K switch. Since it is my own understanding, it may not
> be completely correct.
>
> HTH,
>
> W
> --
> Willie W. Wong                          wwong@××××××××××××××.edu
> Data aequatione quotcunque fluentes quantitae involvente fluxiones invenire
> et vice versa ~~~ I. Newton
>

Hi Willie,
   OK - it turns out that if I start fdisk with the -u option it shows
me sector numbers. Looking at the original partition, created using
default values, the starting sector was 63 - probably about the
worst value it could be. As a test I blew away that partition and
created a new one starting at 64 instead, and the untar results are
vastly improved - down to roughly 20 seconds from 8-10 minutes. That's
roughly twice as fast as the old 120GB SATA2 drive I was using to test
the system out while I debugged this issue.

There's still some variability, but there are probably other things
running on the box - screen savers and stuff - that account for some
of that.

I'm still a little fuzzy about what happens to the extra sectors at
the end of a track. Are they used, so that I pay a little bit of
overhead reading data off of them, or are they ignored, so that I lose
capacity? I think it must be the former, as my partition isn't all that
much less than 1TB.

Again, many thanks to you and Volker for pointing this issue out.

Cheers,
Mark

gandalf TestMount # fdisk -u /dev/sdb

The number of cylinders for this disk is set to 121601.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x67929f10

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1              64  1953525167   976762552   83  Linux

Command (m for help): q

gandalf TestMount # df -H
Filesystem             Size   Used  Avail Use% Mounted on
/dev/sda3              110G   8.6G    96G   9% /
udev                    11M   177k    11M   2% /dev
shm                    2.0G      0   2.0G   0% /dev/shm
/dev/sdb1              985G   210M   935G   1% /mnt/TestMount
gandalf TestMount #


gandalf TestMount # mkdir usr
gandalf TestMount # time tar xjf /portage-latest.tar.bz2 -C /mnt/TestMount/usr

real    0m23.275s
user    0m8.614s
sys     0m2.644s
gandalf TestMount # time rm -rf /mnt/TestMount/usr/

real    0m3.720s
user    0m0.118s
sys     0m1.822s
gandalf TestMount # mkdir usr
gandalf TestMount # time tar xjf /portage-latest.tar.bz2 -C /mnt/TestMount/usr

real    0m13.828s
user    0m8.911s
sys     0m2.653s
gandalf TestMount # time rm -rf /mnt/TestMount/usr/

real    0m19.718s
user    0m0.128s
sys     0m2.025s
gandalf TestMount # mkdir usr
gandalf TestMount # time tar xjf /portage-latest.tar.bz2 -C /mnt/TestMount/usr

real    0m25.777s
user    0m8.579s
sys     0m2.660s
gandalf TestMount # time rm -rf /mnt/TestMount/usr/

real    0m2.564s
user    0m0.112s
sys     0m1.805s
gandalf TestMount #