1 |
On Sun, 7 Nov 2004, Carsten Lohrke wrote: |
2 |
|
3 |
> On Sunday 07 November 2004 22:39, Stuart Herbert wrote: |
4 |
>> If Portage supporting arbitrary-depth category trees, then we could |
5 |
>> organise things a lot easier. But until that happens, devs are |
6 |
>> going to have to accept the need for more directories in |
7 |
>> /usr/portage. |
8 |
> |
9 |
> I don't think an arbitrary depths would be so helpful. Most likely |
10 |
> it'd slowdown portage. How about flatten the whole beast!? The |
11 |
> categorization hasn't to be done via directories. |
12 |
|
13 |
Whyever would flat-tree be better than arbitrary-depth? |
14 |
|
15 |
When I started being more dilligent about reading the gentoo mailing |
16 |
lists, I saw a number of threads on the topic of adding sub-categories, |
17 |
and the only consistent reason that was given for not moving forward |
18 |
was, "we need to benchmark that." |
19 |
|
20 |
Initially, I saw this as a good thing, because while I already knew the |
21 |
answer, performing tests to verify what one suspects tends to be a good |
22 |
thing. However, as time went on, I continued not seeing the benchmarks. |
23 |
After a while, I became disheartened. But there were other things I was |
24 |
wanting to focus on, so I didn't get invovled. |
25 |
|
26 |
|
27 |
However, knowing the answer to the directory performance question, I |
28 |
could not let this comment alone. |
29 |
|
30 |
I've attached a benchmark script, written in perl, which will find all |
31 |
of the files in the specified directory tree(s), and then randomly |
32 |
selects [count] files (where count is either specified by the --count |
33 |
option, or 10,000), and reads the first line of each of these files. |
34 |
This script can be utilized to benchmark any directory layout methods |
35 |
that people wish to consider for Gentoo. |
36 |
|
37 |
What they will find is: for ext2 and ext3 systems, there is an optimal |
38 |
number of files per directory, performance falls linearly beyond this |
39 |
point; for reiserfs systems, it doesn't matter. |
40 |
|
41 |
I performed my own tests with this script; doing a split at the most |
42 |
obvious point (the - in the category names), I received marginally |
43 |
*improved* performance - Gentoo is already slightly over the optimal |
44 |
number of files in /usr/portage. (Don't get me started on dev-perl.) |
45 |
|
46 |
More specifically, my average time for 10,000 random file reads in |
47 |
/usr/portage (by changing to /usr/portage and using '.' as the argument |
48 |
to benchaccess) was a little over 60 seconds, although as more tests were |
49 |
performed, the Linux file cache started optimizing that result. My |
50 |
average time for the split categories, on the other hand, averaged at 55 |
51 |
seconds. |
52 |
|
53 |
|
54 |
Some people may wonder why this is - after all, to access multiple |
55 |
directory trees is clearly a lot more work. This may be true for a |
56 |
human, but the computer doesn't see it that way - under ext2 and ext3, |
57 |
it has to read all of the filenames in the directory, until it finds the |
58 |
one you're looking for. Having fewer files, but more directories, means |
59 |
that it gets to the file at each level much quicker. Each of the |
60 |
directory changes adds some time, but that's negligable compared to the |
61 |
time it takes to read through a large directory. Note that this, of |
62 |
course, assumes that sanity is maintained, and we don't have many |
63 |
categories or subcategories with fewer than a dozen packages and/or |
64 |
subcategories. |
65 |
|
66 |
Another way to think of it, it's similar to the difference between |
67 |
searches on an unsorted array, and searches on a sorted array. |
68 |
|
69 |
Anyone who wonders why reiserfs does not have an issue with either |
70 |
layout does not know what reiserfs is - it was designed specifically to |
71 |
avoid this problem. |
72 |
|
73 |
Ed |