Gentoo Archives: gentoo-dev

From: Ed Grimm <paranoid@××××××××××××××××××××××.org>
To: gentoo-dev@l.g.o
Subject: [gentoo-dev] Package sub-categories, directory performance, and benchmarking
Date: Mon, 08 Nov 2004 04:26:50
Message-Id: Pine.LNX.4.60.0411080401050.5623@mbeq.rq.iarg
In Reply to: Re: [gentoo-dev] [LARGE MESSAGE] Media-sound reorganization! by Carsten Lohrke
1 On Sun, 7 Nov 2004, Carsten Lohrke wrote:
2
3 > On Sunday 07 November 2004 22:39, Stuart Herbert wrote:
4 >> If Portage supporting arbitrary-depth category trees, then we could
5 >> organise things a lot easier. But until that happens, devs are
6 >> going to have to accept the need for more directories in
7 >> /usr/portage.
8 >
9 > I don't think an arbitrary depths would be so helpful. Most likely
10 > it'd slowdown portage. How about flatten the whole beast!? The
11 > categorization hasn't to be done via directories.
12
13 Whyever would flat-tree be better than arbitrary-depth?
14
15 When I started being more dilligent about reading the gentoo mailing
16 lists, I saw a number of threads on the topic of adding sub-categories,
17 and the only consistent reason that was given for not moving forward
18 was, "we need to benchmark that."
19
20 Initially, I saw this as a good thing, because while I already knew the
21 answer, performing tests to verify what one suspects tends to be a good
22 thing. However, as time went on, I continued not seeing the benchmarks.
23 After a while, I became disheartened. But there were other things I was
24 wanting to focus on, so I didn't get invovled.
25
26
27 However, knowing the answer to the directory performance question, I
28 could not let this comment alone.
29
30 I've attached a benchmark script, written in perl, which will find all
31 of the files in the specified directory tree(s), and then randomly
32 selects [count] files (where count is either specified by the --count
33 option, or 10,000), and reads the first line of each of these files.
34 This script can be utilized to benchmark any directory layout methods
35 that people wish to consider for Gentoo.
36
37 What they will find is: for ext2 and ext3 systems, there is an optimal
38 number of files per directory, performance falls linearly beyond this
39 point; for reiserfs systems, it doesn't matter.
40
41 I performed my own tests with this script; doing a split at the most
42 obvious point (the - in the category names), I received marginally
43 *improved* performance - Gentoo is already slightly over the optimal
44 number of files in /usr/portage. (Don't get me started on dev-perl.)
45
46 More specifically, my average time for 10,000 random file reads in
47 /usr/portage (by changing to /usr/portage and using '.' as the argument
48 to benchaccess) was a little over 60 seconds, although as more tests were
49 performed, the Linux file cache started optimizing that result. My
50 average time for the split categories, on the other hand, averaged at 55
51 seconds.
52
53
54 Some people may wonder why this is - after all, to access multiple
55 directory trees is clearly a lot more work. This may be true for a
56 human, but the computer doesn't see it that way - under ext2 and ext3,
57 it has to read all of the filenames in the directory, until it finds the
58 one you're looking for. Having fewer files, but more directories, means
59 that it gets to the file at each level much quicker. Each of the
60 directory changes adds some time, but that's negligable compared to the
61 time it takes to read through a large directory. Note that this, of
62 course, assumes that sanity is maintained, and we don't have many
63 categories or subcategories with fewer than a dozen packages and/or
64 subcategories.
65
66 Another way to think of it, it's similar to the difference between
67 searches on an unsorted array, and searches on a sorted array.
68
69 Anyone who wonders why reiserfs does not have an issue with either
70 layout does not know what reiserfs is - it was designed specifically to
71 avoid this problem.
72
73 Ed

Attachments

File name MIME type
benchaccess text/plain

Replies