
From: Richard Yao <ryao@g.o>
To: Jaco Kroon <jaco@××××××.za>, gentoo-dev@l.g.o
Subject: ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout
Date: Wed, 23 Oct 2019 23:47:46
Message-Id: c23b3d2d-6164-d1ad-022c-560e203d7047@gentoo.org
In Reply to: Re: [gentoo-dev] New distfile mirror layout by Jaco Kroon
On 10/22/19 2:51 AM, Jaco Kroon wrote:
> Hi All,
>
>
> On 2019/10/21 18:42, Richard Yao wrote:
>>
>> If we consider the access frequency, it might actually not be that
>> bad. Consider a simple example with 500 files and two directory
>> buckets. If we have 250 in each, then the size of the directory is
>> always 250. However, if 50 files are accessed 90% of the time, then
>> putting 450 into one directory and that 50 into another directory, we
>> end up with the performance of the O(n) directory lookup being
>> consistent with there being only 90 files in each directory.
>>
>> I am not sure if we should be discarding all other considerations to
>> make changes to benefit O(n) directory lookup filesystems, but if we
>> are, then the hashing approach is not necessarily the best one. It is
>> only the best when all files are accessed with equal frequency, which
>> would be an incorrect assumption. A more human friendly approach might
>> still be better. I doubt that we have the data to determine that though.
>>
>> Also, another idea is to use a cheap hash function (e.g. fletcher) and
>> just have the mirrors do the hashing behind the scenes. Then we would
>> have the best of both worlds.
>
>
> Experience:
>
> ext4 sucks at targeted name lookups without the dir_index feature (O(n)
> lookups - scans all entries in the folder).  With dir_index, readdir
> performance is crap.  Pick your poison I guess.  On most of our larger
> filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
> disabling dir_index, as the benefit is outweighed by the crappy readdir()
> and glob() performance.
My read of the ext4 disk layout documentation is that readdir should
work mostly the same way with and without dir_index, except for a
penalty when reading larger directories, caused by the addition of the
tree's metadata and by having more partially filled blocks:

https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries

The code itself is the same traversal code:

https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106

However, a couple of things stand out to me here at a glance:

1. `cond_resched()` adds scheduler delay for no apparent reason.
`cond_resched()` is meant to be used in places where we could block
excessively on non-PREEMPT kernels, but this doesn't strike me as one of
those places. The fact that we block on disk for uncached reads naturally
serves the same purpose, so an explicit rescheduling point here is
redundant. PREEMPT kernels should perform better in readdir() on ext4 by
virtue of making `cond_resched()` a no-op.

2. Read-ahead is implemented in a way that appears to over-read the
directory whenever the needed information is not cached. This is
technically read-ahead, although it is not a great way of doing it. A
much better way would be to pipeline `readdir()` by initiating
asynchronous read operations in anticipation of future reads.

Both of these should affect both variants of ext4's directories, but the
penalties I mentioned earlier mean that the dir_index variant would be
affected more.

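To make the pipelining idea in point 2 concrete, here is a minimal
userspace sketch of the same access pattern using Linux's readahead(2):
hint the next block before consuming the current one, so disk I/O
overlaps with processing. The file name and block size are made up for
illustration; the real fix would live inside ext4, not in userspace:

/* Sketch: overlap disk I/O with consumption by hinting the next block
 * before reading the current one. File name and block size are
 * hypothetical. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK (64 * 1024)

int main(void)
{
	char *buf = malloc(BLK);
	int fd = open("some-large-file", O_RDONLY);
	off_t off = 0;
	ssize_t n;

	if (fd < 0 || !buf)
		return 1;

	readahead(fd, 0, BLK);	/* prime the pipeline */
	for (;;) {
		/* Start the *next* block's I/O now; readahead() returns
		 * immediately and the read proceeds asynchronously. */
		readahead(fd, off + BLK, BLK);

		n = pread(fd, buf, BLK, off);	/* consume current block */
		if (n <= 0)
			break;
		/* ... process buf[0..n) here ... */
		off += n;
	}

	close(fd);
	free(buf);
	return 0;
}

The kernel-side version would submit the buffer read for directory block
N+1 before blocking on block N, but the overlap is the same idea.
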
If you have a way to benchmark things, a simple idea to evaluate would
be deleting the `cond_resched()` line. If we had data showing an
improvement, I would be happy to send a small one-line patch deleting
the line to Ted to get the change into mainline.

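For reference, the change would be nothing more than removing that call
from ext4_readdir() in fs/ext4/dir.c. A sketch against v5.3 (untested,
context lines approximate):

--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
 		if (fatal_signal_pending(current)) {
 			err = -ERESTARTSYS;
 			goto errout;
 		}
-		cond_resched();
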
> There doesn't seem to be a real specific tip-over point, and it seems to
> depend a lot on RAM availability and harddrive speed (obviously).  So if
> dentries get cached, disk speed becomes less of an issue.  However, on
> large folders (where I typically use 10k as a value for large based on
> "gut feeling" and "unquantifiable experience" and "nothing scientific at
> all") I find that even with lots of RAM two consecutive ls commands
> remain terribly slow.  Switch off dir_index and that becomes an order of
> magnitude faster.
>
> I don't have a great deal of experience with XFS, but on those systems
> where we do it's generally on a VM, and perceivably (again, not
> scientific) our experience has been that it feels slower.  Again, not
> scientific, just perception.
>
> I'm in support of the change.  This will bucket to 256 folders and
> should have a reasonably even split between folders.  If required, a
> second layer could be introduced by using the 3rd and 4th digits of the
> hash.  Any hash should be fine, it really doesn't
> need to be cryptographically strong, it just needs to provide a good
> spread and be really fast.  Generally a hash table should have a prime
> number of buckets to counter hash bias, but frankly, that's over
> complicating the situation here.
>
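For what it's worth, here is a rough sketch of what that could look like
on the mirror side, using a Fletcher-16-style checksum as the stand-in
cheap hash (the hash choice and path format here are illustrative, not
what the proposal mandates):

/* Illustrative only: derive one- and two-level bucket paths for a
 * distfile from a cheap Fletcher-16 checksum of its name. */
#include <stdint.h>
#include <stdio.h>

static uint16_t fletcher16(const char *s)
{
	uint16_t sum1 = 0, sum2 = 0;

	for (; *s; s++) {
		sum1 = (sum1 + (uint8_t)*s) % 255;
		sum2 = (sum2 + sum1) % 255;
	}
	return (uint16_t)((sum2 << 8) | sum1);
}

int main(void)
{
	const char *distfile = "gcc-9.2.0.tar.xz";	/* example name */
	uint16_t h = fletcher16(distfile);

	/* First two hex digits pick one of 256 top-level buckets; the
	 * next two would form the optional second layer. */
	printf("%02x/%s\n", (h >> 8) & 0xff, distfile);
	printf("%02x/%02x/%s\n", (h >> 8) & 0xff, h & 0xff, distfile);
	return 0;
}
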
> I also agree with others that it used to be easy to get distfiles as and
> when needed, so an alternative structure could mirror that of the
> portage tree itself, in other words "cat/pkg/distfile".  This perhaps
> just shifts the issue:
>
> jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name
> "*-*" | wc -l
> 167
> jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
> 19412
> jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i
> -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
> 347 net-misc
> 373 media-sound
> 395 media-libs
> 399 dev-util
> 505 dev-libs
> 528 dev-java
> 684 dev-haskell
> 690 dev-ruby
> 1601 dev-perl
> 1889 dev-python
>
> So that's an average of 116 sub folders under the top layer (only two
> over 1000), and then presumably less than 100 distfiles maximum per
> package?  Probably overkill, but it would (should) solve both the too
> many files per folder issue and the easy lookup by hand issue.
>
> I don't have a preference on either solution though, but I do agree that
> "easy finding of distfiles" is handy.  The INDEX mechanism is fine for me.
>
> Kind Regards,
>
> Jaco
>

Attachments

File name      MIME type
signature.asc  application/pgp-signature
