Gentoo Archives: gentoo-dev

From:	Jaco Kroon <jaco@××××××.za>
To:	gentoo-dev@l.g.o, Richard Yao <ryao@g.o>
Subject:	Re: [gentoo-dev] New distfile mirror layout
Date:	Tue, 22 Oct 2019 06:52:50
Message-Id:	`73f461e5-d224-6aec-48be-f7e0cf8e077f@uls.co.za`
In Reply to:	Re: [gentoo-dev] New distfile mirror layout by Richard Yao

1	Hi All,
2
3
4	On 2019/10/21 18:42, Richard Yao wrote:
5	>
6	> If we consider the access frequency, it might actually not be that bad. Consider a simple example with 500 files and two directory buckets. If we have 250 in each, then the size of the directory is always 250. However, if 50 files are accessed 90% of the time, then putting 450 into one directory and that 50 into another directory, we end up with the performance of the O(n) directory lookup being consistent with there being only 90 files in each directory.
7	>
8	> I am not sure if we should be discarding all other considerations to make changes to benefit O(n) directory lookup filesystems, but if we are, then the hashing approach is not necessarily the best one. It is only the best when all files are accessed with equal frequency, which would be an incorrect assumption. A more human friendly approach might still be better. I doubt that we have the data to determine that though.
9	>
10	> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.
11
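The arithmetic in the quoted 500-file example can be checked with a small sketch. It assumes that the cost of an O(n) lookup is proportional to the number of entries in the directory that is hit; the function name is made up for illustration and comes from neither mail.

```python
def expected_entries_scanned(buckets):
    """buckets: list of (files_in_directory, fraction_of_accesses).

    Returns the access-weighted average directory size, assuming an O(n)
    lookup whose cost is proportional to the directory's entry count.
    """
    total = sum(fraction for _, fraction in buckets)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(size * fraction for size, fraction in buckets)


# Even split: two directories of 250 files, each taking half the accesses.
even_split = expected_entries_scanned([(250, 0.5), (250, 0.5)])

# Skewed split: the 50 hot files (90% of accesses) get their own directory.
skewed_split = expected_entries_scanned([(50, 0.9), (450, 0.1)])

print(even_split, skewed_split)
```

Under that assumption the skewed split behaves like a directory of about 90 entries (0.9 * 50 + 0.1 * 450), which is the figure given in the quote.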
12
13	Experience:
14
15	ext4 sucks at targeted name lookups without the dir_index feature (O(n)
16	lookups - it scans all entries in the folder). With dir_index, readdir
17	performance is crap. Pick your poison I guess. On most of our larger
18	filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
19	disabling dir_index, as the benefit is outweighed by the crappy readdir()
20	and glob() performance.
21
22	There doesn't seem to be a specific tip-over point, and it seems to
23	depend a lot on RAM availability and hard drive speed (obviously). So if
24	dentries get cached, disk speed becomes less of an issue. However, on
25	large folders (where I typically use 10k entries as the threshold for
26	"large", based on "gut feeling" and "unquantifiable experience" and
27	"nothing scientific at all") I find that even with lots of RAM two
28	consecutive ls commands remain terribly slow. Switch off dir_index and
29	that becomes an order of magnitude faster.
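
A rough way to put numbers on the "two consecutive ls" observation is to time two back-to-back directory listings. The sketch below is an illustration only, not the method used above: `time_listings` is a hypothetical helper, and `os.scandir` only reads the directory entries, whereas `ls` also sorts them and may stat each one, so absolute numbers will differ.

```python
import os
import time


def time_listings(path, repeats=2):
    """Time `repeats` consecutive full listings of `path`.

    Returns (entry_count, [seconds_per_listing, ...]).
    """
    timings = []
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        with os.scandir(path) as entries:
            count = sum(1 for _entry in entries)
        timings.append(time.perf_counter() - start)
    return count, timings


# Example: list the current directory twice and compare the two timings.
count, timings = time_listings(".")
print(count, timings)
```

If the second listing is not clearly faster than the first on a large directory, caching is not helping much, which matches the behaviour described above.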
30
31	I don't have a great deal of experience with XFS, but on those systems
32	where we do use it, it's generally on a VM, and perceivably (again, not
33	scientific) our experience has been that it feels slower. That's
34	perception, not measurement.
35
36	I'm in support of the change. This will bucket to 256 folders and
37	should have a reasonably even split between folders. If required, a
38	second layer could be introduced by using the 3rd and 4th digits of the
39	hash. Any hash should be fine; it really doesn't
40	need to be cryptographically strong, it just needs to provide a good
41	spread and be really fast. Generally a hash table should have a prime
42	number of buckets to counter hash bias, but frankly, that's
43	overcomplicating the situation here.
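
A minimal sketch of the bucketing idea, assuming the bucket is taken from the first two hex digits of a hash of the file name (which gives 256 folders) and an optional second layer from the 3rd and 4th digits. SHA-256 is used only because it ships with Python; nothing here says which hash the mirrors would use, and `bucket_path` and the generated file names are hypothetical.

```python
import hashlib
from collections import Counter


def bucket_path(filename, levels=1):
    """Return 'xx/filename' (levels=1) or 'xx/yy/filename' (levels=2).

    Each level uses the next two hex digits of the hash of the file name.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join(parts + [filename])


# Made-up distfile names, generated only to look at the spread over buckets.
names = [f"example-{i}.tar.gz" for i in range(25600)]
spread = Counter(bucket_path(name).split("/")[0] for name in names)

# Number of buckets used, and the smallest and largest bucket sizes.
print(len(spread), min(spread.values()), max(spread.values()))
```

With an even spread, each of the 256 buckets should end up holding roughly 100 of the 25600 names.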
44
45	I also agree with others that it used to be easy to get distfiles as and
46	when needed, so an alternative structure could mirror that of the
47	portage tree itself, in other words "cat/pkg/distfile". This perhaps
48	just shifts the issue:
49
50	jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name
51	"*-*" | wc -l
52	167
53	jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
54	19412
55	jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i
56	-maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
57	347 net-misc
58	373 media-sound
59	395 media-libs
60	399 dev-util
61	505 dev-libs
62	528 dev-java
63	684 dev-haskell
64	690 dev-ruby
65	1601 dev-perl
66	1889 dev-python
67
68	So that's an average of 116 subfolders under the top layer (only two
69	over 1000), and then presumably fewer than 100 distfiles maximum per
70	package? Probably overkill, but it would (should) solve both the
71	too-many-files-per-folder issue and the easy-lookup-by-hand issue.
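
The same per-category count can be sketched in Python, assuming categories are the top-level directories with a hyphen in their name (as in net-misc or dev-python). Both function names are hypothetical. Unlike the find commands, which also count each starting directory, this counts only the subdirectories, so its figures would come out one lower per category.

```python
import os


def subdir_counts(root):
    """Map each top-level directory containing '-' to its subdirectory count."""
    counts = {}
    with os.scandir(root) as categories:
        for category in categories:
            if category.is_dir() and "-" in category.name:
                with os.scandir(category.path) as packages:
                    counts[category.name] = sum(1 for p in packages if p.is_dir())
    return counts


def average(counts):
    """Mean number of subdirectories per category (0.0 if there are none)."""
    return sum(counts.values()) / len(counts) if counts else 0.0


# The average quoted above, from the two totals in the transcript.
print(round(19412 / 167))  # 116
```

On a real tree, `average(subdir_counts("/usr/portage"))` would give the per-category mean directly.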
72
73	I don't have a preference for either solution, though I do agree that
74	"easy finding of distfiles" is handy. The INDEX mechanism is fine for me.
75
76	Kind Regards,
77
78	Jaco

Replies

Subject	Author
Re: [gentoo-dev] New distfile mirror layout	Ulrich Mueller <ulm@g.o>
ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout	Richard Yao <ryao@g.o>
