Gentoo Archives: gentoo-dev

From: Richard Yao <ryao@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] New distfile mirror layout
Date: Mon, 21 Oct 2019 16:42:45
Message-Id: F5C72C3C-3264-43F4-962B-5A89F0E33A8E@gentoo.org
In Reply to: Re: [gentoo-dev] New distfile mirror layout by "Michał Górny"
> On Oct 20, 2019, at 2:51 AM, Michał Górny <mgorny@g.o> wrote:
>
> On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
>>> On 10/18/2019 09:41, Michał Górny wrote:
>>> Hi, everybody.
>>>
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout. Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>>
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having 60000+ files in a single directory has been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>
>>> Throughout a transitional period (whose exact length hasn't been decided
>>> yet), both layouts will be available. Afterwards, the old layout will
>>> be removed from mirrors. This has a few implications:
>>>
>>> 1. Users who don't upgrade their package managers in time will lose
>>> the ability of fetching from Gentoo mirrors. This shouldn't be that
>>> much of a problem given that the core software needed to upgrade Portage
>>> should all have reliable upstream SRC_URIs.
>>>
>>> 2. mirror://gentoo/file URIs will stop working. While technically you
>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>> discarding its usage and moving distfiles to devspace.
>>>
>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>> a little harder. To fetch a distfile named 'foo-1.tar.gz', you'd have
>>> to use something like:
>>>
>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>> 1b
>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>> ...
>>>
>>>
>>> Alternatively, you can:
>>>
>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>
>>> and grep for the right path there. This INDEX is also a more
>>> lightweight alternative to HTML indexes generated by the servers.
>>>
>>>
>>> If you're interested in more background details and some plots, see [1].
>>>
>>> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>
>>
>> So the answer I didn't really see directly stated here is, where do new
>> distfiles need to go //now//? E.g., if on woodpecker, I currently cp a
>> distfile to /space/distfiles-local. What is the new directory I need to
>> use? And if mirror://gentoo/${FOO} is going away, for the new distfiles
>> target, what would be the applicable prefix to use?
>>
>> Directly using devspace seems like a bad idea, IMHO. Once long ago, we all
>> got chastised for doing exactly that. Too much possibility of fragmentation
>> as devs retire or package maintainership changes hands.
>
> Today you get chastised for using /space/distfiles-local and not
> following policy changes. The devmanual states that it's deprecated
> since at least 2011, and talks of using d.g.o [1].
>
>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout. I really kinda prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9). Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3). Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>>
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great. Having to do hash calculations to work
>> out the file path will be *really* annoying.
>
> Your solution still doesn't solve the problem of having 8k-24k files
> in a single directory, even if you use 7 letters of prefix. So it just
> creates a lot of tiny directory noise for no practical gain.
>
> [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts

If we consider access frequency, it might not actually be that bad. Consider a simple example with 500 files and two directory buckets. If we put 250 in each, then every lookup hits a directory of 250 entries. However, if 50 files account for 90% of accesses, then putting those 50 into one directory and the other 450 into another gives an expected directory size per lookup of 0.9 * 50 + 0.1 * 450 = 90. The O(n) directory lookup then performs, on average, as if there were only 90 files in each directory.
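
The arithmetic above can be checked with a short Python sketch. The function name and the (file count, percent of accesses) representation are illustrative assumptions, not measured mirror data; the even split assumes accesses are spread equally across both buckets.

```python
def expected_dir_size(buckets):
    """Average number of entries in the directory a lookup lands in.

    buckets: list of (num_files, access_percent) pairs, where the
    percentages are hypothetical access shares that sum to 100.
    """
    return sum(num_files * pct for num_files, pct in buckets) / 100


# Hash-style split: 250 files per directory, accesses spread evenly.
even_split = [(250, 50), (250, 50)]

# Frequency-aware split: the 50 hot files (90% of accesses) get their
# own directory; the remaining 450 files share the other one.
skewed_split = [(50, 90), (450, 10)]

print(expected_dir_size(even_split))    # 250.0
print(expected_dir_size(skewed_split))  # 90.0
```

The model only counts directory entries scanned per lookup, so it says nothing about caching or about how the hot set would be identified in practice.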

I am not sure whether we should be discarding all other considerations to benefit filesystems with O(n) directory lookup, but if we are, then the hashing approach is not necessarily the best one. It is only the best when all files are accessed with equal frequency, which would be an incorrect assumption. A more human-friendly approach might still be better. I doubt that we have the data to determine that, though.

Another idea is to use a cheap hash function (e.g. Fletcher's checksum) and have the mirrors do the hashing behind the scenes, so that users keep requesting files by name. Then we would have the best of both worlds.
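
As a rough sketch of what that could look like, here is Fletcher-16 in Python used to pick a bucket from a file name. This is only an illustration: the announced layout uses b2sum, the bucket_for helper and the choice of the high checksum byte are hypothetical, and how a mirror would hook this into request handling is not covered here.

```python
def fletcher16(data: bytes) -> int:
    """Fletcher-16 checksum: two running sums, each taken modulo 255."""
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def bucket_for(filename: str) -> str:
    """Hypothetical helper: map a distfile name to a two-hex-digit bucket.

    Uses the high byte (sum2), which depends on byte order as well as
    byte values. Because of the modulo, only 255 buckets are possible.
    """
    checksum = fletcher16(filename.encode("utf-8"))
    return f"{checksum >> 8:02x}"


# Well-known test vector: Fletcher-16 of "abcde" is 0xC8F0.
print(hex(fletcher16(b"abcde")))  # 0xc8f0
print(bucket_for("abcde"))        # c8
```

A mirror could then translate a request for a plain file name into the bucketed path internally, so the hashing stays invisible to anyone fetching by hand.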
>
> --
> Best regards,
> Michał Górny
>
