Gentoo Archives: gentoo-dev

From: Jason Zaman <perfinion@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
Date: Sun, 28 Jan 2018 07:01:26
Message-Id: 20180128070111.GA17078@meriadoc.perfinion.com
In Reply to: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure by "Michał Górny"
1 On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
2 > Migrating mirrors to the hashed structure
3 > -----------------------------------------
4
5 > The hard link solution allows us to save space on the master mirror.
6 > Additionally, if ``-H`` option is used by the mirrors it avoids
7 > transferring existing files again. However, this option is known
8 > to be expensive and could cause significant server load. Without it,
9 > all mirrors need to transfer a second copy of all the existing files.
10 >
11 > The symbolic link solution could be more reliable if we could rely
12 > on mirrors using the ``--links`` rsync option. Without that, symbolic
13 > links are not transferred at all.
14
15 These rsync options might help for mirrors too:
16 --compare-dest=DIR      also compare destination files relative to DIR
17 --copy-dest=DIR         ... and include copies of unchanged files
18 --link-dest=DIR         hardlink to files in DIR when unchanged
19
20 > Using hashed structure for local distfiles
21 > ------------------------------------------
22 > The hashed structure defined above could also be used for local distfile
23 > storage as used by the package manager. For this to work, the package
24 > manager authors need to ensure that:
25 >
26 > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
27 > directory where distfiles specific to the package are linked
28 > in a flat structure.
29 >
30 > b. All tools are updated to support the nested structure.
31 >
32 > c. The package manager provides a tool for users to easily manipulate
33 > distfiles, in particular to add distfiles for fetch-restricted
34 > packages into an appropriate subdirectory.
35 >
36 > For extended compatibility, the package manager may support finding
37 > distfiles in flat and nested structure simultaneously.
38
39 Trying nested first and then falling back to flat would make it easy for
40 users who have to download distfiles for fetch-restricted packages,
41 because the instructions stay as "move it to
42 /usr/portage/distfiles".
43 Alternatively, the tool could have a mode that goes through all
44 files in the base dir and moves each one to where it belongs in the nested
45 tree. Then you save everything to the same dir and run edist --fix.
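
A rough sketch of both ideas, nested-then-flat lookup and a "fix" mode. It assumes, purely for illustration, that the nested subdirectory is the first two hex digits of a SHA-512 hash of the filename; the real layout is whatever the GLEP ends up defining. The names `nested_subdir`, `find_distfile` and `fix_flat_distfiles` are made up, the last one standing in for the imagined `edist --fix`:

```python
import hashlib
import shutil
from pathlib import Path
from typing import List, Optional


def nested_subdir(filename: str) -> str:
    # ASSUMPTION: subdirectory = first two hex digits of the filename hash.
    return hashlib.sha512(filename.encode("utf-8")).hexdigest()[:2]


def find_distfile(distdir: Path, filename: str) -> Optional[Path]:
    """Look in the nested location first, then fall back to the flat one."""
    nested = distdir / nested_subdir(filename) / filename
    if nested.is_file():
        return nested
    flat = distdir / filename
    if flat.is_file():
        return flat
    return None


def fix_flat_distfiles(distdir: Path) -> List[Path]:
    """Move every regular file in the base dir into its nested subdirectory."""
    moved = []
    for entry in sorted(distdir.iterdir()):
        if not entry.is_file():
            continue  # skip the nested subdirectories themselves
        target_dir = distdir / nested_subdir(entry.name)
        target_dir.mkdir(exist_ok=True)
        target = target_dir / entry.name
        shutil.move(str(entry), str(target))
        moved.append(target)
    return moved
```

With something like this, a user can keep dropping fetch-restricted files into the base dir; the lookup still finds them, and the fix mode tidies them up later.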
46
47 > Rationale
48 > =========
49 > Algorithm for splitting distfiles
50 > ---------------------------------
51 > In the original debate that occurred in bug #534528 [#BUG534528]_,
52 > three possible solutions for splitting distfiles were listed:
53 >
54 > a. using initial portion of filename,
55 >
56 > b. using initial portion of file hash,
57 >
58 > c. using initial portion of filename hash.
59 >
60 > The significant advantage of the filename option was simplicity. With
61 > that solution, the users could easily determine the correct subdirectory
62 > themselves. However, its significant disadvantage was very uneven
63 > shuffling of data. In particular, the TeΧ Live packages alone count
64 > almost 23500 distfiles and all use a common prefix, making it impossible
65 > to split them further.
66
67 Is the filename the original upstream one or the renamed one? E.g. with
68 SRC_URI="http://foo/foo.tar -> bar.tar", will it be bar.tar?
69
70 I think I'm in favour of using the initial part of the filename anyway.
71 Sure, it's not balanced, but it's still a hell of a lot more balanced than
72 today and it's really easy.
73
74 Another thing I'm wondering is whether we can just use the same dir layout as
75 the packages themselves. That would fix texlive, since it has a whole lot
76 of separate packages, e.g. /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz
77
78 There is a problem if many packages use the same distfiles (this is quite
79 extensive for SELinux: every single one of the sec-policy/selinux-* packages
80 has identical distfiles), so I'm not sure how to deal with that.
81
82 This would also make it easy in future to make the sandbox restrict
83 access to files outside of that package if we wanted to do that.
84
85 > The alternate option of using file hash has the advantage of having
86 > a more balanced split. Furthermore, since hashes are stored
87 > in Manifests using them is zero-cost. However, this solution has two
88 > significant disadvantages:
89 >
90 > 1. The hash values are unknown for newly-downloaded distfiles, so
91 > ``repoman`` (or an equivalent tool) would have to use a temporary
92 > directory before locating the file in appropriate subdirectory.
93 >
94 > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
95 > hash mismatches would be placed in the wrong subdirectory,
96 > potentially causing confusing errors.
97
98 Not just this, but on principle, I also think you should be able to read
99 an ebuild and compute the URL to download the file from the mirrors
100 without any extra knowledge (especially downloading the distfile).
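
To illustrate the principle, here is a small sketch that derives a mirror URL from nothing but a SRC_URI entry. Everything specific in it is an assumption: `MIRROR_BASE` is a placeholder, the two-hex-digit SHA-512 filename hash is just one possible scheme, and it assumes the renamed name (the part after `->`) is the one stored on the mirrors:

```python
import hashlib

# ASSUMPTION: placeholder mirror base URL, not a real endpoint.
MIRROR_BASE = "https://mirror.example.org/distfiles"


def distfile_name(src_uri_entry: str) -> str:
    """Return the local distfile name for one SRC_URI entry.

    Handles the 'URL -> newname' rename form; otherwise the last path
    component of the URL is used.
    """
    if "->" in src_uri_entry:
        return src_uri_entry.split("->", 1)[1].strip()
    return src_uri_entry.strip().rsplit("/", 1)[-1]


def mirror_url(src_uri_entry: str) -> str:
    name = distfile_name(src_uri_entry)
    # ASSUMPTION: two hex digits of a SHA-512 filename hash pick the subdir.
    subdir = hashlib.sha512(name.encode("utf-8")).hexdigest()[:2]
    return f"{MIRROR_BASE}/{subdir}/{name}"


print(mirror_url("http://foo/foo.tar -> bar.tar"))
```

Nothing here needs the distfile contents or the Manifest, which would not be true for a content-hash layout.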
101
102 > Using filename hashes has proven to provide a similar balance
103 > to using file hashes. Furthermore, since filenames are known up front
104 > this solution does not suffer from either of the listed problems. While
105 > hashes need to be computed manually, hashing a short string should not
106 > cause any performance problems.
107 >
108 > .. figure:: glep-0075-extras/by-filename.png
109 >
110 > Distribution of distfiles by first character of filenames
111 >
112 > .. figure:: glep-0075-extras/by-csum.png
113 >
114 > Distribution of distfiles by first hex-digit of checksum
115 > (x --- content checksum, + --- filename checksum)
116 >
117 > .. figure:: glep-0075-extras/by-csum2.png
118 >
119 > Distribution of distfiles by two first hex-digits of checksum
120 > (x --- content checksum, + --- filename checksum)
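
For anyone wanting to reproduce the kind of comparison the figures show, a minimal sketch follows. The sample filenames are made up (a large group sharing one prefix plus two others), and SHA-512 of the filename is only an assumed choice of hash:

```python
import hashlib
from collections import Counter
from typing import Callable, Iterable


def by_first_char(name: str) -> str:
    return name[0].lower()


def by_filename_hash(name: str) -> str:
    # ASSUMPTION: SHA-512 of the filename, first hex digit as the bucket.
    return hashlib.sha512(name.encode("utf-8")).hexdigest()[0]


def bucket_counts(names: Iterable[str], key: Callable[[str], str]) -> Counter:
    """Count how many names land in each bucket under the given key."""
    return Counter(key(n) for n in names)


# Made-up sample: many files sharing one prefix, plus a couple of others.
names = [f"texlive-module-{i}.tar.xz" for i in range(1000)]
names += ["bash-4.4.tar.gz", "zlib-1.2.11.tar.gz"]

char_counts = bucket_counts(names, by_first_char)
hash_counts = bucket_counts(names, by_filename_hash)
print("largest bucket by first char:", max(char_counts.values()))
print("largest bucket by filename hash:", max(hash_counts.values()))
```

Running the same two keys over a real list of distfile names would give the per-bucket counts behind plots like these.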
121
122 Do you have an easy way to calculate how big the distfiles are per
123 category or cat/pkg? I'd be interested to see.
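
One way this could be tallied, as a rough sketch: Manifest files carry a `DIST <filename> <size> <hashes...>` line per distfile, so summing the size field per package directory gives a per-cat/pkg total. The function names are made up, and the sketch assumes a plain `<category>/<package>/Manifest` layout under the repo root:

```python
from pathlib import Path
from typing import Dict


def dist_sizes_from_manifest(text: str) -> Dict[str, int]:
    """Map distfile name -> size in bytes from the DIST lines of a Manifest."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # DIST lines look like: DIST <filename> <size> <HASH> <value> ...
        if len(fields) >= 3 and fields[0] == "DIST":
            sizes[fields[1]] = int(fields[2])
    return sizes


def sizes_per_package(repo: Path) -> Dict[str, int]:
    """Sum distfile sizes for every <category>/<package>/Manifest under repo."""
    totals = {}
    for manifest in sorted(repo.glob("*/*/Manifest")):
        cat_pkg = f"{manifest.parent.parent.name}/{manifest.parent.name}"
        text = manifest.read_text(encoding="utf-8")
        totals[cat_pkg] = sum(dist_sizes_from_manifest(text).values())
    return totals
```

Note that a distfile shared by several packages is counted once per package here, so the per-package totals add up to more than the real size of the distfiles directory.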
124
125 > Backwards Compatibility
126 > =======================
127 > Mirror compatibility
128 > --------------------
129 > The mirrored files are propagated to other mirrors as opaque directory
130 > structure. Therefore, there are no backwards compatibility concerns
131 > on the mirroring side.
132 >
133 > Backwards compatibility with existing clients is detailed
134 > in `migrating mirrors to the hashed structure`_ section. Backwards
135 > compatibility with the old clients will be provided by preserving
136 > the flat structure during the transitional period.
137
138 Even if there was no transition, things wouldn't be terrible, because
139 portage would fall back to just downloading from SRC_URI directly
140 if the mirrors fail.
