1 |
Here's an updated version. I've tried to incorporate most |
2 |
of the feedback so far. |
3 |
|
4 |
|
5 |
--- |
6 |
GLEP: 75 |
7 |
Title: Split distfile mirror directory structure |
8 |
Author: Michał Górny <mgorny@g.o>, |
9 |
Robin H. Johnson <robbat2@g.o> |
10 |
Type: Standards Track |
11 |
Status: Draft |
12 |
Version: 1 |
13 |
Created: 2018-01-26 |
14 |
Last-Modified: 2018-01-27 |
15 |
Post-History: 2018-01-27 |
16 |
Content-Type: text/x-rst |
17 |
--- |
18 |
|
19 |
Abstract |
20 |
======== |
21 |
This GLEP describes the procedure for splitting the distfiles on mirrors |
22 |
into multiple directories with the goal of reducing the number of files |
23 |
in a single directory. |
24 |
|
25 |
|
26 |
Motivation |
27 |
========== |
28 |
At the moment, both the package manager and Gentoo mirrors use flat |
29 |
directory structure to store files. While this solution usually works, |
30 |
it does not scale well. Directories with large number of files usually |
31 |
have significant performance penalty, unless using filesystems |
32 |
specifically designed for that purpose. |
33 |
|
34 |
According to the Gentoo repository state at 2018-01-26 16:23, there |
35 |
was a total of 62652 unique distfiles in the repository. While |
36 |
the users realistically hit around 10% of that, distfile mirrors often |
37 |
hold even more files --- more so if old distfiles are not wiped |
38 |
immediately. |
39 |
|
40 |
While all filesystems used on Linux boxes should be able to cope with |
41 |
a number that large, they may suffer a performance penalty with even |
42 |
a few thousand files. Additionally, if mirrors enable directory indexes |
43 |
then generating the index imposes both a significant server overhead |
44 |
and a significant data transfer. At this moment, the index |
45 |
of distfiles.gentoo.org has around 17 MiB. |
46 |
|
47 |
Splitting the distfiles into multiple directories makes it possible |
48 |
to avoid those problems by reducing the number of files in a single |
49 |
directory. For example, splitting the forementioned set of distfiles |
50 |
into 16 directories that are roughly balanced allows to reduce |
51 |
the number of files in a single directory to around 4000. Splitting |
52 |
them further into 256 directories (16x16) results in 200-300 files |
53 |
per directory which should avoid any performance problems long-term, |
54 |
even assuming 300% growth of number of distfiles. |
55 |
|
56 |
|
57 |
Specification |
58 |
============= |
59 |
Mirror layout file |
60 |
------------------ |
61 |
A mirror adhering to this specification should include a ``layout.conf`` |
62 |
file in the top distfile directory. This file uses the format |
63 |
derived from the freedesktop Desktop Entry Specification file format |
64 |
[#DESKTOP_FORMAT]_. |
65 |
|
66 |
Before using each Gentoo mirror, the package manager should attempt |
67 |
to fetch (update) its ``layout.conf`` file and process it to determine |
68 |
how to use the mirror. If the file is not present, the package manager |
69 |
should behave as if it were empty. |
70 |
|
71 |
The package manager should recognize the sections and keys listed below. |
72 |
It should ignore any unrecognized sections or keys --- the format |
73 |
is intended to account for future extensions. |
74 |
|
75 |
This specification currently defines one section: ``[structure]``. |
76 |
This section defines one or more repository structure definitions |
77 |
using non-negative sequential integer keys. The definition with |
78 |
the ``0`` key is the most preferred structure. The package manager |
79 |
should ignore any formats it does not recognize. If this section |
80 |
is not present, the package manager should behave as if only ``flat`` |
81 |
structure were specified. |
82 |
|
83 |
The following structure definitions are supported: |
84 |
|
85 |
* ``flat`` to indicate the traditional flat structure where all |
86 |
distfiles are located in the top directory, |
87 |
|
88 |
* ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename |
89 |
hash structure`_ explained below. |
90 |
|
91 |
|
92 |
Filename hash structure |
93 |
----------------------- |
94 |
When using the filename hash structure, the distfiles are split |
95 |
into directories whose names are derived from the hash of distfile |
96 |
filename. This structure has two parameters: *algorithm name* |
97 |
and *cutoffs* list. |
98 |
|
99 |
The algorithm name must correspond to a valid Manifest hash name. |
100 |
An informational list of hashes is included in GLEP 74 [#GLEP74]_, |
101 |
and the policies for introducing new hashes are covered by GLEP 59 |
102 |
[#GLEP59]_. |
103 |
|
104 |
The cutoffs list specifies one or more integers separated by colons |
105 |
(``:``), indicating the number of bits (starting with the most |
106 |
significant bit) of the hash used to form subsequent subdirectory names. |
107 |
For example, the list of ``2:4`` would indicate that top-level directory |
108 |
names are formed using 2 most significant bits of the hash (resulting |
109 |
in 2² = 4 directories), and each of this directories would have |
110 |
subdirectories formed using the next 4 bits of the hash (resulting |
111 |
in 2⁴ = 8 subdirectories each). |
112 |
|
113 |
The exact algorithm for determining the distfile location follows: |
114 |
|
115 |
1. Let the distfile filename be **F**. |
116 |
|
117 |
2. Compute the hash of **F** and store its binary value as **H**. |
118 |
|
119 |
3. For each integer **C** in cutoff list: |
120 |
|
121 |
a. Take **C** most significant bits of **H** and store them as **V**. |
122 |
|
123 |
b. Convert **V** into hexadecimal integer, left padded with zeros |
124 |
to **C/4** digits (rounded up) and append it to the path, followed |
125 |
by the path separator. |
126 |
|
127 |
c. Shift **H** left **C** bits. |
128 |
|
129 |
4. Finally, append **F** to the obtained path. |
130 |
|
131 |
In particular, note that when using nested directories |
132 |
the subdirectories do not repeat the hash bits used in parent directory. |
133 |
|
134 |
|
135 |
Migrating mirrors to the hashed structure |
136 |
----------------------------------------- |
137 |
Since all distfile mirrors sync to the master Gentoo mirror, it should |
138 |
be enough to perform all the needed changes on the master mirror |
139 |
and wait for other mirrors to sync. The following procedure |
140 |
is recommended: |
141 |
|
142 |
1. Include the initial ``layout.conf`` listing only ``flat`` layout. |
143 |
|
144 |
2. Create the new structure alongside the flat structure. Wait for |
145 |
mirrors to sync. |
146 |
|
147 |
3. Once all mirrors receive the new structure, update ``layout.conf`` |
148 |
to list the ``filename-hash`` structure. |
149 |
|
150 |
4. Once a version of Portage supporting the new structure is stable long |
151 |
enough, remove the fallback ``flat`` structure from ``layout.conf`` |
152 |
and duplicate distfiles. |
153 |
|
154 |
This implies that during the migration period the distfiles will |
155 |
be stored duplicated on the mirrors and therefore will occupy twice |
156 |
as much space. Technically, this could be avoided either by using |
157 |
hard links or symbolic links. |
158 |
|
159 |
The hard link solution allows us to save space on the master mirror. |
160 |
Additionally, if ``-H`` option is used by the mirrors it avoids |
161 |
transferring existing files again. However, this option is known |
162 |
to be expensive and could cause significant server load. Without it, |
163 |
all mirrors need to transfer a second copy of all the existing files. |
164 |
|
165 |
The symbolic link solution could be more reliable if we could rely |
166 |
on mirrors using the ``--links`` rsync option. Without that, symbolic |
167 |
links are not transferred at all. |
168 |
|
169 |
|
170 |
Using hashed structure for local distfiles |
171 |
------------------------------------------ |
172 |
The hashed structure defined above could also be used for local distfile |
173 |
storage as used by the package manager. For this to work, the package |
174 |
manager authors need to ensure that: |
175 |
|
176 |
a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary |
177 |
directory where distfiles specific to the package are linked |
178 |
in a flat structure. |
179 |
|
180 |
b. All tools are updated to support the nested structure. |
181 |
|
182 |
c. The package manager provides a tool for users to easily manipulate |
183 |
distfiles, in particular to add distfiles for fetch-restricted |
184 |
packages into an appropriate subdirectory. |
185 |
|
186 |
For extended compatibility, the package manager may support finding |
187 |
distfiles in flat and nested structure simultaneously. |
188 |
|
189 |
|
190 |
Rationale |
191 |
========= |
192 |
Algorithm for splitting distfiles |
193 |
--------------------------------- |
194 |
The possible algorithms were considered with the following goals |
195 |
in mind: |
196 |
|
197 |
- the number of files in a single directory should not exceed 1000, |
198 |
|
199 |
- the total size of files in a single directory is not considered |
200 |
relevant, |
201 |
|
202 |
- the solution should preferably be future-proof, |
203 |
|
204 |
- moving distfiles should be avoided once it is deployed. |
205 |
|
206 |
It should also be noted that at this moment the package having most |
207 |
distfiles in Gentoo at the time is dev-texlive/texlive-latexextra, |
208 |
with the number of 8556 distfiles. All of them start with a common |
209 |
prefix of ``texlive-module-``. This specific prefix is used by a total |
210 |
of 23435 distfiles. |
211 |
|
212 |
In the original debate that occurred in bug #534528 [#BUG534528]_ |
213 |
and the mailing list review of the initial version of this GLEP [#ML1]_, |
214 |
four fundamental ideas for splitting distfiles were listed: |
215 |
|
216 |
a. using initial portion of filename, |
217 |
|
218 |
b. using initial portion of file hash, |
219 |
|
220 |
c. using initial portion of filename hash, |
221 |
|
222 |
d. using package category (and package name). |
223 |
|
224 |
The initial filename idea was to use the first character of filename, |
225 |
possibly followed by a longer part which was the idea historically |
226 |
used e.g. by PyPI Python package hosting. Its main advantage is |
227 |
simplicity. The users can easily determine the correct subdirectory |
228 |
by just looking at the distfile name. Sadly, this solution is not only |
229 |
very uneven but does not solve the problem. As mentioned above, |
230 |
the TeΧ Live packages share a long common prefix that make it impossible |
231 |
to split it properly with other packages on fixed-length prefixes. |
232 |
|
233 |
This idea has been followed by an adaptive proposal by Andrew Barchuk |
234 |
[#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly |
235 |
mapped to groups by a common prefix but instead each group contains |
236 |
all files between two prefixes being used (like in a dictionary). |
237 |
However, it has been pointed out that while this option can provide |
238 |
very even results initially, it is impossible to predict how it would |
239 |
be affected by future distfile changes and there will be a risk of |
240 |
needing to change the groups in the future. Furthermore, it is |
241 |
relatively complex and requires explicitly listing or obtaining used |
242 |
groups. |
243 |
|
244 |
Another option was to use an initial portion of distfile hashes. Its |
245 |
main advantage is that cryptographic hash algorithms can provide |
246 |
a more balanced split with random data. Furthermore, since hashes are |
247 |
stored in Manifests using them has no cost for users. However, this |
248 |
solution has three disadvantages: |
249 |
|
250 |
1. Not all files in the distfile tree are covered by package Manifests. |
251 |
Additional files are injected into the mirrors, and those will |
252 |
not have a clearly-defined location. |
253 |
|
254 |
2. User-provided distfiles (e.g. for fetch-restricted packages) with |
255 |
hash mismatches would be placed in the wrong subdirectory, |
256 |
potentially causing confusing errors. |
257 |
|
258 |
3. The hash values are unknown for newly-downloaded distfiles, so |
259 |
``repoman`` (or an equivalent tool) would have to use a temporary |
260 |
directory before locating the file in appropriate subdirectory. |
261 |
|
262 |
Using filename hashes has proven to provide a similar balance to using |
263 |
file hashes. Furthermore, since filenames are known up front this |
264 |
solution does not suffer from the listed problems. While hashes need |
265 |
to be computed manually, hashing short string should not cause |
266 |
any performance problems. |
267 |
|
268 |
Jason Zaman has suggested to use package categories (and package names) |
269 |
[#PKGNAME]_. However, this solution has multiple problems: |
270 |
|
271 |
a. it does not solve the problem for large packages such as TeΧ Live, |
272 |
|
273 |
b. it introduces many unnecessarily small directories, |
274 |
|
275 |
c. it requires an explicit knowledge of which package distfiles |
276 |
belong to, |
277 |
|
278 |
d. it does not provide an explicit solution to the problem of distfiles |
279 |
shared by multiple packages, |
280 |
|
281 |
e. it does not provide a solution to the problem of injected distfiles. |
282 |
|
283 |
All the options considered, the filename hash solution was selected |
284 |
as one that solves all the forementioned problems while introducing |
285 |
relatively low complexity and being reasonably future-proof. |
286 |
|
287 |
.. figure:: glep-0075-extras/by-filename.png |
288 |
|
289 |
Distribution of distfiles by first character of filenames |
290 |
(note: y axis is on log scale) |
291 |
|
292 |
.. figure:: glep-0075-extras/by-csum.png |
293 |
|
294 |
Distribution of distfiles by first hex-digit of checksum |
295 |
(x --- content checksum, + --- filename checksum) |
296 |
|
297 |
.. figure:: glep-0075-extras/by-csum2.png |
298 |
|
299 |
Distribution of distfiles by two first hex-digits of checksum |
300 |
(x --- content checksum, + --- filename checksum) |
301 |
|
302 |
|
303 |
Layout file |
304 |
----------- |
305 |
The presence of control file has been suggested in the original |
306 |
discussion. Its main purpose is to let package managers cleanly handle |
307 |
the migration and detect how to correctly query the mirrors throughout |
308 |
it. Furthermore, it makes future changes easier. |
309 |
|
310 |
The format lines specifically mean to hardcode as little about |
311 |
the actual algorithm as possible. Therefore, we can easily change |
312 |
the hash used or the exact split structure without having to update |
313 |
the package managers or even provide a compatibility layout. |
314 |
|
315 |
The file is also open for future extensions to provide additional mirror |
316 |
metadata. However, no clear use for that has been determined so far. |
317 |
|
318 |
|
319 |
Hash algorithm |
320 |
-------------- |
321 |
The hash algorithm support is fully deferred to the existing code |
322 |
in the package managers that is required to handle Manifests. |
323 |
In particular, it is recommended to reuse one of the hashes that are |
324 |
used in Manifest entries at the time. This avoids code duplication |
325 |
and reuses an existing mechanism to handle hash upgrades. |
326 |
|
327 |
During the discussion, it has been pointed that this particular use case |
328 |
does not require a cryptographically strong hash and a faster algorithm |
329 |
could be used instead. However, given the short length of hashed |
330 |
strings performance is not a problem, and speed does not justify |
331 |
the resulting code duplication. |
332 |
|
333 |
It has also been pointed out that e.g. the BLAKE2 hash family provides |
334 |
the ability of creating arbitrary length hashes instead of truncating |
335 |
the standard-length hash. However, not all implementations of BLAKE2 |
336 |
support that and relying on it could reduce portability for no apparent |
337 |
gain. |
338 |
|
339 |
|
340 |
Backwards Compatibility |
341 |
======================= |
342 |
Mirror compatibility |
343 |
-------------------- |
344 |
The mirrored files are propagated to other mirrors as opaque directory |
345 |
structure. Therefore, there are no backwards compatibility concerns |
346 |
on the mirroring side. |
347 |
|
348 |
Backwards compatibility with existing clients is detailed |
349 |
in `migrating mirrors to the hashed structure`_ section. Backwards |
350 |
compatibility with the old clients will be provided by preserving |
351 |
the flat structure during the transitional period. |
352 |
|
353 |
The new clients will fetch the ``layout.conf`` file to avoid backwards |
354 |
compatibility concerns in the future. In case of hitting an old mirror, |
355 |
the package manager will default to the ``flat`` structure. |
356 |
|
357 |
|
358 |
Package manager storage compatibility |
359 |
------------------------------------- |
360 |
The exact means of preserving backwards compatibility in package manager |
361 |
storage are left to the package manager authors. However, it is |
362 |
recommended that package managers continue to support the flat layout |
363 |
even if it is no longer the default. The package manager may either |
364 |
continue to read files from this location or automatically move them |
365 |
to an appropriate subdirectory. |
366 |
|
367 |
|
368 |
Reference Implementation |
369 |
======================== |
370 |
TODO. |
371 |
|
372 |
|
373 |
References |
374 |
========== |
375 |
.. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file |
376 |
(https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html) |
377 |
|
378 |
.. [#GLEP74] GLEP 74: Full-tree verification using Manifest files: |
379 |
Checksum algorithms (informational) |
380 |
(https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational) |
381 |
|
382 |
.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications |
383 |
(https://www.gentoo.org/glep/glep-0059.html) |
384 |
|
385 |
.. [#BUG534528] Bug 534528 - distfiles should be sorted into subdirectories |
386 |
of DISTDIR |
387 |
(https://bugs.gentoo.org/534528) |
388 |
|
389 |
.. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure |
390 |
(https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba) |
391 |
|
392 |
.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges |
393 |
for each directory computed in a way to have the files distributed evenly' |
394 |
(https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8) |
395 |
|
396 |
.. [#PKGNAME] Jason Zamal's reply including 'using the same dir layout |
397 |
as the packages themselves) |
398 |
(https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975) |
399 |
|
400 |
|
401 |
Copyright |
402 |
========= |
403 |
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 |
404 |
Unported License. To view a copy of this license, visit |
405 |
http://creativecommons.org/licenses/by-sa/3.0/. |
406 |
|
407 |
-- |
408 |
Best regards, |
409 |
Michał Górny |