1 |
On Monday, January 29, 2018, Michał Górny <mgorny@g.o> wrote: |
2 |
> Here's an updated version. I've tried to incorporate most |
3 |
> of the feedback so far. |
4 |
> |
5 |
> |
6 |
> --- |
7 |
> GLEP: 75 |
8 |
> Title: Split distfile mirror directory structure |
9 |
> Author: Michał Górny <mgorny@g.o>, |
10 |
> Robin H. Johnson <robbat2@g.o> |
11 |
> Type: Standards Track |
12 |
> Status: Draft |
13 |
> Version: 1 |
14 |
> Created: 2018-01-26 |
15 |
> Last-Modified: 2018-01-27 |
16 |
> Post-History: 2018-01-27 |
17 |
> Content-Type: text/x-rst |
18 |
> --- |
19 |
> |
20 |
> Abstract |
21 |
> ======== |
22 |
> This GLEP describes the procedure for splitting the distfiles on mirrors |
23 |
> into multiple directories with the goal of reducing the number of files |
24 |
> in a single directory. |
25 |
> |
26 |
> |
27 |
> Motivation |
28 |
> ========== |
29 |
> At the moment, both the package manager and Gentoo mirrors use flat |
30 |
> directory structure to store files. While this solution usually works, |
31 |
> it does not scale well. Directories with large number of files usually |
32 |
> have significant performance penalty, unless using filesystems |
33 |
> specifically designed for that purpose. |
34 |
> |
35 |
> According to the Gentoo repository state at 2018-01-26 16:23, there |
36 |
> was a total of 62652 unique distfiles in the repository. While |
37 |
> the users realistically hit around 10% of that, distfile mirrors often |
38 |
> hold even more files --- more so if old distfiles are not wiped |
39 |
> immediately. |
40 |
> |
41 |
> While all filesystems used on Linux boxes should be able to cope with |
42 |
> a number that large, they may suffer a performance penalty with even |
43 |
> a few thousand files. Additionally, if mirrors enable directory indexes |
44 |
> then generating the index imposes both a significant server overhead |
45 |
> and a significant data transfer. At this moment, the index |
46 |
> of distfiles.gentoo.org has around 17 MiB. |
47 |
> |
48 |
> Splitting the distfiles into multiple directories makes it possible |
49 |
> to avoid those problems by reducing the number of files in a single |
50 |
> directory. For example, splitting the forementioned set of distfiles |
51 |
> into 16 directories that are roughly balanced allows to reduce |
52 |
> the number of files in a single directory to around 4000. Splitting |
53 |
> them further into 256 directories (16x16) results in 200-300 files |
54 |
> per directory which should avoid any performance problems long-term, |
55 |
> even assuming 300% growth of number of distfiles. |
56 |
> |
57 |
> |
58 |
> Specification |
59 |
> ============= |
60 |
> Mirror layout file |
61 |
> ------------------ |
62 |
> A mirror adhering to this specification should include a ``layout.conf`` |
63 |
> file in the top distfile directory. This file uses the format |
64 |
> derived from the freedesktop Desktop Entry Specification file format |
65 |
> [#DESKTOP_FORMAT]_. |
66 |
> |
67 |
> Before using each Gentoo mirror, the package manager should attempt |
68 |
> to fetch (update) its ``layout.conf`` file and process it to determine |
69 |
> how to use the mirror. If the file is not present, the package manager |
70 |
> should behave as if it were empty. |
71 |
> |
72 |
> The package manager should recognize the sections and keys listed below. |
73 |
> It should ignore any unrecognized sections or keys --- the format |
74 |
> is intended to account for future extensions. |
75 |
> |
76 |
> This specification currently defines one section: ``[structure]``. |
77 |
> This section defines one or more repository structure definitions |
78 |
> using non-negative sequential integer keys. The definition with |
79 |
> the ``0`` key is the most preferred structure. The package manager |
80 |
> should ignore any formats it does not recognize. If this section |
81 |
> is not present, the package manager should behave as if only ``flat`` |
82 |
> structure were specified. |
83 |
> |
84 |
> The following structure definitions are supported: |
85 |
> |
86 |
> * ``flat`` to indicate the traditional flat structure where all |
87 |
> distfiles are located in the top directory, |
88 |
> |
89 |
> * ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename |
90 |
> hash structure`_ explained below. |
91 |
> |
92 |
> |
93 |
> Filename hash structure |
94 |
> ----------------------- |
95 |
> When using the filename hash structure, the distfiles are split |
96 |
> into directories whose names are derived from the hash of distfile |
97 |
> filename. This structure has two parameters: *algorithm name* |
98 |
> and *cutoffs* list. |
99 |
> |
100 |
> The algorithm name must correspond to a valid Manifest hash name. |
101 |
> An informational list of hashes is included in GLEP 74 [#GLEP74]_, |
102 |
> and the policies for introducing new hashes are covered by GLEP 59 |
103 |
> [#GLEP59]_. |
104 |
> |
105 |
> The cutoffs list specifies one or more integers separated by colons |
106 |
> (``:``), indicating the number of bits (starting with the most |
107 |
> significant bit) of the hash used to form subsequent subdirectory names. |
108 |
> For example, the list of ``2:4`` would indicate that top-level directory |
109 |
> names are formed using 2 most significant bits of the hash (resulting |
110 |
> in 2² = 4 directories), and each of this directories would have |
111 |
> subdirectories formed using the next 4 bits of the hash (resulting |
112 |
> in 2⁴ = 8 subdirectories each). |
113 |
> |
114 |
> The exact algorithm for determining the distfile location follows: |
115 |
> |
116 |
> 1. Let the distfile filename be **F**. |
117 |
> |
118 |
> 2. Compute the hash of **F** and store its binary value as **H**. |
119 |
> |
120 |
> 3. For each integer **C** in cutoff list: |
121 |
> |
122 |
> a. Take **C** most significant bits of **H** and store them as **V**. |
123 |
> |
124 |
> b. Convert **V** into hexadecimal integer, left padded with zeros |
125 |
> to **C/4** digits (rounded up) and append it to the path, followed |
126 |
> by the path separator. |
127 |
> |
128 |
> c. Shift **H** left **C** bits. |
129 |
> |
130 |
> 4. Finally, append **F** to the obtained path. |
131 |
> |
132 |
> In particular, note that when using nested directories |
133 |
> the subdirectories do not repeat the hash bits used in parent directory. |
134 |
> |
135 |
> |
136 |
> Migrating mirrors to the hashed structure |
137 |
> ----------------------------------------- |
138 |
> Since all distfile mirrors sync to the master Gentoo mirror, it should |
139 |
> be enough to perform all the needed changes on the master mirror |
140 |
> and wait for other mirrors to sync. The following procedure |
141 |
> is recommended: |
142 |
> |
143 |
> 1. Include the initial ``layout.conf`` listing only ``flat`` layout. |
144 |
> |
145 |
> 2. Create the new structure alongside the flat structure. Wait for |
146 |
> mirrors to sync. |
147 |
> |
148 |
> 3. Once all mirrors receive the new structure, update ``layout.conf`` |
149 |
> to list the ``filename-hash`` structure. |
150 |
> |
151 |
> 4. Once a version of Portage supporting the new structure is stable long |
152 |
> enough, remove the fallback ``flat`` structure from ``layout.conf`` |
153 |
> and duplicate distfiles. |
154 |
> |
155 |
> This implies that during the migration period the distfiles will |
156 |
> be stored duplicated on the mirrors and therefore will occupy twice |
157 |
> as much space. Technically, this could be avoided either by using |
158 |
> hard links or symbolic links. |
159 |
> |
160 |
> The hard link solution allows us to save space on the master mirror. |
161 |
> Additionally, if ``-H`` option is used by the mirrors it avoids |
162 |
> transferring existing files again. However, this option is known |
163 |
> to be expensive and could cause significant server load. Without it, |
164 |
> all mirrors need to transfer a second copy of all the existing files. |
165 |
> |
166 |
> The symbolic link solution could be more reliable if we could rely |
167 |
> on mirrors using the ``--links`` rsync option. Without that, symbolic |
168 |
> links are not transferred at all. |
169 |
> |
170 |
> |
171 |
> Using hashed structure for local distfiles |
172 |
> ------------------------------------------ |
173 |
> The hashed structure defined above could also be used for local distfile |
174 |
> storage as used by the package manager. For this to work, the package |
175 |
> manager authors need to ensure that: |
176 |
> |
177 |
> a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary |
178 |
> directory where distfiles specific to the package are linked |
179 |
> in a flat structure. |
180 |
> |
181 |
> b. All tools are updated to support the nested structure. |
182 |
> |
183 |
> c. The package manager provides a tool for users to easily manipulate |
184 |
> distfiles, in particular to add distfiles for fetch-restricted |
185 |
> packages into an appropriate subdirectory. |
186 |
> |
187 |
> For extended compatibility, the package manager may support finding |
188 |
> distfiles in flat and nested structure simultaneously. |
189 |
> |
190 |
> |
191 |
> Rationale |
192 |
> ========= |
193 |
> Algorithm for splitting distfiles |
194 |
> --------------------------------- |
195 |
> The possible algorithms were considered with the following goals |
196 |
> in mind: |
197 |
> |
198 |
> - the number of files in a single directory should not exceed 1000, |
199 |
> |
200 |
> - the total size of files in a single directory is not considered |
201 |
> relevant, |
202 |
> |
203 |
> - the solution should preferably be future-proof, |
204 |
> |
205 |
> - moving distfiles should be avoided once it is deployed. |
206 |
> |
207 |
> It should also be noted that at this moment the package having most |
208 |
> distfiles in Gentoo at the time is dev-texlive/texlive-latexextra, |
209 |
> with the number of 8556 distfiles. All of them start with a common |
210 |
> prefix of ``texlive-module-``. This specific prefix is used by a total |
211 |
> of 23435 distfiles. |
212 |
> |
213 |
> In the original debate that occurred in bug #534528 [#BUG534528]_ |
214 |
> and the mailing list review of the initial version of this GLEP [#ML1]_, |
215 |
> four fundamental ideas for splitting distfiles were listed: |
216 |
> |
217 |
> a. using initial portion of filename, |
218 |
> |
219 |
> b. using initial portion of file hash, |
220 |
> |
221 |
> c. using initial portion of filename hash, |
222 |
> |
223 |
> d. using package category (and package name). |
224 |
> |
225 |
> The initial filename idea was to use the first character of filename, |
226 |
> possibly followed by a longer part which was the idea historically |
227 |
> used e.g. by PyPI Python package hosting. Its main advantage is |
228 |
> simplicity. The users can easily determine the correct subdirectory |
229 |
> by just looking at the distfile name. Sadly, this solution is not only |
230 |
> very uneven but does not solve the problem. As mentioned above, |
231 |
> the TeΧ Live packages share a long common prefix that make it impossible |
232 |
> to split it properly with other packages on fixed-length prefixes. |
233 |
> |
234 |
> This idea has been followed by an adaptive proposal by Andrew Barchuk |
235 |
> [#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly |
236 |
> mapped to groups by a common prefix but instead each group contains |
237 |
> all files between two prefixes being used (like in a dictionary). |
238 |
> However, it has been pointed out that while this option can provide |
239 |
> very even results initially, it is impossible to predict how it would |
240 |
> be affected by future distfile changes and there will be a risk of |
241 |
> needing to change the groups in the future. Furthermore, it is |
242 |
> relatively complex and requires explicitly listing or obtaining used |
243 |
> groups. |
244 |
> |
245 |
> Another option was to use an initial portion of distfile hashes. Its |
246 |
> main advantage is that cryptographic hash algorithms can provide |
247 |
> a more balanced split with random data. Furthermore, since hashes are |
248 |
> stored in Manifests using them has no cost for users. However, this |
249 |
> solution has three disadvantages: |
250 |
> |
251 |
> 1. Not all files in the distfile tree are covered by package Manifests. |
252 |
> Additional files are injected into the mirrors, and those will |
253 |
> not have a clearly-defined location. |
254 |
> |
255 |
> 2. User-provided distfiles (e.g. for fetch-restricted packages) with |
256 |
> hash mismatches would be placed in the wrong subdirectory, |
257 |
> potentially causing confusing errors. |
258 |
> |
259 |
> 3. The hash values are unknown for newly-downloaded distfiles, so |
260 |
> ``repoman`` (or an equivalent tool) would have to use a temporary |
261 |
> directory before locating the file in appropriate subdirectory. |
262 |
> |
263 |
> Using filename hashes has proven to provide a similar balance to using |
264 |
> file hashes. Furthermore, since filenames are known up front this |
265 |
> solution does not suffer from the listed problems. While hashes need |
266 |
> to be computed manually, hashing short string should not cause |
267 |
> any performance problems. |
268 |
> |
269 |
> Jason Zaman has suggested to use package categories (and package names) |
270 |
> [#PKGNAME]_. However, this solution has multiple problems: |
271 |
> |
272 |
> a. it does not solve the problem for large packages such as TeΧ Live, |
273 |
> |
274 |
> b. it introduces many unnecessarily small directories, |
275 |
> |
276 |
> c. it requires an explicit knowledge of which package distfiles |
277 |
> belong to, |
278 |
> |
279 |
> d. it does not provide an explicit solution to the problem of distfiles |
280 |
> shared by multiple packages, |
281 |
> |
282 |
> e. it does not provide a solution to the problem of injected distfiles. |
283 |
> |
284 |
> All the options considered, the filename hash solution was selected |
285 |
> as one that solves all the forementioned problems while introducing |
286 |
> relatively low complexity and being reasonably future-proof. |
287 |
> |
288 |
> .. figure:: glep-0075-extras/by-filename.png |
289 |
> |
290 |
> Distribution of distfiles by first character of filenames |
291 |
> (note: y axis is on log scale) |
292 |
> |
293 |
> .. figure:: glep-0075-extras/by-csum.png |
294 |
> |
295 |
> Distribution of distfiles by first hex-digit of checksum |
296 |
> (x --- content checksum, + --- filename checksum) |
297 |
> |
298 |
> .. figure:: glep-0075-extras/by-csum2.png |
299 |
> |
300 |
> Distribution of distfiles by two first hex-digits of checksum |
301 |
> (x --- content checksum, + --- filename checksum) |
302 |
> |
303 |
> |
304 |
> Layout file |
305 |
> ----------- |
306 |
> The presence of control file has been suggested in the original |
307 |
> discussion. Its main purpose is to let package managers cleanly handle |
308 |
> the migration and detect how to correctly query the mirrors throughout |
309 |
> it. Furthermore, it makes future changes easier. |
310 |
> |
311 |
> The format lines specifically mean to hardcode as little about |
312 |
> the actual algorithm as possible. Therefore, we can easily change |
313 |
> the hash used or the exact split structure without having to update |
314 |
> the package managers or even provide a compatibility layout. |
315 |
> |
316 |
> The file is also open for future extensions to provide additional mirror |
317 |
> metadata. However, no clear use for that has been determined so far. |
318 |
> |
319 |
> |
320 |
> Hash algorithm |
321 |
> -------------- |
322 |
> The hash algorithm support is fully deferred to the existing code |
323 |
> in the package managers that is required to handle Manifests. |
324 |
> In particular, it is recommended to reuse one of the hashes that are |
325 |
> used in Manifest entries at the time. This avoids code duplication |
326 |
> and reuses an existing mechanism to handle hash upgrades. |
327 |
> |
328 |
> During the discussion, it has been pointed that this particular use case |
329 |
> does not require a cryptographically strong hash and a faster algorithm |
330 |
> could be used instead. However, given the short length of hashed |
331 |
> strings performance is not a problem, and speed does not justify |
332 |
> the resulting code duplication. |
333 |
> |
334 |
> It has also been pointed out that e.g. the BLAKE2 hash family provides |
335 |
> the ability of creating arbitrary length hashes instead of truncating |
336 |
> the standard-length hash. However, not all implementations of BLAKE2 |
337 |
> support that and relying on it could reduce portability for no apparent |
338 |
> gain. |
339 |
> |
340 |
> |
341 |
> Backwards Compatibility |
342 |
> ======================= |
343 |
> Mirror compatibility |
344 |
> -------------------- |
345 |
> The mirrored files are propagated to other mirrors as opaque directory |
346 |
> structure. Therefore, there are no backwards compatibility concerns |
347 |
> on the mirroring side. |
348 |
> |
349 |
> Backwards compatibility with existing clients is detailed |
350 |
> in `migrating mirrors to the hashed structure`_ section. Backwards |
351 |
> compatibility with the old clients will be provided by preserving |
352 |
> the flat structure during the transitional period. |
353 |
> |
354 |
> The new clients will fetch the ``layout.conf`` file to avoid backwards |
355 |
> compatibility concerns in the future. In case of hitting an old mirror, |
356 |
> the package manager will default to the ``flat`` structure. |
357 |
> |
358 |
> |
359 |
> Package manager storage compatibility |
360 |
> ------------------------------------- |
361 |
> The exact means of preserving backwards compatibility in package manager |
362 |
> storage are left to the package manager authors. However, it is |
363 |
> recommended that package managers continue to support the flat layout |
364 |
> even if it is no longer the default. The package manager may either |
365 |
> continue to read files from this location or automatically move them |
366 |
> to an appropriate subdirectory. |
367 |
> |
368 |
> |
369 |
> Reference Implementation |
370 |
> ======================== |
371 |
> TODO. |
372 |
> |
373 |
> |
374 |
> References |
375 |
> ========== |
376 |
> .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file |
377 |
> ( |
378 |
https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html) |
379 |
> |
380 |
> .. [#GLEP74] GLEP 74: Full-tree verification using Manifest files: |
381 |
> Checksum algorithms (informational) |
382 |
> ( |
383 |
https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational |
384 |
) |
385 |
> |
386 |
> .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications |
387 |
> (https://www.gentoo.org/glep/glep-0059.html) |
388 |
> |
389 |
> .. [#BUG534528] Bug 534528 - distfiles should be sorted into |
390 |
subdirectories |
391 |
> of DISTDIR |
392 |
> (https://bugs.gentoo.org/534528) |
393 |
> |
394 |
> .. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory |
395 |
structure |
396 |
> ( |
397 |
https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba |
398 |
) |
399 |
> |
400 |
> .. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges |
401 |
> for each directory computed in a way to have the files distributed |
402 |
evenly' |
403 |
> ( |
404 |
https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8 |
405 |
) |
406 |
> |
407 |
> .. [#PKGNAME] Jason Zamal's reply including 'using the same dir layout |
408 |
> as the packages themselves) |
409 |
> ( |
410 |
https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975 |
411 |
) |
412 |
> |
413 |
> |
414 |
> Copyright |
415 |
> ========= |
416 |
> This work is licensed under the Creative Commons Attribution-ShareAlike |
417 |
3.0 |
418 |
> Unported License. To view a copy of this license, visit |
419 |
> http://creativecommons.org/licenses/by-sa/3.0/. |
420 |
> |
421 |
> -- |
422 |
> Best regards, |
423 |
> Michał Górny |
424 |
> |
425 |
|
426 |
It's going to be hash based? Why? I tried to follow the conversation but |
427 |
there's now close to 5 of these posts in the mailing list with different |
428 |
conversations in each. |
429 |
|
430 |
Using filename prefixes is boring and not uniform, but I feel I should |
431 |
point out that most distfile hosts are still doing fine. Microoptimizing |
432 |
this seems like wasted effort. |
433 |
|
434 |
Cheers, |
435 |
R0b0t1 |