Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2)
Date: Mon, 29 Jan 2018 19:37:58
Message-Id: 1517254667.1187.25.camel@gentoo.org
In Reply to: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure by "Michał Górny"
1 Here's an updated version. I've tried to incorporate most
2 of the feedback so far.
3
4
5 ---
6 GLEP: 75
7 Title: Split distfile mirror directory structure
8 Author: Michał Górny <mgorny@g.o>,
9 Robin H. Johnson <robbat2@g.o>
10 Type: Standards Track
11 Status: Draft
12 Version: 1
13 Created: 2018-01-26
14 Last-Modified: 2018-01-27
15 Post-History: 2018-01-27
16 Content-Type: text/x-rst
17 ---
18
19 Abstract
20 ========
21 This GLEP describes the procedure for splitting the distfiles on mirrors
22 into multiple directories with the goal of reducing the number of files
23 in a single directory.
24
25
26 Motivation
27 ==========
28 At the moment, both the package manager and Gentoo mirrors use flat
29 directory structure to store files. While this solution usually works,
30 it does not scale well. Directories with large number of files usually
31 have significant performance penalty, unless using filesystems
32 specifically designed for that purpose.
33
34 According to the Gentoo repository state at 2018-01-26 16:23, there
35 was a total of 62652 unique distfiles in the repository. While
36 the users realistically hit around 10% of that, distfile mirrors often
37 hold even more files --- more so if old distfiles are not wiped
38 immediately.
39
40 While all filesystems used on Linux boxes should be able to cope with
41 a number that large, they may suffer a performance penalty with even
42 a few thousand files. Additionally, if mirrors enable directory indexes
43 then generating the index imposes both a significant server overhead
44 and a significant data transfer. At this moment, the index
45 of distfiles.gentoo.org has around 17 MiB.
46
47 Splitting the distfiles into multiple directories makes it possible
48 to avoid those problems by reducing the number of files in a single
49 directory. For example, splitting the forementioned set of distfiles
50 into 16 directories that are roughly balanced allows to reduce
51 the number of files in a single directory to around 4000. Splitting
52 them further into 256 directories (16x16) results in 200-300 files
53 per directory which should avoid any performance problems long-term,
54 even assuming 300% growth of number of distfiles.
55
56
57 Specification
58 =============
59 Mirror layout file
60 ------------------
61 A mirror adhering to this specification should include a ``layout.conf``
62 file in the top distfile directory. This file uses the format
63 derived from the freedesktop Desktop Entry Specification file format
64 [#DESKTOP_FORMAT]_.
65
66 Before using each Gentoo mirror, the package manager should attempt
67 to fetch (update) its ``layout.conf`` file and process it to determine
68 how to use the mirror. If the file is not present, the package manager
69 should behave as if it were empty.
70
71 The package manager should recognize the sections and keys listed below.
72 It should ignore any unrecognized sections or keys --- the format
73 is intended to account for future extensions.
74
75 This specification currently defines one section: ``[structure]``.
76 This section defines one or more repository structure definitions
77 using non-negative sequential integer keys. The definition with
78 the ``0`` key is the most preferred structure. The package manager
79 should ignore any formats it does not recognize. If this section
80 is not present, the package manager should behave as if only ``flat``
81 structure were specified.
82
83 The following structure definitions are supported:
84
85 * ``flat`` to indicate the traditional flat structure where all
86 distfiles are located in the top directory,
87
88 * ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
89 hash structure`_ explained below.
90
91
92 Filename hash structure
93 -----------------------
94 When using the filename hash structure, the distfiles are split
95 into directories whose names are derived from the hash of distfile
96 filename. This structure has two parameters: *algorithm name*
97 and *cutoffs* list.
98
99 The algorithm name must correspond to a valid Manifest hash name.
100 An informational list of hashes is included in GLEP 74 [#GLEP74]_,
101 and the policies for introducing new hashes are covered by GLEP 59
102 [#GLEP59]_.
103
104 The cutoffs list specifies one or more integers separated by colons
105 (``:``), indicating the number of bits (starting with the most
106 significant bit) of the hash used to form subsequent subdirectory names.
107 For example, the list of ``2:4`` would indicate that top-level directory
108 names are formed using 2 most significant bits of the hash (resulting
109 in 2² = 4 directories), and each of this directories would have
110 subdirectories formed using the next 4 bits of the hash (resulting
111 in 2⁴ = 8 subdirectories each).
112
113 The exact algorithm for determining the distfile location follows:
114
115 1. Let the distfile filename be **F**.
116
117 2. Compute the hash of **F** and store its binary value as **H**.
118
119 3. For each integer **C** in cutoff list:
120
121 a. Take **C** most significant bits of **H** and store them as **V**.
122
123 b. Convert **V** into hexadecimal integer, left padded with zeros
124 to **C/4** digits (rounded up) and append it to the path, followed
125 by the path separator.
126
127 c. Shift **H** left **C** bits.
128
129 4. Finally, append **F** to the obtained path.
130
131 In particular, note that when using nested directories
132 the subdirectories do not repeat the hash bits used in parent directory.
133
134
135 Migrating mirrors to the hashed structure
136 -----------------------------------------
137 Since all distfile mirrors sync to the master Gentoo mirror, it should
138 be enough to perform all the needed changes on the master mirror
139 and wait for other mirrors to sync. The following procedure
140 is recommended:
141
142 1. Include the initial ``layout.conf`` listing only ``flat`` layout.
143
144 2. Create the new structure alongside the flat structure. Wait for
145 mirrors to sync.
146
147 3. Once all mirrors receive the new structure, update ``layout.conf``
148 to list the ``filename-hash`` structure.
149
150 4. Once a version of Portage supporting the new structure is stable long
151 enough, remove the fallback ``flat`` structure from ``layout.conf``
152 and duplicate distfiles.
153
154 This implies that during the migration period the distfiles will
155 be stored duplicated on the mirrors and therefore will occupy twice
156 as much space. Technically, this could be avoided either by using
157 hard links or symbolic links.
158
159 The hard link solution allows us to save space on the master mirror.
160 Additionally, if ``-H`` option is used by the mirrors it avoids
161 transferring existing files again. However, this option is known
162 to be expensive and could cause significant server load. Without it,
163 all mirrors need to transfer a second copy of all the existing files.
164
165 The symbolic link solution could be more reliable if we could rely
166 on mirrors using the ``--links`` rsync option. Without that, symbolic
167 links are not transferred at all.
168
169
170 Using hashed structure for local distfiles
171 ------------------------------------------
172 The hashed structure defined above could also be used for local distfile
173 storage as used by the package manager. For this to work, the package
174 manager authors need to ensure that:
175
176 a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
177 directory where distfiles specific to the package are linked
178 in a flat structure.
179
180 b. All tools are updated to support the nested structure.
181
182 c. The package manager provides a tool for users to easily manipulate
183 distfiles, in particular to add distfiles for fetch-restricted
184 packages into an appropriate subdirectory.
185
186 For extended compatibility, the package manager may support finding
187 distfiles in flat and nested structure simultaneously.
188
189
190 Rationale
191 =========
192 Algorithm for splitting distfiles
193 ---------------------------------
194 The possible algorithms were considered with the following goals
195 in mind:
196
197 - the number of files in a single directory should not exceed 1000,
198
199 - the total size of files in a single directory is not considered
200 relevant,
201
202 - the solution should preferably be future-proof,
203
204 - moving distfiles should be avoided once it is deployed.
205
206 It should also be noted that at this moment the package having most
207 distfiles in Gentoo at the time is dev-texlive/texlive-latexextra,
208 with the number of 8556 distfiles. All of them start with a common
209 prefix of ``texlive-module-``. This specific prefix is used by a total
210 of 23435 distfiles.
211
212 In the original debate that occurred in bug #534528 [#BUG534528]_
213 and the mailing list review of the initial version of this GLEP [#ML1]_,
214 four fundamental ideas for splitting distfiles were listed:
215
216 a. using initial portion of filename,
217
218 b. using initial portion of file hash,
219
220 c. using initial portion of filename hash,
221
222 d. using package category (and package name).
223
224 The initial filename idea was to use the first character of filename,
225 possibly followed by a longer part which was the idea historically
226 used e.g. by PyPI Python package hosting. Its main advantage is
227 simplicity. The users can easily determine the correct subdirectory
228 by just looking at the distfile name. Sadly, this solution is not only
229 very uneven but does not solve the problem. As mentioned above,
230 the TeΧ Live packages share a long common prefix that make it impossible
231 to split it properly with other packages on fixed-length prefixes.
232
233 This idea has been followed by an adaptive proposal by Andrew Barchuk
234 [#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
235 mapped to groups by a common prefix but instead each group contains
236 all files between two prefixes being used (like in a dictionary).
237 However, it has been pointed out that while this option can provide
238 very even results initially, it is impossible to predict how it would
239 be affected by future distfile changes and there will be a risk of
240 needing to change the groups in the future. Furthermore, it is
241 relatively complex and requires explicitly listing or obtaining used
242 groups.
243
244 Another option was to use an initial portion of distfile hashes. Its
245 main advantage is that cryptographic hash algorithms can provide
246 a more balanced split with random data. Furthermore, since hashes are
247 stored in Manifests using them has no cost for users. However, this
248 solution has three disadvantages:
249
250 1. Not all files in the distfile tree are covered by package Manifests.
251 Additional files are injected into the mirrors, and those will
252 not have a clearly-defined location.
253
254 2. User-provided distfiles (e.g. for fetch-restricted packages) with
255 hash mismatches would be placed in the wrong subdirectory,
256 potentially causing confusing errors.
257
258 3. The hash values are unknown for newly-downloaded distfiles, so
259 ``repoman`` (or an equivalent tool) would have to use a temporary
260 directory before locating the file in appropriate subdirectory.
261
262 Using filename hashes has proven to provide a similar balance to using
263 file hashes. Furthermore, since filenames are known up front this
264 solution does not suffer from the listed problems. While hashes need
265 to be computed manually, hashing short string should not cause
266 any performance problems.
267
268 Jason Zaman has suggested to use package categories (and package names)
269 [#PKGNAME]_. However, this solution has multiple problems:
270
271 a. it does not solve the problem for large packages such as TeΧ Live,
272
273 b. it introduces many unnecessarily small directories,
274
275 c. it requires an explicit knowledge of which package distfiles
276 belong to,
277
278 d. it does not provide an explicit solution to the problem of distfiles
279 shared by multiple packages,
280
281 e. it does not provide a solution to the problem of injected distfiles.
282
283 All the options considered, the filename hash solution was selected
284 as one that solves all the forementioned problems while introducing
285 relatively low complexity and being reasonably future-proof.
286
287 .. figure:: glep-0075-extras/by-filename.png
288
289 Distribution of distfiles by first character of filenames
290 (note: y axis is on log scale)
291
292 .. figure:: glep-0075-extras/by-csum.png
293
294 Distribution of distfiles by first hex-digit of checksum
295 (x --- content checksum, + --- filename checksum)
296
297 .. figure:: glep-0075-extras/by-csum2.png
298
299 Distribution of distfiles by two first hex-digits of checksum
300 (x --- content checksum, + --- filename checksum)
301
302
303 Layout file
304 -----------
305 The presence of control file has been suggested in the original
306 discussion. Its main purpose is to let package managers cleanly handle
307 the migration and detect how to correctly query the mirrors throughout
308 it. Furthermore, it makes future changes easier.
309
310 The format lines specifically mean to hardcode as little about
311 the actual algorithm as possible. Therefore, we can easily change
312 the hash used or the exact split structure without having to update
313 the package managers or even provide a compatibility layout.
314
315 The file is also open for future extensions to provide additional mirror
316 metadata. However, no clear use for that has been determined so far.
317
318
319 Hash algorithm
320 --------------
321 The hash algorithm support is fully deferred to the existing code
322 in the package managers that is required to handle Manifests.
323 In particular, it is recommended to reuse one of the hashes that are
324 used in Manifest entries at the time. This avoids code duplication
325 and reuses an existing mechanism to handle hash upgrades.
326
327 During the discussion, it has been pointed that this particular use case
328 does not require a cryptographically strong hash and a faster algorithm
329 could be used instead. However, given the short length of hashed
330 strings performance is not a problem, and speed does not justify
331 the resulting code duplication.
332
333 It has also been pointed out that e.g. the BLAKE2 hash family provides
334 the ability of creating arbitrary length hashes instead of truncating
335 the standard-length hash. However, not all implementations of BLAKE2
336 support that and relying on it could reduce portability for no apparent
337 gain.
338
339
340 Backwards Compatibility
341 =======================
342 Mirror compatibility
343 --------------------
344 The mirrored files are propagated to other mirrors as opaque directory
345 structure. Therefore, there are no backwards compatibility concerns
346 on the mirroring side.
347
348 Backwards compatibility with existing clients is detailed
349 in `migrating mirrors to the hashed structure`_ section. Backwards
350 compatibility with the old clients will be provided by preserving
351 the flat structure during the transitional period.
352
353 The new clients will fetch the ``layout.conf`` file to avoid backwards
354 compatibility concerns in the future. In case of hitting an old mirror,
355 the package manager will default to the ``flat`` structure.
356
357
358 Package manager storage compatibility
359 -------------------------------------
360 The exact means of preserving backwards compatibility in package manager
361 storage are left to the package manager authors. However, it is
362 recommended that package managers continue to support the flat layout
363 even if it is no longer the default. The package manager may either
364 continue to read files from this location or automatically move them
365 to an appropriate subdirectory.
366
367
368 Reference Implementation
369 ========================
370 TODO.
371
372
373 References
374 ==========
375 .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file
376 (https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html)
377
378 .. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
379 Checksum algorithms (informational)
380 (https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational)
381
382 .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
383 (https://www.gentoo.org/glep/glep-0059.html)
384
385 .. [#BUG534528] Bug 534528 - distfiles should be sorted into subdirectories
386 of DISTDIR
387 (https://bugs.gentoo.org/534528)
388
389 .. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
390 (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
391
392 .. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
393 for each directory computed in a way to have the files distributed evenly'
394 (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
395
396 .. [#PKGNAME] Jason Zamal's reply including 'using the same dir layout
397 as the packages themselves)
398 (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
399
400
401 Copyright
402 =========
403 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
404 Unported License. To view a copy of this license, visit
405 http://creativecommons.org/licenses/by-sa/3.0/.
406
407 --
408 Best regards,
409 Michał Górny

Replies