Gentoo Archives: gentoo-dev

From: Alec Warner <antarus@g.o>
To: Gentoo Dev <gentoo-dev@l.g.o>
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2)
Date: Mon, 29 Jan 2018 21:00:35
Message-Id: CAAr7Pr-1PRaeK2--rnNnTFmKhH6WitztoLambUq1tRCK1wjwUg@mail.gmail.com
In Reply to: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2) by R0b0t1
1 On Mon, Jan 29, 2018 at 3:26 PM, R0b0t1 <r030t1@×××××.com> wrote:
2
3 > On Monday, January 29, 2018, Michał Górny <mgorny@g.o> wrote:
4 > > Here's an updated version. I've tried to incorporate most
5 > > of the feedback so far.
6 > >
7 > >
8 > > ---
9 > > GLEP: 75
10 > > Title: Split distfile mirror directory structure
11 > > Author: Michał Górny <mgorny@g.o>,
12 > > Robin H. Johnson <robbat2@g.o>
13 > > Type: Standards Track
14 > > Status: Draft
15 > > Version: 1
16 > > Created: 2018-01-26
17 > > Last-Modified: 2018-01-27
18 > > Post-History: 2018-01-27
19 > > Content-Type: text/x-rst
20 > > ---
21 > >
22 > > Abstract
23 > > ========
24 > > This GLEP describes the procedure for splitting the distfiles on mirrors
25 > > into multiple directories with the goal of reducing the number of files
26 > > in a single directory.
27 > >
28 > >
29 > > Motivation
30 > > ==========
31 > > At the moment, both the package manager and Gentoo mirrors use flat
32 > > directory structure to store files. While this solution usually works,
33 > > it does not scale well. Directories with large number of files usually
34 > > have significant performance penalty, unless using filesystems
35 > > specifically designed for that purpose.
36 > >
37 > > According to the Gentoo repository state at 2018-01-26 16:23, there
38 > > was a total of 62652 unique distfiles in the repository. While
39 > > the users realistically hit around 10% of that, distfile mirrors often
40 > > hold even more files --- more so if old distfiles are not wiped
41 > > immediately.
42 > >
43 > > While all filesystems used on Linux boxes should be able to cope with
44 > > a number that large, they may suffer a performance penalty with even
45 > > a few thousand files. Additionally, if mirrors enable directory indexes
46 > > then generating the index imposes both a significant server overhead
47 > > and a significant data transfer. At this moment, the index
48 > > of distfiles.gentoo.org has around 17 MiB.
49 > >
50 > > Splitting the distfiles into multiple directories makes it possible
51 > > to avoid those problems by reducing the number of files in a single
52 > > directory. For example, splitting the forementioned set of distfiles
53 > > into 16 directories that are roughly balanced allows to reduce
54 > > the number of files in a single directory to around 4000. Splitting
55 > > them further into 256 directories (16x16) results in 200-300 files
56 > > per directory which should avoid any performance problems long-term,
57 > > even assuming 300% growth of number of distfiles.
58 > >
59 > >
60 > > Specification
61 > > =============
62 > > Mirror layout file
63 > > ------------------
64 > > A mirror adhering to this specification should include a ``layout.conf``
65 > > file in the top distfile directory. This file uses the format
66 > > derived from the freedesktop Desktop Entry Specification file format
67 > > [#DESKTOP_FORMAT]_.
68 > >
69 > > Before using each Gentoo mirror, the package manager should attempt
70 > > to fetch (update) its ``layout.conf`` file and process it to determine
71 > > how to use the mirror. If the file is not present, the package manager
72 > > should behave as if it were empty.
73 > >
74 > > The package manager should recognize the sections and keys listed below.
75 > > It should ignore any unrecognized sections or keys --- the format
76 > > is intended to account for future extensions.
77 > >
78 > > This specification currently defines one section: ``[structure]``.
79 > > This section defines one or more repository structure definitions
80 > > using non-negative sequential integer keys. The definition with
81 > > the ``0`` key is the most preferred structure. The package manager
82 > > should ignore any formats it does not recognize. If this section
83 > > is not present, the package manager should behave as if only ``flat``
84 > > structure were specified.
85 > >
86 > > The following structure definitions are supported:
87 > >
88 > > * ``flat`` to indicate the traditional flat structure where all
89 > > distfiles are located in the top directory,
90 > >
91 > > * ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
92 > > hash structure`_ explained below.
93 > >
94 > >
95 > > Filename hash structure
96 > > -----------------------
97 > > When using the filename hash structure, the distfiles are split
98 > > into directories whose names are derived from the hash of distfile
99 > > filename. This structure has two parameters: *algorithm name*
100 > > and *cutoffs* list.
101 > >
102 > > The algorithm name must correspond to a valid Manifest hash name.
103 > > An informational list of hashes is included in GLEP 74 [#GLEP74]_,
104 > > and the policies for introducing new hashes are covered by GLEP 59
105 > > [#GLEP59]_.
106 > >
107 > > The cutoffs list specifies one or more integers separated by colons
108 > > (``:``), indicating the number of bits (starting with the most
109 > > significant bit) of the hash used to form subsequent subdirectory names.
110 > > For example, the list of ``2:4`` would indicate that top-level directory
111 > > names are formed using 2 most significant bits of the hash (resulting
112 > > in 2² = 4 directories), and each of this directories would have
113 > > subdirectories formed using the next 4 bits of the hash (resulting
114 > > in 2⁴ = 8 subdirectories each).
115 > >
116 > > The exact algorithm for determining the distfile location follows:
117 > >
118 > > 1. Let the distfile filename be **F**.
119 > >
120 > > 2. Compute the hash of **F** and store its binary value as **H**.
121 > >
122 > > 3. For each integer **C** in cutoff list:
123 > >
124 > > a. Take **C** most significant bits of **H** and store them as **V**.
125 > >
126 > > b. Convert **V** into hexadecimal integer, left padded with zeros
127 > > to **C/4** digits (rounded up) and append it to the path, followed
128 > > by the path separator.
129 > >
130 > > c. Shift **H** left **C** bits.
131 > >
132 > > 4. Finally, append **F** to the obtained path.
133 > >
134 > > In particular, note that when using nested directories
135 > > the subdirectories do not repeat the hash bits used in parent directory.
136 > >
137 > >
138 > > Migrating mirrors to the hashed structure
139 > > -----------------------------------------
140 > > Since all distfile mirrors sync to the master Gentoo mirror, it should
141 > > be enough to perform all the needed changes on the master mirror
142 > > and wait for other mirrors to sync. The following procedure
143 > > is recommended:
144 > >
145 > > 1. Include the initial ``layout.conf`` listing only ``flat`` layout.
146 > >
147 > > 2. Create the new structure alongside the flat structure. Wait for
148 > > mirrors to sync.
149 > >
150 > > 3. Once all mirrors receive the new structure, update ``layout.conf``
151 > > to list the ``filename-hash`` structure.
152 > >
153 > > 4. Once a version of Portage supporting the new structure is stable long
154 > > enough, remove the fallback ``flat`` structure from ``layout.conf``
155 > > and duplicate distfiles.
156 > >
157 > > This implies that during the migration period the distfiles will
158 > > be stored duplicated on the mirrors and therefore will occupy twice
159 > > as much space. Technically, this could be avoided either by using
160 > > hard links or symbolic links.
161 > >
162 > > The hard link solution allows us to save space on the master mirror.
163 > > Additionally, if ``-H`` option is used by the mirrors it avoids
164 > > transferring existing files again. However, this option is known
165 > > to be expensive and could cause significant server load. Without it,
166 > > all mirrors need to transfer a second copy of all the existing files.
167 > >
168 > > The symbolic link solution could be more reliable if we could rely
169 > > on mirrors using the ``--links`` rsync option. Without that, symbolic
170 > > links are not transferred at all.
171 > >
172 > >
173 > > Using hashed structure for local distfiles
174 > > ------------------------------------------
175 > > The hashed structure defined above could also be used for local distfile
176 > > storage as used by the package manager. For this to work, the package
177 > > manager authors need to ensure that:
178 > >
179 > > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
180 > > directory where distfiles specific to the package are linked
181 > > in a flat structure.
182 > >
183 > > b. All tools are updated to support the nested structure.
184 > >
185 > > c. The package manager provides a tool for users to easily manipulate
186 > > distfiles, in particular to add distfiles for fetch-restricted
187 > > packages into an appropriate subdirectory.
188 > >
189 > > For extended compatibility, the package manager may support finding
190 > > distfiles in flat and nested structure simultaneously.
191 > >
192 > >
193 > > Rationale
194 > > =========
195 > > Algorithm for splitting distfiles
196 > > ---------------------------------
197 > > The possible algorithms were considered with the following goals
198 > > in mind:
199 > >
200 > > - the number of files in a single directory should not exceed 1000,
201 > >
202 > > - the total size of files in a single directory is not considered
203 > > relevant,
204 > >
205 > > - the solution should preferably be future-proof,
206 > >
207 > > - moving distfiles should be avoided once it is deployed.
208 > >
209 > > It should also be noted that at this moment the package having most
210 > > distfiles in Gentoo at the time is dev-texlive/texlive-latexextra,
211 > > with the number of 8556 distfiles. All of them start with a common
212 > > prefix of ``texlive-module-``. This specific prefix is used by a total
213 > > of 23435 distfiles.
214 > >
215 > > In the original debate that occurred in bug #534528 [#BUG534528]_
216 > > and the mailing list review of the initial version of this GLEP [#ML1]_,
217 > > four fundamental ideas for splitting distfiles were listed:
218 > >
219 > > a. using initial portion of filename,
220 > >
221 > > b. using initial portion of file hash,
222 > >
223 > > c. using initial portion of filename hash,
224 > >
225 > > d. using package category (and package name).
226 > >
227 > > The initial filename idea was to use the first character of filename,
228 > > possibly followed by a longer part which was the idea historically
229 > > used e.g. by PyPI Python package hosting. Its main advantage is
230 > > simplicity. The users can easily determine the correct subdirectory
231 > > by just looking at the distfile name. Sadly, this solution is not only
232 > > very uneven but does not solve the problem. As mentioned above,
233 > > the TeΧ Live packages share a long common prefix that make it impossible
234 > > to split it properly with other packages on fixed-length prefixes.
235 > >
236 > > This idea has been followed by an adaptive proposal by Andrew Barchuk
237 > > [#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
238 > > mapped to groups by a common prefix but instead each group contains
239 > > all files between two prefixes being used (like in a dictionary).
240 > > However, it has been pointed out that while this option can provide
241 > > very even results initially, it is impossible to predict how it would
242 > > be affected by future distfile changes and there will be a risk of
243 > > needing to change the groups in the future. Furthermore, it is
244 > > relatively complex and requires explicitly listing or obtaining used
245 > > groups.
246 > >
247 > > Another option was to use an initial portion of distfile hashes. Its
248 > > main advantage is that cryptographic hash algorithms can provide
249 > > a more balanced split with random data. Furthermore, since hashes are
250 > > stored in Manifests using them has no cost for users. However, this
251 > > solution has three disadvantages:
252 > >
253 > > 1. Not all files in the distfile tree are covered by package Manifests.
254 > > Additional files are injected into the mirrors, and those will
255 > > not have a clearly-defined location.
256 > >
257 > > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
258 > > hash mismatches would be placed in the wrong subdirectory,
259 > > potentially causing confusing errors.
260 > >
261 > > 3. The hash values are unknown for newly-downloaded distfiles, so
262 > > ``repoman`` (or an equivalent tool) would have to use a temporary
263 > > directory before locating the file in appropriate subdirectory.
264 > >
265 > > Using filename hashes has proven to provide a similar balance to using
266 > > file hashes. Furthermore, since filenames are known up front this
267 > > solution does not suffer from the listed problems. While hashes need
268 > > to be computed manually, hashing short string should not cause
269 > > any performance problems.
270 > >
271 > > Jason Zaman has suggested to use package categories (and package names)
272 > > [#PKGNAME]_. However, this solution has multiple problems:
273 > >
274 > > a. it does not solve the problem for large packages such as TeΧ Live,
275 > >
276 > > b. it introduces many unnecessarily small directories,
277 > >
278 > > c. it requires an explicit knowledge of which package distfiles
279 > > belong to,
280 > >
281 > > d. it does not provide an explicit solution to the problem of distfiles
282 > > shared by multiple packages,
283 > >
284 > > e. it does not provide a solution to the problem of injected distfiles.
285 > >
286 > > All the options considered, the filename hash solution was selected
287 > > as one that solves all the forementioned problems while introducing
288 > > relatively low complexity and being reasonably future-proof.
289 > >
290 > > .. figure:: glep-0075-extras/by-filename.png
291 > >
292 > > Distribution of distfiles by first character of filenames
293 > > (note: y axis is on log scale)
294 > >
295 > > .. figure:: glep-0075-extras/by-csum.png
296 > >
297 > > Distribution of distfiles by first hex-digit of checksum
298 > > (x --- content checksum, + --- filename checksum)
299 > >
300 > > .. figure:: glep-0075-extras/by-csum2.png
301 > >
302 > > Distribution of distfiles by two first hex-digits of checksum
303 > > (x --- content checksum, + --- filename checksum)
304 > >
305 > >
306 > > Layout file
307 > > -----------
308 > > The presence of control file has been suggested in the original
309 > > discussion. Its main purpose is to let package managers cleanly handle
310 > > the migration and detect how to correctly query the mirrors throughout
311 > > it. Furthermore, it makes future changes easier.
312 > >
313 > > The format lines specifically mean to hardcode as little about
314 > > the actual algorithm as possible. Therefore, we can easily change
315 > > the hash used or the exact split structure without having to update
316 > > the package managers or even provide a compatibility layout.
317 > >
318 > > The file is also open for future extensions to provide additional mirror
319 > > metadata. However, no clear use for that has been determined so far.
320 > >
321 > >
322 > > Hash algorithm
323 > > --------------
324 > > The hash algorithm support is fully deferred to the existing code
325 > > in the package managers that is required to handle Manifests.
326 > > In particular, it is recommended to reuse one of the hashes that are
327 > > used in Manifest entries at the time. This avoids code duplication
328 > > and reuses an existing mechanism to handle hash upgrades.
329 > >
330 > > During the discussion, it has been pointed that this particular use case
331 > > does not require a cryptographically strong hash and a faster algorithm
332 > > could be used instead. However, given the short length of hashed
333 > > strings performance is not a problem, and speed does not justify
334 > > the resulting code duplication.
335 > >
336 > > It has also been pointed out that e.g. the BLAKE2 hash family provides
337 > > the ability of creating arbitrary length hashes instead of truncating
338 > > the standard-length hash. However, not all implementations of BLAKE2
339 > > support that and relying on it could reduce portability for no apparent
340 > > gain.
341 > >
342 > >
343 > > Backwards Compatibility
344 > > =======================
345 > > Mirror compatibility
346 > > --------------------
347 > > The mirrored files are propagated to other mirrors as opaque directory
348 > > structure. Therefore, there are no backwards compatibility concerns
349 > > on the mirroring side.
350 > >
351 > > Backwards compatibility with existing clients is detailed
352 > > in `migrating mirrors to the hashed structure`_ section. Backwards
353 > > compatibility with the old clients will be provided by preserving
354 > > the flat structure during the transitional period.
355 > >
356 > > The new clients will fetch the ``layout.conf`` file to avoid backwards
357 > > compatibility concerns in the future. In case of hitting an old mirror,
358 > > the package manager will default to the ``flat`` structure.
359 > >
360 > >
361 > > Package manager storage compatibility
362 > > -------------------------------------
363 > > The exact means of preserving backwards compatibility in package manager
364 > > storage are left to the package manager authors. However, it is
365 > > recommended that package managers continue to support the flat layout
366 > > even if it is no longer the default. The package manager may either
367 > > continue to read files from this location or automatically move them
368 > > to an appropriate subdirectory.
369 > >
370 > >
371 > > Reference Implementation
372 > > ========================
373 > > TODO.
374 > >
375 > >
376 > > References
377 > > ==========
378 > > .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the
379 > file
380 > > (https://standards.freedesktop.org/desktop-entry-
381 > spec/latest/ar01s03.html)
382 > >
383 > > .. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
384 > > Checksum algorithms (informational)
385 > > (https://www.gentoo.org/glep/glep-0074.html#checksum-
386 > algorithms-informational)
387 > >
388 > > .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
389 > > (https://www.gentoo.org/glep/glep-0059.html)
390 > >
391 > > .. [#BUG534528] Bug 534528 - distfiles should be sorted into
392 > subdirectories
393 > > of DISTDIR
394 > > (https://bugs.gentoo.org/534528)
395 > >
396 > > .. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory
397 > structure
398 > > (https://archives.gentoo.org/gentoo-dev/message/
399 > cfc4f8595df2edf9a25ba9ecae2463ba)
400 > >
401 > > .. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
402 > > for each directory computed in a way to have the files distributed
403 > evenly'
404 > > (https://archives.gentoo.org/gentoo-dev/message/
405 > 611bdaa76be049c1d650e8995748e7b8)
406 > >
407 > > .. [#PKGNAME] Jason Zamal's reply including 'using the same dir layout
408 > > as the packages themselves)
409 > > (https://archives.gentoo.org/gentoo-dev/message/
410 > f26ed870c3a6d4ecf69a821723642975)
411 > >
412 > >
413 > > Copyright
414 > > =========
415 > > This work is licensed under the Creative Commons Attribution-ShareAlike
416 > 3.0
417 > > Unported License. To view a copy of this license, visit
418 > > http://creativecommons.org/licenses/by-sa/3.0/.
419 > >
420 > > --
421 > > Best regards,
422 > > Michał Górny
423 > >
424 >
425 > It's going to be hash based? Why? I tried to follow the conversation but
426 > there's now close to 5 of these posts in the mailing list with different
427 > conversations in each.
428 >
429
430 The reasoning is embedded in the proposal; search for "Algorithm for
431 splitting distfiles". Read that section. I think it summarizes the space
432 well.
433
434 -A
435
436
437 >
438 > Using filename prefixes is boring and not uniform, but I feel I should
439 > point out that most distfile hosts are still doing fine. Microoptimizing
440 > this seems like wasted effort.
441 >
442 > Cheers,
443 > R0b0t1