Gentoo Archives: gentoo-dev

From: R0b0t1 <r030t1@×××××.com>
To: "gentoo-dev@l.g.o" <gentoo-dev@l.g.o>
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2)
Date: Mon, 29 Jan 2018 20:26:17
Message-Id: CAAD4mYj=ZDr3WFopvrqHoh7Cg+MeDS+yUKtTMn+-p=8DJuymgw@mail.gmail.com
In Reply to: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2) by "Michał Górny"
1 On Monday, January 29, 2018, Michał Górny <mgorny@g.o> wrote:
2 > Here's an updated version. I've tried to incorporate most
3 > of the feedback so far.
4 >
5 >
6 > ---
7 > GLEP: 75
8 > Title: Split distfile mirror directory structure
9 > Author: Michał Górny <mgorny@g.o>,
10 > Robin H. Johnson <robbat2@g.o>
11 > Type: Standards Track
12 > Status: Draft
13 > Version: 1
14 > Created: 2018-01-26
15 > Last-Modified: 2018-01-27
16 > Post-History: 2018-01-27
17 > Content-Type: text/x-rst
18 > ---
19 >
20 > Abstract
21 > ========
22 > This GLEP describes the procedure for splitting the distfiles on mirrors
23 > into multiple directories with the goal of reducing the number of files
24 > in a single directory.
25 >
26 >
27 > Motivation
28 > ==========
29 > At the moment, both the package manager and Gentoo mirrors use flat
30 > directory structure to store files. While this solution usually works,
31 > it does not scale well. Directories with large number of files usually
32 > have significant performance penalty, unless using filesystems
33 > specifically designed for that purpose.
34 >
35 > According to the Gentoo repository state at 2018-01-26 16:23, there
36 > was a total of 62652 unique distfiles in the repository. While
37 > the users realistically hit around 10% of that, distfile mirrors often
38 > hold even more files --- more so if old distfiles are not wiped
39 > immediately.
40 >
41 > While all filesystems used on Linux boxes should be able to cope with
42 > a number that large, they may suffer a performance penalty with even
43 > a few thousand files. Additionally, if mirrors enable directory indexes
44 > then generating the index imposes both a significant server overhead
45 > and a significant data transfer. At this moment, the index
46 > of distfiles.gentoo.org has around 17 MiB.
47 >
48 > Splitting the distfiles into multiple directories makes it possible
49 > to avoid those problems by reducing the number of files in a single
50 > directory. For example, splitting the forementioned set of distfiles
51 > into 16 directories that are roughly balanced allows to reduce
52 > the number of files in a single directory to around 4000. Splitting
53 > them further into 256 directories (16x16) results in 200-300 files
54 > per directory which should avoid any performance problems long-term,
55 > even assuming 300% growth of number of distfiles.
56 >
57 >
58 > Specification
59 > =============
60 > Mirror layout file
61 > ------------------
62 > A mirror adhering to this specification should include a ``layout.conf``
63 > file in the top distfile directory. This file uses the format
64 > derived from the freedesktop Desktop Entry Specification file format
65 > [#DESKTOP_FORMAT]_.
66 >
67 > Before using each Gentoo mirror, the package manager should attempt
68 > to fetch (update) its ``layout.conf`` file and process it to determine
69 > how to use the mirror. If the file is not present, the package manager
70 > should behave as if it were empty.
71 >
72 > The package manager should recognize the sections and keys listed below.
73 > It should ignore any unrecognized sections or keys --- the format
74 > is intended to account for future extensions.
75 >
76 > This specification currently defines one section: ``[structure]``.
77 > This section defines one or more repository structure definitions
78 > using non-negative sequential integer keys. The definition with
79 > the ``0`` key is the most preferred structure. The package manager
80 > should ignore any formats it does not recognize. If this section
81 > is not present, the package manager should behave as if only ``flat``
82 > structure were specified.
83 >
84 > The following structure definitions are supported:
85 >
86 > * ``flat`` to indicate the traditional flat structure where all
87 > distfiles are located in the top directory,
88 >
89 > * ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
90 > hash structure`_ explained below.
91 >
92 >
93 > Filename hash structure
94 > -----------------------
95 > When using the filename hash structure, the distfiles are split
96 > into directories whose names are derived from the hash of distfile
97 > filename. This structure has two parameters: *algorithm name*
98 > and *cutoffs* list.
99 >
100 > The algorithm name must correspond to a valid Manifest hash name.
101 > An informational list of hashes is included in GLEP 74 [#GLEP74]_,
102 > and the policies for introducing new hashes are covered by GLEP 59
103 > [#GLEP59]_.
104 >
105 > The cutoffs list specifies one or more integers separated by colons
106 > (``:``), indicating the number of bits (starting with the most
107 > significant bit) of the hash used to form subsequent subdirectory names.
108 > For example, the list of ``2:4`` would indicate that top-level directory
109 > names are formed using 2 most significant bits of the hash (resulting
110 > in 2² = 4 directories), and each of this directories would have
111 > subdirectories formed using the next 4 bits of the hash (resulting
112 > in 2⁴ = 8 subdirectories each).
113 >
114 > The exact algorithm for determining the distfile location follows:
115 >
116 > 1. Let the distfile filename be **F**.
117 >
118 > 2. Compute the hash of **F** and store its binary value as **H**.
119 >
120 > 3. For each integer **C** in cutoff list:
121 >
122 > a. Take **C** most significant bits of **H** and store them as **V**.
123 >
124 > b. Convert **V** into hexadecimal integer, left padded with zeros
125 > to **C/4** digits (rounded up) and append it to the path, followed
126 > by the path separator.
127 >
128 > c. Shift **H** left **C** bits.
129 >
130 > 4. Finally, append **F** to the obtained path.
131 >
132 > In particular, note that when using nested directories
133 > the subdirectories do not repeat the hash bits used in parent directory.
134 >
135 >
136 > Migrating mirrors to the hashed structure
137 > -----------------------------------------
138 > Since all distfile mirrors sync to the master Gentoo mirror, it should
139 > be enough to perform all the needed changes on the master mirror
140 > and wait for other mirrors to sync. The following procedure
141 > is recommended:
142 >
143 > 1. Include the initial ``layout.conf`` listing only ``flat`` layout.
144 >
145 > 2. Create the new structure alongside the flat structure. Wait for
146 > mirrors to sync.
147 >
148 > 3. Once all mirrors receive the new structure, update ``layout.conf``
149 > to list the ``filename-hash`` structure.
150 >
151 > 4. Once a version of Portage supporting the new structure is stable long
152 > enough, remove the fallback ``flat`` structure from ``layout.conf``
153 > and duplicate distfiles.
154 >
155 > This implies that during the migration period the distfiles will
156 > be stored duplicated on the mirrors and therefore will occupy twice
157 > as much space. Technically, this could be avoided either by using
158 > hard links or symbolic links.
159 >
160 > The hard link solution allows us to save space on the master mirror.
161 > Additionally, if ``-H`` option is used by the mirrors it avoids
162 > transferring existing files again. However, this option is known
163 > to be expensive and could cause significant server load. Without it,
164 > all mirrors need to transfer a second copy of all the existing files.
165 >
166 > The symbolic link solution could be more reliable if we could rely
167 > on mirrors using the ``--links`` rsync option. Without that, symbolic
168 > links are not transferred at all.
169 >
170 >
171 > Using hashed structure for local distfiles
172 > ------------------------------------------
173 > The hashed structure defined above could also be used for local distfile
174 > storage as used by the package manager. For this to work, the package
175 > manager authors need to ensure that:
176 >
177 > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
178 > directory where distfiles specific to the package are linked
179 > in a flat structure.
180 >
181 > b. All tools are updated to support the nested structure.
182 >
183 > c. The package manager provides a tool for users to easily manipulate
184 > distfiles, in particular to add distfiles for fetch-restricted
185 > packages into an appropriate subdirectory.
186 >
187 > For extended compatibility, the package manager may support finding
188 > distfiles in flat and nested structure simultaneously.
189 >
190 >
191 > Rationale
192 > =========
193 > Algorithm for splitting distfiles
194 > ---------------------------------
195 > The possible algorithms were considered with the following goals
196 > in mind:
197 >
198 > - the number of files in a single directory should not exceed 1000,
199 >
200 > - the total size of files in a single directory is not considered
201 > relevant,
202 >
203 > - the solution should preferably be future-proof,
204 >
205 > - moving distfiles should be avoided once it is deployed.
206 >
207 > It should also be noted that at this moment the package having most
208 > distfiles in Gentoo at the time is dev-texlive/texlive-latexextra,
209 > with the number of 8556 distfiles. All of them start with a common
210 > prefix of ``texlive-module-``. This specific prefix is used by a total
211 > of 23435 distfiles.
212 >
213 > In the original debate that occurred in bug #534528 [#BUG534528]_
214 > and the mailing list review of the initial version of this GLEP [#ML1]_,
215 > four fundamental ideas for splitting distfiles were listed:
216 >
217 > a. using initial portion of filename,
218 >
219 > b. using initial portion of file hash,
220 >
221 > c. using initial portion of filename hash,
222 >
223 > d. using package category (and package name).
224 >
225 > The initial filename idea was to use the first character of filename,
226 > possibly followed by a longer part which was the idea historically
227 > used e.g. by PyPI Python package hosting. Its main advantage is
228 > simplicity. The users can easily determine the correct subdirectory
229 > by just looking at the distfile name. Sadly, this solution is not only
230 > very uneven but does not solve the problem. As mentioned above,
231 > the TeΧ Live packages share a long common prefix that make it impossible
232 > to split it properly with other packages on fixed-length prefixes.
233 >
234 > This idea has been followed by an adaptive proposal by Andrew Barchuk
235 > [#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
236 > mapped to groups by a common prefix but instead each group contains
237 > all files between two prefixes being used (like in a dictionary).
238 > However, it has been pointed out that while this option can provide
239 > very even results initially, it is impossible to predict how it would
240 > be affected by future distfile changes and there will be a risk of
241 > needing to change the groups in the future. Furthermore, it is
242 > relatively complex and requires explicitly listing or obtaining used
243 > groups.
244 >
245 > Another option was to use an initial portion of distfile hashes. Its
246 > main advantage is that cryptographic hash algorithms can provide
247 > a more balanced split with random data. Furthermore, since hashes are
248 > stored in Manifests using them has no cost for users. However, this
249 > solution has three disadvantages:
250 >
251 > 1. Not all files in the distfile tree are covered by package Manifests.
252 > Additional files are injected into the mirrors, and those will
253 > not have a clearly-defined location.
254 >
255 > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
256 > hash mismatches would be placed in the wrong subdirectory,
257 > potentially causing confusing errors.
258 >
259 > 3. The hash values are unknown for newly-downloaded distfiles, so
260 > ``repoman`` (or an equivalent tool) would have to use a temporary
261 > directory before locating the file in appropriate subdirectory.
262 >
263 > Using filename hashes has proven to provide a similar balance to using
264 > file hashes. Furthermore, since filenames are known up front this
265 > solution does not suffer from the listed problems. While hashes need
266 > to be computed manually, hashing short string should not cause
267 > any performance problems.
268 >
269 > Jason Zaman has suggested to use package categories (and package names)
270 > [#PKGNAME]_. However, this solution has multiple problems:
271 >
272 > a. it does not solve the problem for large packages such as TeΧ Live,
273 >
274 > b. it introduces many unnecessarily small directories,
275 >
276 > c. it requires an explicit knowledge of which package distfiles
277 > belong to,
278 >
279 > d. it does not provide an explicit solution to the problem of distfiles
280 > shared by multiple packages,
281 >
282 > e. it does not provide a solution to the problem of injected distfiles.
283 >
284 > All the options considered, the filename hash solution was selected
285 > as one that solves all the forementioned problems while introducing
286 > relatively low complexity and being reasonably future-proof.
287 >
288 > .. figure:: glep-0075-extras/by-filename.png
289 >
290 > Distribution of distfiles by first character of filenames
291 > (note: y axis is on log scale)
292 >
293 > .. figure:: glep-0075-extras/by-csum.png
294 >
295 > Distribution of distfiles by first hex-digit of checksum
296 > (x --- content checksum, + --- filename checksum)
297 >
298 > .. figure:: glep-0075-extras/by-csum2.png
299 >
300 > Distribution of distfiles by two first hex-digits of checksum
301 > (x --- content checksum, + --- filename checksum)
302 >
303 >
304 > Layout file
305 > -----------
306 > The presence of control file has been suggested in the original
307 > discussion. Its main purpose is to let package managers cleanly handle
308 > the migration and detect how to correctly query the mirrors throughout
309 > it. Furthermore, it makes future changes easier.
310 >
311 > The format lines specifically mean to hardcode as little about
312 > the actual algorithm as possible. Therefore, we can easily change
313 > the hash used or the exact split structure without having to update
314 > the package managers or even provide a compatibility layout.
315 >
316 > The file is also open for future extensions to provide additional mirror
317 > metadata. However, no clear use for that has been determined so far.
318 >
319 >
320 > Hash algorithm
321 > --------------
322 > The hash algorithm support is fully deferred to the existing code
323 > in the package managers that is required to handle Manifests.
324 > In particular, it is recommended to reuse one of the hashes that are
325 > used in Manifest entries at the time. This avoids code duplication
326 > and reuses an existing mechanism to handle hash upgrades.
327 >
328 > During the discussion, it has been pointed that this particular use case
329 > does not require a cryptographically strong hash and a faster algorithm
330 > could be used instead. However, given the short length of hashed
331 > strings performance is not a problem, and speed does not justify
332 > the resulting code duplication.
333 >
334 > It has also been pointed out that e.g. the BLAKE2 hash family provides
335 > the ability of creating arbitrary length hashes instead of truncating
336 > the standard-length hash. However, not all implementations of BLAKE2
337 > support that and relying on it could reduce portability for no apparent
338 > gain.
339 >
340 >
341 > Backwards Compatibility
342 > =======================
343 > Mirror compatibility
344 > --------------------
345 > The mirrored files are propagated to other mirrors as opaque directory
346 > structure. Therefore, there are no backwards compatibility concerns
347 > on the mirroring side.
348 >
349 > Backwards compatibility with existing clients is detailed
350 > in `migrating mirrors to the hashed structure`_ section. Backwards
351 > compatibility with the old clients will be provided by preserving
352 > the flat structure during the transitional period.
353 >
354 > The new clients will fetch the ``layout.conf`` file to avoid backwards
355 > compatibility concerns in the future. In case of hitting an old mirror,
356 > the package manager will default to the ``flat`` structure.
357 >
358 >
359 > Package manager storage compatibility
360 > -------------------------------------
361 > The exact means of preserving backwards compatibility in package manager
362 > storage are left to the package manager authors. However, it is
363 > recommended that package managers continue to support the flat layout
364 > even if it is no longer the default. The package manager may either
365 > continue to read files from this location or automatically move them
366 > to an appropriate subdirectory.
367 >
368 >
369 > Reference Implementation
370 > ========================
371 > TODO.
372 >
373 >
374 > References
375 > ==========
376 > .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file
377 > (
378 https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html)
379 >
380 > .. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
381 > Checksum algorithms (informational)
382 > (
383 https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational
384 )
385 >
386 > .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
387 > (https://www.gentoo.org/glep/glep-0059.html)
388 >
389 > .. [#BUG534528] Bug 534528 - distfiles should be sorted into
390 subdirectories
391 > of DISTDIR
392 > (https://bugs.gentoo.org/534528)
393 >
394 > .. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory
395 structure
396 > (
397 https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba
398 )
399 >
400 > .. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
401 > for each directory computed in a way to have the files distributed
402 evenly'
403 > (
404 https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8
405 )
406 >
407 > .. [#PKGNAME] Jason Zamal's reply including 'using the same dir layout
408 > as the packages themselves)
409 > (
410 https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975
411 )
412 >
413 >
414 > Copyright
415 > =========
416 > This work is licensed under the Creative Commons Attribution-ShareAlike
417 3.0
418 > Unported License. To view a copy of this license, visit
419 > http://creativecommons.org/licenses/by-sa/3.0/.
420 >
421 > --
422 > Best regards,
423 > Michał Górny
424 >
425
426 It's going to be hash based? Why? I tried to follow the conversation but
427 there's now close to 5 of these posts in the mailing list with different
428 conversations in each.
429
430 Using filename prefixes is boring and not uniform, but I feel I should
431 point out that most distfile hosts are still doing fine. Microoptimizing
432 this seems like wasted effort.
433
434 Cheers,
435 R0b0t1

Replies