Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev <gentoo-dev@l.g.o>
Subject: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
Date: Fri, 26 Jan 2018 23:24:53
Message-Id: 1517009079.31015.3.camel@gentoo.org
1 Hi, everyone.
2
3 Here's a little new something we've been silently debating back
4 in the day, then forgotten about it, then I've written a GLEP about it.
5 The number's not official yet.
6
7 HTML (with plots): https://dev.gentoo.org/~mgorny/tmp/glep-0075.html
8
9 ---
10 GLEP: 75
11 Title: Split distfile mirror directory structure
12 Author:
13 Michał Górny <mgorny@g.o>,
14 Robin H. Johnson <robbat2@gento
15 o.org>
16 Type: Standards Track
17 Status: Draft
18 Version: 1
19 Created: 2018-01-26
20 Las
21 t-Modified: 2018-01-27
22 Post-History: 2018-01-27
23 Content-Type: text/x-rst
24 --
25 -
26
27 Abstract
28 ========
29 This GLEP describes the procedure for splitting the distfiles on mirrors
30 into multiple directories with the goal of reducing the number of files
31 in a single directory.
32
33
34 Motivation
35 ==========
36 At the moment, both the package manager and Gentoo mirrors use flat
37 directory structure to store files. While this solution usually works,
38 it does not scale well. Directories with large number of files usually
39 have significant performance penalty, unless using filesystems
40 specifically designed for that purpose.
41
42 According to the Gentoo repository state at 2018-01-26 16:23, there
43 was a total of 62652 unique distfiles in the repository. While
44 the users realistically hit around 10% of that, distfile mirrors often
45 hold even more files --- more so if old distfiles are not wiped
46 immediately.
47
48 While all filesystems used on Linux boxes should be able to cope with
49 a number that large, they may suffer a performance penalty with even
50 a few thousand files. Additionally, if mirrors enable directory indexes
51 then generating the index imposes both a significant server overhead
52 and a significant data transfer. At this moment, the index
53 of distfiles.gentoo.org has around 17 MiB.
54
55 Splitting the distfiles into multiple directories makes it possible
56 to avoid those problems by reducing the number of files in a single
57 directory. For example, splitting the forementioned set of distfiles
58 into 16 directories that are roughly balanced allows to reduce
59 the number of files in a single directory to around 4000. Splitting
60 them further into 256 directories (16x16) results in 200-300 files
61 per directory which should avoid any performance problems long-term,
62 even assuming 300% growth of number of distfiles.
63
64
65 Specification
66 =============
67 Mirror layout file
68 ------------------
69 A mirror adhering to this specification should include a ``layout.conf``
70 file in the top distfile directory. This file uses the format
71 derived from the freedesktop Desktop Entry Specification file format
72 [#DESKTOP_FORMAT]_.
73
74 Before using each Gentoo mirror, the package manager should attempt
75 to fetch (update) its ``layout.conf`` file and process it to determine
76 how to use the mirror. If the file is not present, the package manager
77 should behave as if it were empty.
78
79 The package manager should recognize the sections and keys listed below.
80 It should ignore any unrecognized sections or keys --- the format
81 is intended to account for future extensions.
82
83 This specification currently defines one section: ``[structure]``.
84 This section defines one or more repository structure definitions
85 using sequential integer keys. The definition keyed as ``0``
86 is the most preferred structure. The package manager should use
87 the first structure format it recognizes as supported, and ignore any
88 it does not recognize. If this section is not present, the package
89 manager should behave as if only ``flat`` structure were supported.
90
91 The following structure definitions are supported:
92
93 * ``flat`` to indicate the traditional flat structure where all
94 distfiles are located in the top directory,
95
96 * ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
97 hash structure`_ explained below.
98
99
100 Filename hash structure
101 -----------------------
102 When using the filename hash structure, the distfiles are split
103 into directories whose names are derived from the hash of distfile
104 filename. This structure has two parameters: *algorithm name*
105 and *cutoffs* list.
106
107 The algorithm name must correspond to a valid Manifest hash name.
108 An informational list of hashes is included in GLEP 74 [#GLEP74]_,
109 and the policies for introducing new hashes are covered by GLEP 59
110 [#GLEP59]_.
111
112 The cutoffs list specifies one or more integers separated by colons
113 (``:``), indicating the number of bits (starting with the most
114 significant bit) of the hash used to form subsequent subdirectory names.
115 For example, the list of ``2:4`` would indicate that top-level directory
116 names are formed using 2 most significant bits of the hash (resulting
117 in 2² = 4 directories), and each of this directories would have
118 subdirectories formed using the next 4 bits of the hash (resulting
119 in 2⁴ = 8 subdirectories each).
120
121 The exact algorithm for determining the distfile location follows:
122
123 1. Let the distfile filename be **F**.
124
125 2. Compute the hash of **F** and store its binary value as **H**.
126
127 3. For each integer **C** in cutoff list:
128
129 a. Take **C** most significant bits of **H** and store them as **V**.
130
131 b. Convert **V** into hexadecimal integer, left padded with zeros
132 to **C/4** digits (rounded up) and append it to the path, followed
133 by the path separator.
134
135 c. Shift **H** left **C** bits.
136
137 4. Finally, append **F** to the obtained path.
138
139 In particular, note that when using nested directories
140 the subdirectories do not repeat the hash bits used in parent directory.
141
142
143 Migrating mirrors to the hashed structure
144 -----------------------------------------
145 Since all distfile mirrors sync to the master Gentoo mirror, it should
146 be enough to perform all the needed changes on the master mirror
147 and wait for other mirrors to sync. The following procedure
148 is recommended:
149
150 1. Include the initial ``layout.conf`` listing only ``flat`` layout.
151
152 2. Create the new structure alongside the flat structure. Wait for
153 mirrors to sync.
154
155 3. Once all mirrors receive the new structure, update ``layout.conf``
156 to list the ``filename-hash`` structure.
157
158 4. Once a version of Portage supporting the new structure is stable long
159 enough, remove the fallback ``flat`` structure from ``layout.conf``
160 and duplicate distfiles.
161
162 This implies that during the migration period the distfiles will
163 be stored duplicated on the mirrors and therefore will occupy twice
164 as much space. Technically, this could be avoided either by using
165 hard links or symbolic links.
166
167 The hard link solution allows us to save space on the master mirror.
168 Additionally, if ``-H`` option is used by the mirrors it avoids
169 transferring existing files again. However, this option is known
170 to be expensive and could cause significant server load. Without it,
171 all mirrors need to transfer a second copy of all the existing files.
172
173 The symbolic link solution could be more reliable if we could rely
174 on mirrors using the ``--links`` rsync option. Without that, symbolic
175 links are not transferred at all.
176
177
178 Using hashed structure for local distfiles
179 ------------------------------------------
180 The hashed structure defined above could also be used for local distfile
181 storage as used by the package manager. For this to work, the package
182 manager authors need to ensure that:
183
184 a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
185 directory where distfiles specific to the package are linked
186 in a flat structure.
187
188 b. All tools are updated to support the nested structure.
189
190 c. The package manager provides a tool for users to easily manipulate
191 distfiles, in particular to add distfiles for fetch-restricted
192 packages into an appropriate subdirectory.
193
194 For extended compatibility, the package manager may support finding
195 distfiles in flat and nested structure simultaneously.
196
197
198 Rationale
199 =========
200 Algorithm for splitting distfiles
201 ---------------------------------
202 In the original debate that occurred in bug #534528 [#BUG534528]_,
203 three possible solutions for splitting distfiles were listed:
204
205 a. using initial portion of filename,
206
207 b. using initial portion of file hash,
208
209 c. using initial portion of filename hash.
210
211 The significant advantage of the filename option was simplicity. With
212 that solution, the users could easily determine the correct subdirectory
213 themselves. However, it's significant disadvantage was very uneven
214 shuffling of data. In particular, the TeΧ Live packages alone count
215 almost 23500 distfiles and all use a common prefix, making it impossible
216 to split them further.
217
218 The alternate option of using file hash has the advantage of having
219 a more balanced split. Furthermore, since hashes are stored
220 in Manifests using them is zero-cost. However, this solution has two
221 significant disadvantages:
222
223 1. The hash values are unknown for newly-downloaded distfiles, so
224 ``repoman`` (or an equivalent tool) would have to use a temporary
225 directory before locating the file in appropriate subdirectory.
226
227 2. User-provided distfiles (e.g. for fetch-restricted packages) with
228 hash mismatches would be placed in the wrong subdirectory,
229 potentially causing confusing errors.
230
231 Using filename hashes has proven to provide a similar balance
232 to using file hashes. Furthermore, since filenames are known up front
233 this solution does not suffer from the both listed problems. While
234 hashes need to be computed manually, hashing short string should not
235 cause any performance problems.
236
237 .. figure:: glep-0075-extras/by-filename.png
238
239 Distribution of distfiles by first character of filenames
240
241 .. figure:: glep-0075-extras/by-csum.png
242
243 Distribution of distfiles by first hex-digit of checksum
244 (x --- content checksum, + --- filename checksum)
245
246 .. figure:: glep-0075-extras/by-csum2.png
247
248 Distribution of distfiles by two first hex-digits of checksum
249 (x --- content checksum, + --- filename checksum)
250
251
252 Layout file
253 -----------
254 The presence of control file has been suggested in the original
255 discussion. Its main purpose is to let package managers cleanly handle
256 the migration and detect how to correctly query the mirrors throughout
257 it. Furthermore, it makes future changes easier.
258
259 The format lines specifically mean to hardcode as little about
260 the actual algorithm as possible. Therefore, we can easily change
261 the hash used or the exact split structure without having to update
262 the package managers or even provide a compatibility layout.
263
264 The file is also open for future extensions to provide additional mirror
265 metadata. However, no clear use for that has been determined so far.
266
267
268 Hash algorithm
269 --------------
270 The hash algorithm support is fully deferred to the existing code
271 in the package managers that is required to handle Manifests.
272 In particular, it is recommended to reuse one of the hashes that are
273 used in Manifest entries at the time. This avoids code duplication
274 and reuses an existing mechanism to handle hash upgrades.
275
276 During the discussion, it has been pointed that this particular use case
277 does not require a cryptographically strong hash and a faster algorithm
278 could be used instead. However, given the short length of hashed
279 strings performance is not a problem, and speed does not justify
280 the resulting code duplication.
281
282 It has also been pointed out that e.g. the BLAKE2 hash family provides
283 the ability of creating arbitrary length hashes instead of truncating
284 the standard-length hash. However, not all implementations of BLAKE2
285 support that and relying on it could reduce portability for no apparent
286 gain.
287
288
289 Backwards Compatibility
290 =======================
291 Mirror compatibility
292 --------------------
293 The mirrored files are propagated to other mirrors as opaque directory
294 structure. Therefore, there are no backwards compatibility concerns
295 on the mirroring side.
296
297 Backwards compatibility with existing clients is detailed
298 in `migrating mirrors to the hashed structure`_ section. Backwards
299 compatibility with the old clients will be provided by preserving
300 the flat structure during the transitional period.
301
302 The new clients will fetch the ``layout.conf`` file to avoid backwards
303 compatibility concerns in the future. In case of hitting an old mirror,
304 the package manager will default to the ``flat`` structure.
305
306
307 Package manager storage compatibility
308 -------------------------------------
309 The exact means of preserving backwards compatibility in package manager
310 storage are left to the package manager authors. However, it is
311 recommended that package managers continue to support the flat layout
312 even if it is no longer the default. The package manager may either
313 continue to read files from this location or automatically move them
314 to an appropriate subdirectory.
315
316
317 Reference Implementation
318 ========================
319 TODO.
320
321
322 References
323 ==========
324 .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file
325 (https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html)
326
327 .. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
328 Checksum algorithms (informational)
329 (https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational)
330
331 .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
332 (https://www.gentoo.org/glep/glep-0059.html)
333
334 .. [#BUG534528] Bug 534528 - distfiles should be sorted into subdirectories
335 of DISTDIR
336 (https://bugs.gentoo.org/534528)
337
338
339 Copyright
340 =========
341 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
342 Unported License. To view a copy of this license, visit
343 http://creativecommons.org/licenses/by-sa/3.0/.
344
345 --
346 Best regards,
347 Michał Górny

Replies