1 |
commit: c20d199e05de86d4430e069d13fc7ee64b73195b |
2 |
Author: Michał Górny <mgorny <AT> gentoo <DOT> org> |
3 |
AuthorDate: Fri Jan 26 23:14:39 2018 +0000 |
4 |
Commit: Ulrich Müller <ulm <AT> gentoo <DOT> org> |
5 |
CommitDate: Wed Feb 7 13:22:05 2018 +0000 |
6 |
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=c20d199e |
7 |
|
8 |
glep-0075: Split distfile mirror directory structure |
9 |
|
10 |
glep-0075-extras/by-csum.png | Bin 0 -> 7118 bytes |
11 |
glep-0075-extras/by-csum2.png | Bin 0 -> 13362 bytes |
12 |
glep-0075-extras/by-filename.png | Bin 0 -> 6871 bytes |
13 |
glep-0075.rst | 331 +++++++++++++++++++++++++++++++++++++++ |
14 |
4 files changed, 331 insertions(+) |
15 |
|
16 |
diff --git a/glep-0075-extras/by-csum.png b/glep-0075-extras/by-csum.png |
17 |
new file mode 100644 |
18 |
index 0000000..14e8cff |
19 |
Binary files /dev/null and b/glep-0075-extras/by-csum.png differ |
20 |
|
21 |
diff --git a/glep-0075-extras/by-csum2.png b/glep-0075-extras/by-csum2.png |
22 |
new file mode 100644 |
23 |
index 0000000..64c682d |
24 |
Binary files /dev/null and b/glep-0075-extras/by-csum2.png differ |
25 |
|
26 |
diff --git a/glep-0075-extras/by-filename.png b/glep-0075-extras/by-filename.png |
27 |
new file mode 100644 |
28 |
index 0000000..a6da06d |
29 |
Binary files /dev/null and b/glep-0075-extras/by-filename.png differ |
30 |
|
31 |
diff --git a/glep-0075.rst b/glep-0075.rst |
32 |
new file mode 100644 |
33 |
index 0000000..c0a1d5c |
34 |
--- /dev/null |
35 |
+++ b/glep-0075.rst |
36 |
@@ -0,0 +1,331 @@ |
37 |
+--- |
38 |
+GLEP: 75 |
39 |
+Title: Split distfile mirror directory structure |
40 |
+Author: Michał Górny <mgorny@g.o>, |
41 |
+ Robin H. Johnson <robbat2@g.o> |
42 |
+Type: Standards Track |
43 |
+Status: Draft |
44 |
+Version: 1 |
45 |
+Created: 2018-01-26 |
46 |
+Last-Modified: 2018-01-27 |
47 |
+Post-History: 2018-01-27 |
48 |
+Content-Type: text/x-rst |
49 |
+--- |
50 |
+ |
51 |
+Abstract |
52 |
+======== |
53 |
+This GLEP describes the procedure for splitting the distfiles on mirrors |
54 |
+into multiple directories with the goal of reducing the number of files |
55 |
+in a single directory. |
56 |
+ |
57 |
+ |
58 |
+Motivation |
59 |
+========== |
60 |
+At the moment, both the package manager and Gentoo mirrors use flat |
61 |
+directory structure to store files. While this solution usually works, |
62 |
+it does not scale well. Directories with large number of files usually |
63 |
+have significant performance penalty, unless using filesystems |
64 |
+specifically designed for that purpose. |
65 |
+ |
66 |
+According to the Gentoo repository state at 2018-01-26 16:23, there |
67 |
+was a total of 62652 unique distfiles in the repository. While |
68 |
+the users realistically hit around 10% of that, distfile mirrors often |
69 |
+hold even more files --- more so if old distfiles are not wiped |
70 |
+immediately. |
71 |
+ |
72 |
+While all filesystems used on Linux boxes should be able to cope with |
73 |
+a number that large, they may suffer a performance penalty with even |
74 |
+a few thousand files. Additionally, if mirrors enable directory indexes |
75 |
+then generating the index imposes both a significant server overhead |
76 |
+and a significant data transfer. At this moment, the index |
77 |
+of distfiles.gentoo.org has around 17 MiB. |
78 |
+ |
79 |
+Splitting the distfiles into multiple directories makes it possible |
80 |
+to avoid those problems by reducing the number of files in a single |
81 |
+directory. For example, splitting the forementioned set of distfiles |
82 |
+into 16 directories that are roughly balanced allows to reduce |
83 |
+the number of files in a single directory to around 4000. Splitting |
84 |
+them further into 256 directories (16x16) results in 200-300 files |
85 |
+per directory which should avoid any performance problems long-term, |
86 |
+even assuming 300% growth of number of distfiles. |
87 |
+ |
88 |
+ |
89 |
+Specification |
90 |
+============= |
91 |
+Mirror layout file |
92 |
+------------------ |
93 |
+A mirror adhering to this specification should include a ``layout.conf`` |
94 |
+file in the top distfile directory. This file uses the format |
95 |
+derived from the freedesktop Desktop Entry Specification file format |
96 |
+[#DESKTOP_FORMAT]_. |
97 |
+ |
98 |
+Before using each Gentoo mirror, the package manager should attempt |
99 |
+to fetch (update) its ``layout.conf`` file and process it to determine |
100 |
+how to use the mirror. If the file is not present, the package manager |
101 |
+should behave as if it were empty. |
102 |
+ |
103 |
+The package manager should recognize the sections and keys listed below. |
104 |
+It should ignore any unrecognized sections or keys --- the format |
105 |
+is intended to account for future extensions. |
106 |
+ |
107 |
+This specification currently defines one section: ``[structure]``. |
108 |
+This section defines one or more repository structure definitions |
109 |
+using sequential integer keys. The definition keyed as ``0`` |
110 |
+is the most preferred structure. The package manager should use |
111 |
+the first structure format it recognizes as supported, and ignore any |
112 |
+it does not recognize. If this section is not present, the package |
113 |
+manager should behave as if only ``flat`` structure were supported. |
114 |
+ |
115 |
+The following structure definitions are supported: |
116 |
+ |
117 |
+* ``flat`` to indicate the traditional flat structure where all |
118 |
+ distfiles are located in the top directory, |
119 |
+ |
120 |
+* ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename |
121 |
+ hash structure`_ explained below. |
122 |
+ |
123 |
+ |
124 |
+Filename hash structure |
125 |
+----------------------- |
126 |
+When using the filename hash structure, the distfiles are split |
127 |
+into directories whose names are derived from the hash of distfile |
128 |
+filename. This structure has two parameters: *algorithm name* |
129 |
+and *cutoffs* list. |
130 |
+ |
131 |
+The algorithm name must correspond to a valid Manifest hash name. |
132 |
+An informational list of hashes is included in GLEP 74 [#GLEP74]_, |
133 |
+and the policies for introducing new hashes are covered by GLEP 59 |
134 |
+[#GLEP59]_. |
135 |
+ |
136 |
+The cutoffs list specifies one or more integers separated by colons |
137 |
+(``:``), indicating the number of bits (starting with the most |
138 |
+significant bit) of the hash used to form subsequent subdirectory names. |
139 |
+For example, the list of ``2:4`` would indicate that top-level directory |
140 |
+names are formed using 2 most significant bits of the hash (resulting |
141 |
+in 2² = 4 directories), and each of this directories would have |
142 |
+subdirectories formed using the next 4 bits of the hash (resulting |
143 |
+in 2⁴ = 8 subdirectories each). |
144 |
+ |
145 |
+The exact algorithm for determining the distfile location follows: |
146 |
+ |
147 |
+1. Let the distfile filename be **F**. |
148 |
+ |
149 |
+2. Compute the hash of **F** and store its binary value as **H**. |
150 |
+ |
151 |
+3. For each integer **C** in cutoff list: |
152 |
+ |
153 |
+ a. Take **C** most significant bits of **H** and store them as **V**. |
154 |
+ |
155 |
+ b. Convert **V** into hexadecimal integer, left padded with zeros |
156 |
+ to **C/4** digits (rounded up) and append it to the path, followed |
157 |
+ by the path separator. |
158 |
+ |
159 |
+ c. Shift **H** left **C** bits. |
160 |
+ |
161 |
+4. Finally, append **F** to the obtained path. |
162 |
+ |
163 |
+In particular, note that when using nested directories |
164 |
+the subdirectories do not repeat the hash bits used in parent directory. |
165 |
+ |
166 |
+ |
167 |
+Migrating mirrors to the hashed structure |
168 |
+----------------------------------------- |
169 |
+Since all distfile mirrors sync to the master Gentoo mirror, it should |
170 |
+be enough to perform all the needed changes on the master mirror |
171 |
+and wait for other mirrors to sync. The following procedure |
172 |
+is recommended: |
173 |
+ |
174 |
+1. Include the initial ``layout.conf`` listing only ``flat`` layout. |
175 |
+ |
176 |
+2. Create the new structure alongside the flat structure. Wait for |
177 |
+ mirrors to sync. |
178 |
+ |
179 |
+3. Once all mirrors receive the new structure, update ``layout.conf`` |
180 |
+ to list the ``filename-hash`` structure. |
181 |
+ |
182 |
+4. Once a version of Portage supporting the new structure is stable long |
183 |
+ enough, remove the fallback ``flat`` structure from ``layout.conf`` |
184 |
+ and duplicate distfiles. |
185 |
+ |
186 |
+This implies that during the migration period the distfiles will |
187 |
+be stored duplicated on the mirrors and therefore will occupy twice |
188 |
+as much space. Technically, this could be avoided either by using |
189 |
+hard links or symbolic links. |
190 |
+ |
191 |
+The hard link solution allows us to save space on the master mirror. |
192 |
+Additionally, if ``-H`` option is used by the mirrors it avoids |
193 |
+transferring existing files again. However, this option is known |
194 |
+to be expensive and could cause significant server load. Without it, |
195 |
+all mirrors need to transfer a second copy of all the existing files. |
196 |
+ |
197 |
+The symbolic link solution could be more reliable if we could rely |
198 |
+on mirrors using the ``--links`` rsync option. Without that, symbolic |
199 |
+links are not transferred at all. |
200 |
+ |
201 |
+ |
202 |
+Using hashed structure for local distfiles |
203 |
+------------------------------------------ |
204 |
+The hashed structure defined above could also be used for local distfile |
205 |
+storage as used by the package manager. For this to work, the package |
206 |
+manager authors need to ensure that: |
207 |
+ |
208 |
+a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary |
209 |
+ directory where distfiles specific to the package are linked |
210 |
+ in a flat structure. |
211 |
+ |
212 |
+b. All tools are updated to support the nested structure. |
213 |
+ |
214 |
+c. The package manager provides a tool for users to easily manipulate |
215 |
+ distfiles, in particular to add distfiles for fetch-restricted |
216 |
+ packages into an appropriate subdirectory. |
217 |
+ |
218 |
+For extended compatibility, the package manager may support finding |
219 |
+distfiles in flat and nested structure simultaneously. |
220 |
+ |
221 |
+ |
222 |
+Rationale |
223 |
+========= |
224 |
+Algorithm for splitting distfiles |
225 |
+--------------------------------- |
226 |
+In the original debate that occurred in bug #534528 [#BUG534528]_, |
227 |
+three possible solutions for splitting distfiles were listed: |
228 |
+ |
229 |
+a. using initial portion of filename, |
230 |
+ |
231 |
+b. using initial portion of file hash, |
232 |
+ |
233 |
+c. using initial portion of filename hash. |
234 |
+ |
235 |
+The significant advantage of the filename option was simplicity. With |
236 |
+that solution, the users could easily determine the correct subdirectory |
237 |
+themselves. However, it's significant disadvantage was very uneven |
238 |
+shuffling of data. In particular, the TeΧ Live packages alone count |
239 |
+almost 23500 distfiles and all use a common prefix, making it impossible |
240 |
+to split them further. |
241 |
+ |
242 |
+The alternate option of using file hash has the advantage of having |
243 |
+a more balanced split. Furthermore, since hashes are stored |
244 |
+in Manifests using them is zero-cost. However, this solution has two |
245 |
+significant disadvantages: |
246 |
+ |
247 |
+1. The hash values are unknown for newly-downloaded distfiles, so |
248 |
+ ``repoman`` (or an equivalent tool) would have to use a temporary |
249 |
+ directory before locating the file in appropriate subdirectory. |
250 |
+ |
251 |
+2. User-provided distfiles (e.g. for fetch-restricted packages) with |
252 |
+ hash mismatches would be placed in the wrong subdirectory, |
253 |
+ potentially causing confusing errors. |
254 |
+ |
255 |
+Using filename hashes has proven to provide a similar balance |
256 |
+to using file hashes. Furthermore, since filenames are known up front |
257 |
+this solution does not suffer from the both listed problems. While |
258 |
+hashes need to be computed manually, hashing short string should not |
259 |
+cause any performance problems. |
260 |
+ |
261 |
+.. figure:: glep-0075-extras/by-filename.png |
262 |
+ |
263 |
+ Distribution of distfiles by first character of filenames |
264 |
+ |
265 |
+.. figure:: glep-0075-extras/by-csum.png |
266 |
+ |
267 |
+ Distribution of distfiles by first hex-digit of checksum |
268 |
+ (x --- content checksum, + --- filename checksum) |
269 |
+ |
270 |
+.. figure:: glep-0075-extras/by-csum2.png |
271 |
+ |
272 |
+ Distribution of distfiles by two first hex-digits of checksum |
273 |
+ (x --- content checksum, + --- filename checksum) |
274 |
+ |
275 |
+ |
276 |
+Layout file |
277 |
+----------- |
278 |
+The presence of control file has been suggested in the original |
279 |
+discussion. Its main purpose is to let package managers cleanly handle |
280 |
+the migration and detect how to correctly query the mirrors throughout |
281 |
+it. Furthermore, it makes future changes easier. |
282 |
+ |
283 |
+The format lines specifically mean to hardcode as little about |
284 |
+the actual algorithm as possible. Therefore, we can easily change |
285 |
+the hash used or the exact split structure without having to update |
286 |
+the package managers or even provide a compatibility layout. |
287 |
+ |
288 |
+The file is also open for future extensions to provide additional mirror |
289 |
+metadata. However, no clear use for that has been determined so far. |
290 |
+ |
291 |
+ |
292 |
+Hash algorithm |
293 |
+-------------- |
294 |
+The hash algorithm support is fully deferred to the existing code |
295 |
+in the package managers that is required to handle Manifests. |
296 |
+In particular, it is recommended to reuse one of the hashes that are |
297 |
+used in Manifest entries at the time. This avoids code duplication |
298 |
+and reuses an existing mechanism to handle hash upgrades. |
299 |
+ |
300 |
+During the discussion, it has been pointed that this particular use case |
301 |
+does not require a cryptographically strong hash and a faster algorithm |
302 |
+could be used instead. However, given the short length of hashed |
303 |
+strings performance is not a problem, and speed does not justify |
304 |
+the resulting code duplication. |
305 |
+ |
306 |
+It has also been pointed out that e.g. the BLAKE2 hash family provides |
307 |
+the ability of creating arbitrary length hashes instead of truncating |
308 |
+the standard-length hash. However, not all implementations of BLAKE2 |
309 |
+support that and relying on it could reduce portability for no apparent |
310 |
+gain. |
311 |
+ |
312 |
+ |
313 |
+Backwards Compatibility |
314 |
+======================= |
315 |
+Mirror compatibility |
316 |
+-------------------- |
317 |
+The mirrored files are propagated to other mirrors as opaque directory |
318 |
+structure. Therefore, there are no backwards compatibility concerns |
319 |
+on the mirroring side. |
320 |
+ |
321 |
+Backwards compatibility with existing clients is detailed |
322 |
+in `migrating mirrors to the hashed structure`_ section. Backwards |
323 |
+compatibility with the old clients will be provided by preserving |
324 |
+the flat structure during the transitional period. |
325 |
+ |
326 |
+The new clients will fetch the ``layout.conf`` file to avoid backwards |
327 |
+compatibility concerns in the future. In case of hitting an old mirror, |
328 |
+the package manager will default to the ``flat`` structure. |
329 |
+ |
330 |
+ |
331 |
+Package manager storage compatibility |
332 |
+------------------------------------- |
333 |
+The exact means of preserving backwards compatibility in package manager |
334 |
+storage are left to the package manager authors. However, it is |
335 |
+recommended that package managers continue to support the flat layout |
336 |
+even if it is no longer the default. The package manager may either |
337 |
+continue to read files from this location or automatically move them |
338 |
+to an appropriate subdirectory. |
339 |
+ |
340 |
+ |
341 |
+Reference Implementation |
342 |
+======================== |
343 |
+TODO. |
344 |
+ |
345 |
+ |
346 |
+References |
347 |
+========== |
348 |
+.. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file |
349 |
+ (https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html) |
350 |
+ |
351 |
+.. [#GLEP74] GLEP 74: Full-tree verification using Manifest files: |
352 |
+ Checksum algorithms (informational) |
353 |
+ (https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational) |
354 |
+ |
355 |
+.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications |
356 |
+ (https://www.gentoo.org/glep/glep-0059.html) |
357 |
+ |
358 |
+.. [#BUG534528] Bug 534528 - distfiles should be sorted into subdirectories |
359 |
+ of DISTDIR |
360 |
+ (https://bugs.gentoo.org/534528) |
361 |
+ |
362 |
+ |
363 |
+Copyright |
364 |
+========= |
365 |
+This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 |
366 |
+Unported License. To view a copy of this license, visit |
367 |
+http://creativecommons.org/licenses/by-sa/3.0/. |