commit: e4dc2627c8107339b13e20709125e2d9fc91ffde
Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
AuthorDate: Wed Feb 7 13:20:45 2018 +0000
Commit: Michał Górny <mgorny <AT> gentoo <DOT> org>
CommitDate: Wed Feb 7 13:22:08 2018 +0000
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=e4dc2627

glep-0075: Extend rationale for splitting algorithm

Extend and refactor the rationale for splitting algorithm. Explicitly
state the goals, list all the options that occurred during the ml
discussion.

 glep-0075.rst | 116 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 91 insertions(+), 25 deletions(-)

diff --git a/glep-0075.rst b/glep-0075.rst
index 157514e..00d14c3 100644
--- a/glep-0075.rst
+++ b/glep-0075.rst
@@ -187,43 +187,98 @@ Rationale
 =========
 Algorithm for splitting distfiles
 ---------------------------------
-In the original debate that occurred in bug #534528 [#BUG534528]_,
-three possible solutions for splitting distfiles were listed:
+The possible algorithms were considered with the following goals
+in mind:

-a. using initial portion of filename,
+- the number of files in a single directory should not exceed 1000,

-b. using initial portion of file hash,
+- the total size of files in a single directory is not considered
+  relevant,

-c. using initial portion of filename hash.
+- the solution should preferably be future-proof,

-The significant advantage of the filename option was simplicity. With
-that solution, the users could easily determine the correct subdirectory
-themselves. However, it's significant disadvantage was very uneven
-shuffling of data. In particular, the TeΧ Live packages alone count
-almost 23500 distfiles and all use a common prefix, making it impossible
-to split them further.
+- moving distfiles should be avoided once the solution is deployed.

-The alternate option of using file hash has the advantage of having
-a more balanced split. Furthermore, since hashes are stored
-in Manifests using them is zero-cost. However, this solution has three
-significant disadvantages:
+It should also be noted that at the time of writing the package with
+the most distfiles in Gentoo is dev-texlive/texlive-latexextra,
+with 8556 distfiles. All of them start with a common
+prefix of ``texlive-module-``. This specific prefix is used by a total
+of 23435 distfiles.

-1. The hash values are unknown for newly-downloaded distfiles, so
-   ``repoman`` (or an equivalent tool) would have to use a temporary
-   directory before locating the file in appropriate subdirectory.
+In the original debate that occurred in bug #534528 [#BUG534528]_
+and the mailing list review of the initial version of this GLEP [#ML1]_,
+four fundamental ideas for splitting distfiles were listed:
+
+a. using initial portion of filename,
+
+b. using initial portion of file hash,
+
+c. using initial portion of filename hash,
+
+d. using package category (and package name).
+
+The initial filename idea was to use the first character of the filename,
+possibly followed by a longer part, an approach historically
+used e.g. by PyPI Python package hosting. Its main advantage is
+simplicity. The users can easily determine the correct subdirectory
+by just looking at the distfile name. Sadly, this solution is not only
+very uneven but also does not solve the problem. As mentioned above,
+the TeΧ Live packages share a long common prefix that makes it impossible
+to split them properly among other packages using fixed-length prefixes.
+
+This idea has been followed by an adaptive proposal by Andrew Barchuk
+[#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
+mapped to groups by a common prefix but instead each group contains
+all files that sort between two boundary prefixes (as in a dictionary).
+However, it has been pointed out that while this option can provide
+very even results initially, it is impossible to predict how it would
+be affected by future distfile changes and there would be a risk of
+needing to change the groups in the future. Furthermore, it is
+relatively complex and requires explicitly listing or obtaining
+the groups in use.
+
+Another option was to use an initial portion of distfile hashes. Its
+main advantage is that cryptographic hash algorithms can provide
+a more balanced split, since their output behaves like random data.
+Furthermore, since hashes are stored in Manifests, using them has
+no cost for users. However, this solution has three disadvantages:
+
+1. Not all files in the distfile tree are covered by package Manifests.
+   Additional files are injected into the mirrors, and those will
+   not have a clearly-defined location.

 2. User-provided distfiles (e.g. for fetch-restricted packages) with
    hash mismatches would be placed in the wrong subdirectory,
    potentially causing confusing errors.

-3. Not all files in the distfiles tree are covered by package Manifests
-   --- there are additional files that are injected into distfiles.
+3. The hash values are unknown for newly-downloaded distfiles, so
+   ``repoman`` (or an equivalent tool) would have to use a temporary
+   directory before placing the file in the appropriate subdirectory.

-Using filename hashes has proven to provide a similar balance
-to using file hashes. Furthermore, since filenames are known up front
-this solution does not suffer from the both listed problems. While
-hashes need to be computed manually, hashing short string should not
-cause any performance problems.
+Using filename hashes has proven to provide a similar balance to using
+file hashes. Furthermore, since filenames are known up front, this
+solution does not suffer from the listed problems. While hashes need
+to be computed explicitly, hashing a short string should not cause
+any performance problems.
+
+Jason Zaman has suggested using package categories (and package names)
+[#PKGNAME]_. However, this solution has multiple problems:
+
+a. it does not solve the problem for large packages such as TeΧ Live,
+
+b. it introduces many unnecessarily small directories,
+
+c. it requires explicit knowledge of which package each distfile
+   belongs to,
+
+d. it does not provide an explicit solution to the problem of distfiles
+   shared by multiple packages,
+
+e. it does not provide a solution to the problem of injected distfiles.
+
+All the options considered, the filename hash solution was selected
+as the one that solves all the aforementioned problems while introducing
+relatively low complexity and being reasonably future-proof.

 .. figure:: glep-0075-extras/by-filename.png

@@ -327,6 +382,17 @@ References
    of DISTDIR
    (https://bugs.gentoo.org/534528)

+.. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
+   (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
+
+.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
+   for each directory computed in a way to have the files distributed evenly'
+   (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
+
+.. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
+   as the packages themselves'
+   (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
+

 Copyright
 =========
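
Illustration (not part of the commit above): a minimal sketch of why the
rationale prefers filename hashes (option c) over filename prefixes
(option a). It groups synthetic distfile names both ways and reports the
largest resulting directory. Several things here are assumptions made for
the example only and are not stated in this diff: the hash function
(BLAKE2b from Python's hashlib), the use of two leading hex digits as the
directory name, the helper names, and the generated filenames, which merely
mimic the ``texlive-module-`` prefix and the count of 23435 quoted above.

```python
import hashlib
from collections import Counter


def prefix_bucket(filename: str, length: int = 1) -> str:
    """Option (a): directory named after the first `length` characters."""
    return filename[:length]


def filename_hash_bucket(filename: str, hex_digits: int = 2) -> str:
    """Option (c): directory named after the leading hex digits of a hash
    of the filename itself (not of the file contents).

    BLAKE2b and the two-digit cutoff are assumptions for this sketch.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_digits]


def largest_bucket(filenames, bucket_func) -> int:
    """Return the number of files in the most crowded directory."""
    counts = Counter(bucket_func(name) for name in filenames)
    return max(counts.values())


# Synthetic names sharing the common prefix mentioned in the rationale.
names = [f"texlive-module-pkg{i}.tar.xz" for i in range(23435)]

# Every name starts with "t", so one directory receives all the files.
print("prefix:", largest_bucket(names, prefix_bucket))

# Hashing spreads the same names over up to 256 directories.
print("filename hash:", largest_bucket(names, filename_hash_bucket))
```

With a one-character prefix all 23435 names land in a single directory,
which is far above the stated goal of 1000 files. The hash-based grouping
needs only the filename, so it can be computed before the file is fetched
and does not depend on Manifest data.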