Gentoo Archives: gentoo-commits

From: "Michał Górny" <mgorny@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] data/glep:glep-mirrors commit in: /
Date: Wed, 07 Feb 2018 13:22:31
Message-Id: 1518009728.e4dc2627c8107339b13e20709125e2d9fc91ffde.mgorny@gentoo
1 commit: e4dc2627c8107339b13e20709125e2d9fc91ffde
2 Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
3 AuthorDate: Wed Feb 7 13:20:45 2018 +0000
4 Commit: Michał Górny <mgorny <AT> gentoo <DOT> org>
5 CommitDate: Wed Feb 7 13:22:08 2018 +0000
6 URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=e4dc2627
7
8 glep-0075: Extend rationale for splitting algorithm
9
10 Extend and refactor the rationale for splitting algorithm. Explicitly
11 state the goals, list all the options that occurred during the ml
12 discussion.
13
14 glep-0075.rst | 116 +++++++++++++++++++++++++++++++++++++++++++++-------------
15 1 file changed, 91 insertions(+), 25 deletions(-)
16
17 diff --git a/glep-0075.rst b/glep-0075.rst
18 index 157514e..00d14c3 100644
19 --- a/glep-0075.rst
20 +++ b/glep-0075.rst
21 @@ -187,43 +187,98 @@ Rationale
22 =========
23 Algorithm for splitting distfiles
24 ---------------------------------
25 -In the original debate that occurred in bug #534528 [#BUG534528]_,
26 -three possible solutions for splitting distfiles were listed:
27 +The possible algorithms were considered with the following goals
28 +in mind:
29
30 -a. using initial portion of filename,
31 +- the number of files in a single directory should not exceed 1000,
32
33 -b. using initial portion of file hash,
34 +- the total size of files in a single directory is not considered
35 + relevant,
36
37 -c. using initial portion of filename hash.
38 +- the solution should preferably be future-proof,
39
40 -The significant advantage of the filename option was simplicity. With
41 -that solution, the users could easily determine the correct subdirectory
42 -themselves. However, it's significant disadvantage was very uneven
43 -shuffling of data. In particular, the TeΧ Live packages alone count
44 -almost 23500 distfiles and all use a common prefix, making it impossible
45 -to split them further.
46 +- moving distfiles should be avoided once it is deployed.
47
48 -The alternate option of using file hash has the advantage of having
49 -a more balanced split. Furthermore, since hashes are stored
50 -in Manifests using them is zero-cost. However, this solution has three
51 -significant disadvantages:
52 +It should also be noted that at the time of writing, the package with
53 +the most distfiles in Gentoo is dev-texlive/texlive-latexextra,
54 +with 8556 distfiles. All of them start with the common
55 +prefix ``texlive-module-``. This prefix is shared by a total
56 +of 23435 distfiles.
57
58 -1. The hash values are unknown for newly-downloaded distfiles, so
59 - ``repoman`` (or an equivalent tool) would have to use a temporary
60 - directory before locating the file in appropriate subdirectory.
61 +In the original debate that occurred in bug #534528 [#BUG534528]_
62 +and the mailing list review of the initial version of this GLEP [#ML1]_,
63 +four fundamental ideas for splitting distfiles were listed:
64 +
65 +a. using initial portion of filename,
66 +
67 +b. using initial portion of file hash,
68 +
69 +c. using initial portion of filename hash,
70 +
71 +d. using package category (and package name).
72 +
73 +The filename idea was to use the first character of the filename,
74 +possibly followed by a longer prefix, a scheme historically
75 +used e.g. by the PyPI Python package hosting. Its main advantage is
76 +simplicity. Users can easily determine the correct subdirectory
77 +just by looking at the distfile name. Sadly, this solution is not only
78 +very uneven but also fails to solve the problem. As mentioned above,
79 +the TeΧ Live packages share a long common prefix that makes it impossible
80 +to split them properly using fixed-length prefixes.
81 +
82 +This idea has been followed by an adaptive proposal by Andrew Barchuk
83 +[#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
84 +mapped to groups by a common prefix but instead each group contains
85 +all files between two prefixes being used (like in a dictionary).
86 +However, it has been pointed out that while this option can provide
87 +very even results initially, it is impossible to predict how it would
88 +be affected by future distfile changes, and there would be a risk of
89 +needing to change the groups in the future. Furthermore, it is
90 +relatively complex and requires explicitly listing or obtaining
91 +the groups in use.
92 +
93 +Another option was to use an initial portion of distfile hashes. Its
94 +main advantage is that cryptographic hash algorithms can provide
95 +a more balanced split with random data. Furthermore, since hashes are
96 +stored in Manifests, using them has no cost for users. However, this
97 +solution has three disadvantages:
98 +
99 +1. Not all files in the distfile tree are covered by package Manifests.
100 + Additional files are injected into the mirrors, and those will
101 + not have a clearly-defined location.
102
103 2. User-provided distfiles (e.g. for fetch-restricted packages) with
104 hash mismatches would be placed in the wrong subdirectory,
105 potentially causing confusing errors.
106
107 -3. Not all files in the distfiles tree are covered by package Manifests
108 - --- there are additional files that are injected into distfiles.
109 +3. The hash values are unknown for newly-downloaded distfiles, so
110 + ``repoman`` (or an equivalent tool) would have to use a temporary
111 + directory before placing the file in the appropriate subdirectory.
112
113 -Using filename hashes has proven to provide a similar balance
114 -to using file hashes. Furthermore, since filenames are known up front
115 -this solution does not suffer from the both listed problems. While
116 -hashes need to be computed manually, hashing short string should not
117 -cause any performance problems.
118 +Using filename hashes has proven to provide a similar balance to using
119 +file hashes. Furthermore, since filenames are known up front, this
120 +solution does not suffer from the problems listed above. While hashes need
121 +to be computed manually, hashing a short string should not cause
122 +any performance problems.
123 +
124 +Jason Zaman has suggested using package categories (and package names)
125 +[#PKGNAME]_. However, this solution has multiple problems:
126 +
127 +a. it does not solve the problem for large packages such as TeΧ Live,
128 +
129 +b. it introduces many unnecessarily small directories,
130 +
131 +c. it requires an explicit knowledge of which package distfiles
132 + belong to,
133 +
134 +d. it does not provide an explicit solution to the problem of distfiles
135 + shared by multiple packages,
136 +
137 +e. it does not provide a solution to the problem of injected distfiles.
138 +
139 +All the options considered, the filename hash solution was selected
140 +as one that solves all the aforementioned problems while introducing
141 +relatively low complexity and being reasonably future-proof.
142
143 .. figure:: glep-0075-extras/by-filename.png
144
145 @@ -327,6 +382,17 @@ References
146 of DISTDIR
147 (https://bugs.gentoo.org/534528)
148
149 +.. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
150 + (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
151 +
152 +.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
153 + for each directory computed in a way to have the files distributed evenly'
154 + (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
155 +
156 +.. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
157 + as the packages themselves'
158 + (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
159 +
160
161 Copyright
162 =========