On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
> Migrating mirrors to the hashed structure
> -----------------------------------------

> The hard link solution allows us to save space on the master mirror.
> Additionally, if ``-H`` option is used by the mirrors it avoids
> transferring existing files again. However, this option is known
> to be expensive and could cause significant server load. Without it,
> all mirrors need to transfer a second copy of all the existing files.
>
> The symbolic link solution could be more reliable if we could rely
> on mirrors using the ``--links`` rsync option. Without that, symbolic
> links are not transferred at all.

These rsync options might help for mirrors too:
    --compare-dest=DIR      also compare destination files relative to DIR
    --copy-dest=DIR         ... and include copies of unchanged files
    --link-dest=DIR         hardlink to files in DIR when unchanged

> Using hashed structure for local distfiles
> ------------------------------------------
> The hashed structure defined above could also be used for local distfile
> storage as used by the package manager. For this to work, the package
> manager authors need to ensure that:
>
> a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
>    directory where distfiles specific to the package are linked
>    in a flat structure.
>
> b. All tools are updated to support the nested structure.
>
> c. The package manager provides a tool for users to easily manipulate
>    distfiles, in particular to add distfiles for fetch-restricted
>    packages into an appropriate subdirectory.
>
> For extended compatibility, the package manager may support finding
> distfiles in flat and nested structure simultaneously.

Trying nested first and then falling back to flat would make things easy
for users who have to download distfiles for fetch-restricted packages,
because the instructions stay as "move it to /usr/portage/distfiles".
Alternatively, the tool could have a mode that goes through all files in
the base dir and moves each one to where it belongs in the nested tree.
Then you save everything to the same dir and run edist --fix.
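
To make that concrete, here is a rough Python sketch of both ideas: a
lookup that tries nested first and then flat, and a "--fix" style pass
over the base dir. It is only an illustration. The choice of BLAKE2b over
the filename, the two-hex-digit prefix, and the function names are all my
assumptions, not something the text above specifies.

```python
import hashlib
from pathlib import Path
from typing import List, Optional


def subdir_for(filename: str) -> str:
    # Assumption: subdirectory = first two hex digits of the BLAKE2b hash
    # of the *filename* (not of the file contents).
    return hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:2]


def find_distfile(distdir: Path, filename: str) -> Optional[Path]:
    """Look in the nested location first, then fall back to the flat one."""
    candidates = (
        distdir / subdir_for(filename) / filename,  # nested layout
        distdir / filename,                         # old flat layout
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def fix_layout(distdir: Path) -> List[Path]:
    """Move every regular file sitting directly in distdir into its subdir."""
    moved = []
    # sorted() snapshots the listing before anything is moved.
    for entry in sorted(distdir.iterdir()):
        if not entry.is_file():
            continue  # leave the hash subdirectories alone
        target_dir = distdir / subdir_for(entry.name)
        target_dir.mkdir(exist_ok=True)
        target = target_dir / entry.name
        entry.replace(target)  # overwrites an existing nested copy
        moved.append(target)
    return moved
```

With something like this, a user drops the file into the base dir as
before, find_distfile() still finds it there, and a later fix_layout()
run tidies it into the nested tree.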

> Rationale
> =========
> Algorithm for splitting distfiles
> ---------------------------------
> In the original debate that occurred in bug #534528 [#BUG534528]_,
> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity. With
> that solution, the users could easily determine the correct subdirectory
> themselves. However, it's significant disadvantage was very uneven
> shuffling of data. In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.

Is the filename the original upstream one or the renamed one? E.g. with
SRC_URI="http://foo/foo.tar -> bar.tar", will it be bar.tar?

I think I'm in favour of using the initial part of the filename anyway.
Sure, it's not balanced, but it's still a hell of a lot more balanced than
today, and it's really easy.

Another thing I'm wondering is whether we could just use the same dir
layout as the packages themselves. That would fix texlive, since it is made
up of a whole lot of separate packages.
E.g. /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz

There is a problem if many packages use the same distfiles (quite
extensive for SELinux: every single one of the sec-policy/selinux-*
packages has identical distfiles), so I'm not sure how to deal with that.

A per-package layout would also make it easy in future to make the sandbox
restrict access to files outside of that package, if we wanted to do that.

> The alternate option of using file hash has the advantage of having
> a more balanced split. Furthermore, since hashes are stored
> in Manifests using them is zero-cost. However, this solution has two
> significant disadvantages:
>
> 1. The hash values are unknown for newly-downloaded distfiles, so
>    ``repoman`` (or an equivalent tool) would have to use a temporary
>    directory before locating the file in appropriate subdirectory.
>
> 2. User-provided distfiles (e.g. for fetch-restricted packages) with
>    hash mismatches would be placed in the wrong subdirectory,
>    potentially causing confusing errors.

Not just this, but on principle, I also think you should be able to read
an ebuild and compute the URL to download the file from the mirrors
without any extra knowledge (in particular, without having to download
the distfile first).
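
With any filename-based scheme, that computation is plain string
manipulation. A minimal Python sketch, assuming (my placeholders, not
anything decided here) a BLAKE2b hash of the filename, a two-hex-digit
prefix, and a made-up mirror base URL:

```python
import hashlib
from urllib.parse import quote


def mirror_url(mirror_base: str, filename: str, digits: int = 2) -> str:
    """Build the mirror URL for a distfile from its name alone.

    Assumptions: the subdirectory is the first `digits` hex digits of the
    BLAKE2b hash of the filename, and `filename` is whatever name the
    distfile is stored under locally.
    """
    prefix = hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:digits]
    # quote() percent-encodes characters that are not safe in a URL path.
    return f"{mirror_base.rstrip('/')}/{prefix}/{quote(filename)}"


# Hypothetical mirror, for illustration only.
print(mirror_url("https://mirror.example.org/distfiles", "foo-1.0.tar.gz"))
```

Nothing here needs the Manifest or the file contents, which is the
property I care about.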

> Using filename hashes has proven to provide a similar balance
> to using file hashes. Furthermore, since filenames are known up front
> this solution does not suffer from the both listed problems. While
> hashes need to be computed manually, hashing short string should not
> cause any performance problems.
>
> .. figure:: glep-0075-extras/by-filename.png
>
>    Distribution of distfiles by first character of filenames
>
> .. figure:: glep-0075-extras/by-csum.png
>
>    Distribution of distfiles by first hex-digit of checksum
>    (x --- content checksum, + --- filename checksum)
>
> .. figure:: glep-0075-extras/by-csum2.png
>
>    Distribution of distfiles by two first hex-digits of checksum
>    (x --- content checksum, + --- filename checksum)

Do you have an easy way to calculate how big the distfiles are per
category or cat/pkg? I'd be interested to see.
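
One way that might work without touching the distfiles at all is to sum
the sizes recorded in the DIST lines of each package's Manifest. A rough
Python sketch, assuming the usual category/package/Manifest repository
layout and DIST lines of the form "DIST <filename> <size> <hashes...>";
the function names are mine:

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict


def distfile_sizes(repo: Path) -> Dict[str, int]:
    """Sum the sizes of all DIST entries, keyed by category/package."""
    sizes: Dict[str, int] = defaultdict(int)
    for manifest in repo.glob("*/*/Manifest"):
        cat_pkg = f"{manifest.parent.parent.name}/{manifest.parent.name}"
        for line in manifest.read_text(encoding="utf-8").splitlines():
            fields = line.split()
            # Only DIST entries describe distfiles; field 2 is the size.
            if len(fields) >= 3 and fields[0] == "DIST":
                sizes[cat_pkg] += int(fields[2])
    return dict(sizes)


def per_category(sizes: Dict[str, int]) -> Dict[str, int]:
    """Collapse category/package totals into per-category totals."""
    totals: Dict[str, int] = defaultdict(int)
    for cat_pkg, size in sizes.items():
        totals[cat_pkg.split("/")[0]] += size
    return dict(totals)
```

Note that a distfile shared by several packages is counted once per
package here, so the SELinux policy packages would look bigger than they
are on disk.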

> Backwards Compatibility
> =======================
> Mirror compatibility
> --------------------
> The mirrored files are propagated to other mirrors as opaque directory
> structure. Therefore, there are no backwards compatibility concerns
> on the mirroring side.
>
> Backwards compatibility with existing clients is detailed
> in `migrating mirrors to the hashed structure`_ section. Backwards
> compatibility with the old clients will be provided by preserving
> the flat structure during the transitional period.

Even if there was no transition, things wouldn't be terrible, because
Portage would fall back to just downloading from SRC_URI directly
if the mirrors fail.