On Sun, 28.01.2018 at 15:01 +0800, Jason Zaman wrote:
> On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
> > Migrating mirrors to the hashed structure
> > -----------------------------------------
> > The hard link solution allows us to save space on the master mirror.
> > Additionally, if the ``-H`` option is used by the mirrors, it avoids
> > transferring existing files again. However, this option is known
> > to be expensive and could cause significant server load. Without it,
> > all mirrors need to transfer a second copy of all the existing files.
> > 
> > The symbolic link solution could be more reliable if we could rely
> > on mirrors using the ``--links`` rsync option. Without that, symbolic
> > links are not transferred at all.
> 
> These rsync options might help for mirrors too:
>     --compare-dest=DIR      also compare destination files relative to DIR
>     --copy-dest=DIR         ... and include copies of unchanged files
>     --link-dest=DIR         hardlink to files in DIR when unchanged
> 
> > Using hashed structure for local distfiles
> > ------------------------------------------
> > The hashed structure defined above could also be used for local distfile
> > storage as used by the package manager. For this to work, the package
> > manager authors need to ensure that:
> > 
> > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
> >    directory where distfiles specific to the package are linked
> >    in a flat structure.
> > 
> > b. All tools are updated to support the nested structure.
> > 
> > c. The package manager provides a tool for users to easily manipulate
> >    distfiles, in particular to add distfiles for fetch-restricted
> >    packages into an appropriate subdirectory.
> > 
> > For extended compatibility, the package manager may support finding
> > distfiles in flat and nested structure simultaneously.
> 
> Trying nested first and then falling back to flat would make it easy for
> users if they have to download distfiles for fetch-restricted packages,
> because then the instructions stay as "move it to
> /usr/portage/distfiles".
> Alternatively, the tool could have a mode which goes through all files
> in the base directory and moves each to where it should be in the nested
> tree. Then you save everything to the same dir and run edist --fix.

This is really outside the scope, and up to Portage maintainers.
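
For illustration, the nested-first-then-flat lookup described above could be sketched roughly like this. It assumes a bucket scheme of the first two hex digits of the BLAKE2B hash of the filename; both the function and the exact scheme are hypothetical, not something the GLEP or Portage defines:

```python
import hashlib
from pathlib import Path

def distfile_path(distdir: str, filename: str) -> Path:
    """Locate a distfile, trying the hashed (nested) layout first
    and falling back to the historical flat layout."""
    # Assumed bucket scheme: first two hex digits of the BLAKE2B
    # hash of the (possibly renamed) distfile name.
    bucket = hashlib.blake2b(filename.encode()).hexdigest()[:2]
    nested = Path(distdir) / bucket / filename
    if nested.exists():
        return nested
    # Flat fallback keeps "move it to /usr/portage/distfiles" working.
    return Path(distdir) / filename
```

With this ordering, a user-provided file in the flat directory is still found, while a properly bucketed file takes precedence.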

> > Rationale
> > =========
> > Algorithm for splitting distfiles
> > ---------------------------------
> > In the original debate that occurred in bug #534528 [#BUG534528]_,
> > three possible solutions for splitting distfiles were listed:
> > 
> > a. using initial portion of filename,
> > 
> > b. using initial portion of file hash,
> > 
> > c. using initial portion of filename hash.
> > 
> > The significant advantage of the filename option was simplicity. With
> > that solution, the users could easily determine the correct subdirectory
> > themselves. However, its significant disadvantage was a very uneven
> > distribution of data. In particular, the TeX Live packages alone count
> > almost 23500 distfiles and all use a common prefix, making it impossible
> > to split them further.
> 
> Is the filename the original upstream one, or the renamed one? E.g. with
> SRC_URI="http://foo/foo.tar -> bar.tar", will it be bar.tar?

The renamed one. This is what distfiles use already. Otherwise we'd have
a lot of collisions on files named 'v1.2.3.tar.gz'.

> I think I'm in favour of using the initial part of the filename anyway.
> Sure, it's not balanced, but it's still a hell of a lot more balanced
> than today, and it's really easy.

'More balanced' does not mean it solves the problem. If you have one
directory with ~25000 files, and others ranging from almost empty to 4000
files, then you still have a huge problem and a lot of silly
reorganization that looks like a 'good idea that misfired'.
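
The imbalance is easy to demonstrate: filenames sharing a common prefix all land in one first-character bucket, while a filename hash spreads them evenly. The names below are made up for illustration, and the one-hex-digit bucketing is an assumption:

```python
import hashlib
from collections import Counter

# Hypothetical distfile names sharing a common prefix, as TeX Live does.
names = [f"texlive-module-{i}.tar.xz" for i in range(1000)]

# Bucketing by first character puts every file in the same directory...
by_char = Counter(name[0] for name in names)
assert by_char == {"t": 1000}

# ...while bucketing by a filename-hash digit spreads them out.
by_hash = Counter(hashlib.blake2b(name.encode()).hexdigest()[0]
                  for name in names)
assert len(by_hash) == 16  # every hex bucket gets a share
```

This mirrors the effect shown in the distribution figures quoted further down.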

> Another thing I'm wondering is if we can just use the same dir layout as
> the packages themselves. That would fix texlive, since it has a whole
> lot of separate packages, e.g.
> /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz.

Then you're replacing the problem of many files in a single directory
with the problem of a huge number of almost-empty directories. In other
words, you replace a performance problem of one kind with a performance
problem of another kind, plus a potential inode problem...

> There is a problem if many packages use the same distfiles (quite
> extensive for SELinux; every single one of the sec-policy/selinux-*
> packages has identical distfiles), so I'm not sure how to deal with it.

...and yes, there is the problem that we have a lot of distfiles shared
between different packages. Also, those distfiles are frequently huge
(think of a big upstream tarball being split into N packages in Gentoo,
e.g. Qt).

> This would also make it easy in future to make the sandbox restrict
> access to files outside of that package, if we wanted to do that.

I don't see how that's relevant at all.

> > The alternate option of using the file hash has the advantage
> > of a more balanced split. Furthermore, since hashes are stored
> > in Manifests, using them is zero-cost. However, this solution has two
> > significant disadvantages:
> > 
> > 1. The hash values are unknown for newly-downloaded distfiles, so
> >    ``repoman`` (or an equivalent tool) would have to use a temporary
> >    directory before locating the file in the appropriate subdirectory.
> > 
> > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> >    hash mismatches would be placed in the wrong subdirectory,
> >    potentially causing confusing errors.
> 
> Not just this, but on principle, I also think you should be able to read
> an ebuild and compute the URL to download the file from the mirrors
> without any extra knowledge (especially without downloading
> the distfile).
> 
> > Using filename hashes has proven to provide a similar balance
> > to using file hashes. Furthermore, since filenames are known up front,
> > this solution does not suffer from either of the listed problems. While
> > the hashes need to be computed manually, hashing a short string should
> > not cause any performance problems.
> > 
> > .. figure:: glep-0075-extras/by-filename.png
> > 
> >    Distribution of distfiles by first character of filenames
> > 
> > .. figure:: glep-0075-extras/by-csum.png
> > 
> >    Distribution of distfiles by first hex-digit of checksum
> >    (x --- content checksum, + --- filename checksum)
> > 
> > .. figure:: glep-0075-extras/by-csum2.png
> > 
> >    Distribution of distfiles by first two hex-digits of checksum
> >    (x --- content checksum, + --- filename checksum)
> 
> Do you have an easy way to calculate how big the distfiles are per
> category or cat/pkg? I'd be interested to see.

Easy, no. But it should be easy to write a script that does that.
The sources for my stuff are at:

https://github.com/mgorny/manifest-distfile-stats

Except that most of it won't be useful for this case, since it works
on combined and deduplicated Manifests.

If you want to do that, please also include a graph of total file sizes,
and mark how much of that is duplicated between groups.
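
A per-group size script along those lines might start from something like this. It is only a sketch: it aggregates sizes from Manifest ``DIST <name> <size> ...`` lines, assumes the same two-hex-digit filename-hash bucketing discussed above, and leaves out the Manifest walking and duplication marking:

```python
import hashlib
from collections import defaultdict

def sizes_per_bucket(manifest_lines):
    """Sum distfile sizes per first-character bucket and per
    filename-hash bucket, from Manifest 'DIST <name> <size> ...' lines."""
    by_char = defaultdict(int)
    by_hash = defaultdict(int)
    for line in manifest_lines:
        parts = line.split()
        # Only DIST entries carry distfile names and sizes.
        if len(parts) < 3 or parts[0] != "DIST":
            continue
        name, size = parts[1], int(parts[2])
        by_char[name[0].lower()] += size
        # Assumed scheme: first two hex digits of the filename hash.
        by_hash[hashlib.blake2b(name.encode()).hexdigest()[:2]] += size
    return by_char, by_hash
```

Feeding it all Manifests in the tree (without deduplication) would give the per-group totals; comparing against a deduplicated run would show the shared-distfile overlap.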

> > Backwards Compatibility
> > =======================
> > Mirror compatibility
> > --------------------
> > The mirrored files are propagated to other mirrors as an opaque
> > directory structure. Therefore, there are no backwards compatibility
> > concerns on the mirroring side.
> > 
> > Backwards compatibility with existing clients is detailed
> > in the `migrating mirrors to the hashed structure`_ section. Backwards
> > compatibility with the old clients will be provided by preserving
> > the flat structure during the transitional period.
> 
> Even if there were no transition, things wouldn't be terrible, because
> Portage would fall back to downloading from SRC_URI directly
> if the mirrors fail.

-- 
Best regards,
Michał Górny