Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
Date: Sun, 28 Jan 2018 09:10:54
Message-Id: 1517130643.1270.11.camel@gentoo.org
In Reply to: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure by Jason Zaman
1 On Sun, 2018-01-28 at 15:01 +0800, Jason Zaman
2 wrote:
3 > On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
4 > > Migrating mirrors to the hashed structure
5 > > -----------------------------------------
6 > > The hard link solution allows us to save space on the master mirror.
7 > > Additionally, if the ``-H`` option is used by the mirrors, it avoids
8 > > transferring existing files again. However, this option is known
9 > > to be expensive and could cause significant server load. Without it,
10 > > all mirrors need to transfer a second copy of all the existing files.
11 > >
12 > > The symbolic link solution could be more reliable if we could rely
13 > > on mirrors using the ``--links`` rsync option. Without that, symbolic
14 > > links are not transferred at all.
15 >
16 > These rsync options might help for mirrors too:
17 > --compare-dest=DIR also compare destination files relative to DIR
18 > --copy-dest=DIR ... and include copies of unchanged files
19 > --link-dest=DIR hardlink to files in DIR when unchanged
20 >
21 > > Using hashed structure for local distfiles
22 > > ------------------------------------------
23 > > The hashed structure defined above could also be used for local distfile
24 > > storage as used by the package manager. For this to work, the package
25 > > manager authors need to ensure that:
26 > >
27 > > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
28 > > directory where distfiles specific to the package are linked
29 > > in a flat structure.
30 > >
31 > > b. All tools are updated to support the nested structure.
32 > >
33 > > c. The package manager provides a tool for users to easily manipulate
34 > > distfiles, in particular to add distfiles for fetch-restricted
35 > > packages into an appropriate subdirectory.
36 > >
37 > > For extended compatibility, the package manager may support finding
38 > > distfiles in both the flat and nested structures simultaneously.
39 >
40 > Trying nested first, then falling back to flat, would make it easy for
41 > users if they have to download distfiles for fetch-restricted packages,
42 > because then the instructions stay as "move it to
43 > /usr/portage/distfiles".
44 > Alternatively, the tool could have a mode which goes through all
45 > files in the base dir and moves them to where they should be in the nested
46 > tree. Then you save everything to the same dir and run edist --fix.
47
48 This is really outside the scope, and up to Portage maintainers.
49
50 > > Rationale
51 > > =========
52 > > Algorithm for splitting distfiles
53 > > ---------------------------------
54 > > In the original debate that occurred in bug #534528 [#BUG534528]_,
55 > > three possible solutions for splitting distfiles were listed:
56 > >
57 > > a. using initial portion of filename,
58 > >
59 > > b. using initial portion of file hash,
60 > >
61 > > c. using initial portion of filename hash.
62 > >
63 > > The significant advantage of the filename option was simplicity. With
64 > > that solution, the users could easily determine the correct subdirectory
65 > > themselves. However, its significant disadvantage was very uneven
66 > > shuffling of data. In particular, the TeΧ Live packages alone count
67 > > almost 23500 distfiles and all use a common prefix, making it impossible
68 > > to split them further.
69 >
70 > Is the filename the original upstream one or the renamed one? E.g. for
71 > SRC_URI="http://foo/foo.tar -> bar.tar", will it be bar.tar?
72
73 Renamed one. This is what distfiles use already. Otherwise we'd have
74 a lot of collisions on files named 'v1.2.3.tar.gz'.
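
As a rough illustration of why the renamed filename is the only input needed, here is a minimal sketch. The hash algorithm (BLAKE2b) and the two-hex-digit prefix are assumptions made for this example, not something fixed in this thread, and ``hashed_subdir`` and ``mirror_path`` are hypothetical names.

```python
import hashlib


def hashed_subdir(filename: str, hex_digits: int = 2) -> str:
    """Return the bucket name derived from the (renamed) distfile name.

    Assumption: BLAKE2b over the UTF-8 filename, keeping the first
    `hex_digits` hex characters. The thread does not fix these details.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_digits]


def mirror_path(filename: str) -> str:
    """Build the relative path of a distfile inside the hashed layout."""
    return f"{hashed_subdir(filename)}/{filename}"


# For SRC_URI="http://foo/foo.tar -> bar.tar" the input is "bar.tar",
# so the path can be computed from the ebuild alone, before any download.
print(mirror_path("bar.tar"))
```

Because only the name is hashed, two upstream files that are both called 'v1.2.3.tar.gz' but renamed differently end up as distinct entries, exactly as they do in the flat layout today.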
75
76 > I think I'm in favour of using the initial part of the filename anyway.
77 > Sure, it's not balanced, but it's still a hell of a lot more balanced than
78 > today, and it's really easy.
79
80 'More balanced' does not mean it solves the problem. If you have one
81 directory with ~25000 files, and others between almost empty and 4000,
82 then you still have a huge problem and a lot of silly reorganization
83 that looks like a 'good idea that misfired'.
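
The imbalance is easy to reproduce with a small sketch. The file names below are made up to mimic a large family of distfiles sharing one prefix, and BLAKE2b with a single hex digit is an assumption chosen only for the demonstration.

```python
import hashlib
from collections import Counter


def by_first_char(filenames):
    """Count files per bucket when the bucket is the first character."""
    return Counter(name[0] for name in filenames)


def by_filename_hash(filenames):
    """Count files per bucket when the bucket is the first hex digit
    of a hash of the filename (BLAKE2b is an assumption here)."""
    return Counter(
        hashlib.blake2b(name.encode("utf-8")).hexdigest()[0]
        for name in filenames
    )


# Hypothetical names: 1000 distfiles that all share a common prefix.
names = [f"texlive-module-{i}.tar.xz" for i in range(1000)]

print("largest first-char bucket:", max(by_first_char(names).values()))
print("largest hash bucket:", max(by_filename_hash(names).values()))
```

With the first-character split, every one of these files lands in the same bucket no matter how many characters of the shared prefix are used, while the hash split spreads them over up to 16 buckets.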
84
85 > Another thing I'm wondering is if we can just use the same dir layout as
86 > the packages themselves. That would fix texlive since it has a whole lot
87 > of separate packages, e.g. /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz
88
89 Then you're replacing the problem of many files in a single directory
90 with the problem of a huge number of almost empty directories. In other
91 words, you replace a performance problem of one kind with a performance
92 problem of another kind, plus a potential inode problem...
93
94 > There is a problem if many packages use the same distfiles (quite
95 > extensive for SELinux: every single one of the sec-policy/selinux-* packages
96 > has identical distfiles), so I'm not sure how to deal with it.
97
98 ...and yes, there is the problem that we have a lot of distfiles shared between
99 different packages. Also, frequently those distfiles are actually huge
100 (think of a big upstream tarball being split into N packages in Gentoo,
101 e.g. Qt).
102
103 > This would also make it easy in the future to make the sandbox restrict
104 > access to files outside of that package if we wanted to do that.
105
106 I don't see how that's relevant at all.
107
108 > > The alternate option of using file hash has the advantage of having
109 > > a more balanced split. Furthermore, since hashes are stored
110 > > in Manifests using them is zero-cost. However, this solution has two
111 > > significant disadvantages:
112 > >
113 > > 1. The hash values are unknown for newly-downloaded distfiles, so
114 > > ``repoman`` (or an equivalent tool) would have to use a temporary
115 > > directory before placing the file in the appropriate subdirectory.
116 > >
117 > > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
118 > > hash mismatches would be placed in the wrong subdirectory,
119 > > potentially causing confusing errors.
120 >
121 > Not just this, but on principle, I also think you should be able to read
122 > an ebuild and compute the url to download the file from the mirrors
123 > without any extra knowledge (especially downloading the distfile).
124 >
125 > > Using filename hashes has proven to provide a similar balance
126 > > to using file hashes. Furthermore, since filenames are known up front,
127 > > this solution does not suffer from either of the listed problems. While
128 > > hashes need to be computed manually, hashing a short string should not
129 > > cause any performance problems.
130 > >
131 > > .. figure:: glep-0075-extras/by-filename.png
132 > >
133 > > Distribution of distfiles by first character of filenames
134 > >
135 > > .. figure:: glep-0075-extras/by-csum.png
136 > >
137 > > Distribution of distfiles by first hex-digit of checksum
138 > > (x --- content checksum, + --- filename checksum)
139 > >
140 > > .. figure:: glep-0075-extras/by-csum2.png
141 > >
142 > > Distribution of distfiles by two first hex-digits of checksum
143 > > (x --- content checksum, + --- filename checksum)
144 >
145 > Do you have an easy way to calculate how big the distfiles are per
146 > category or cat/pkg? I'd be interested to see.
147
148 Easy, no. But it should be easy to write a script that does that.
149 The sources for my stuff are at:
150
151 https://github.com/mgorny/manifest-distfile-stats
152
153 Except most of it won't be useful for that case since it works
154 on combined and deduplicated Manifests.
155
156 If you want to do that, please also include a graph of total file sizes,
157 and mark how much of that is duplicated between groups.
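
A starting point for such a script might look like the sketch below. It is not taken from the repository linked above. It assumes the usual Manifest line shape ``DIST <filename> <size> <hash name> <hash value> ...``, and the names ``parse_dist_entries`` and ``sizes_per_group`` are hypothetical. Reading the Manifest files from a repository checkout is left out.

```python
from collections import defaultdict


def parse_dist_entries(manifest_text):
    """Return {filename: size_in_bytes} for the DIST lines of one Manifest."""
    entries = {}
    for line in manifest_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "DIST":
            entries[fields[1]] = int(fields[2])
    return entries


def sizes_per_group(manifests):
    """Compute (total_bytes, shared_bytes) for every group.

    `manifests` maps a group name (a category or cat/pkg) to Manifest
    text. `shared_bytes` counts files that also appear in another group.
    """
    per_group = {g: parse_dist_entries(t) for g, t in manifests.items()}

    # Record which groups reference each distfile.
    owners = defaultdict(set)
    for group, entries in per_group.items():
        for name in entries:
            owners[name].add(group)

    result = {}
    for group, entries in per_group.items():
        total = sum(entries.values())
        shared = sum(s for n, s in entries.items() if len(owners[n]) > 1)
        result[group] = (total, shared)
    return result


# Made-up Manifest contents, only to show the call.
example = {
    "dev-qt/qtcore": "DIST qt.tar.xz 100 BLAKE2B aa SHA512 bb\n"
                     "DIST core.patch 5 BLAKE2B cc SHA512 dd\n",
    "dev-qt/qtgui": "DIST qt.tar.xz 100 BLAKE2B aa SHA512 bb\n",
}
print(sizes_per_group(example))
```

The second number in each pair is the part to mark as duplicated between groups in the graph.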
158
159 > > Backwards Compatibility
160 > > =======================
161 > > Mirror compatibility
162 > > --------------------
163 > > The mirrored files are propagated to other mirrors as opaque directory
164 > > structure. Therefore, there are no backwards compatibility concerns
165 > > on the mirroring side.
166 > >
167 > > Backwards compatibility with existing clients is detailed
168 > > in the `migrating mirrors to the hashed structure`_ section. Backwards
169 > > compatibility with the old clients will be provided by preserving
170 > > the flat structure during the transitional period.
171 >
172 > Even if there was no transition, things wouldn't be terrible because
173 > Portage would fall back to just downloading from SRC_URI directly
174 > if the mirrors fail.
175 >
176 >
177
178 --
179 Best regards,
180 Michał Górny