Gentoo Archives: gentoo-portage-dev

From: Alec Warner <antarus@g.o>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format
Date: Sun, 11 Nov 2018 13:43:36
Message-Id: CAAr7Pr-7_SQmAD79ovrE9+yqpB=fk6snVP0m7M_e0m+eC6HSxw@mail.gmail.com
In Reply to: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format by "Michał Górny"
On Sun, Nov 11, 2018 at 3:29 AM Michał Górny <mgorny@g.o> wrote:

> On Sat, 2018-11-10 at 09:37 -0500, Alec Warner wrote:
> > On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <mgorny@g.o> wrote:
> >
> > > Hi, everyone.
> > >
> > > Gentoo's tbz2/xpak package format is quite old. We've made a few
> > > incompatible changes in the past (most notably, allowing non-bzip2
> > > compression and multi-instance naming) but the core design has stayed
> > > the same. I think we should consider changing it, for the reasons
> > > outlined below.
> > >
> > > A rough description of the format can be found in xpak(5). Basically,
> > > it's a regular compressed tarball with a binary metadata blob appended
> > > to the end. As such, it looks like a regular compressed tarball to
> > > compression tools (with some ignored junk at the end). The metadata
> > > is an entirely custom format and needs dedicated tools to manipulate.
> > >
> > >
> > > The current format has a few advantages that would probably be worth
> > > preserving:
> > >
> > > + The binary package is a single flat file.
> > >
> > > + It is reasonably compatible with a regular compressed tarball, so
> > > users can unpack it using standard tools (except for the metadata).
> > >
> > > + The metadata is uncompressed and can be quickly found without
> > > touching the compressed data.
> > >
> > > + The metadata can be updated (e.g. as a result of a pkgmove) without
> > > touching the compressed data.
> > >
> > >
> > > However, it has a few disadvantages as well:
> > >
> > > - The metadata is an entirely custom binary format, requiring
> > > dedicated tools to read or edit.
> > >
> > > - The metadata format relies on the customary behavior of compression
> > > tools that ignore junk following the compressed data.
> > >
> >
> > I agree this is a problem in theory, but I haven't seen it as a
> > problem in practice. Have you observed any problems around this setup?
>
> Historically, one of the parallel compressor variants did not support
> this.
>
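For anyone who wants to see that "customary behavior" in action, here is a quick illustration with GNU gzip; other decompressors are free to reject the junk outright, which is exactly the portability worry:

```shell
# GNU gzip decompresses the valid stream and merely warns about the
# trailing junk (on stderr, with a non-zero exit status).
printf 'hello\n' | gzip > demo.gz
printf 'ARBITRARY JUNK' >> demo.gz
gzip -dc demo.gz || true   # stdout still contains only "hello"
```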
> > >
> > > - By placing the metadata at the end of the file, we make it rather
> > > hard to read the metadata from a remote location (via FTP, HTTP)
> > > without fetching the whole file. [NB: it's technically possible but
> > > probably not worth the effort]
> >
> >
> > > - By requiring the custom format to be at the end of the file, we
> > > make it impossible to trivially cover it with an OpenPGP signature
> > > without introducing another custom format.
> > >
> >
> > It's trivial to cover with a detached sig, no?
> >
> >
> > >
> > > - While the format might allow for some extensibility, it's rather an
> > > evolutionary dead end.
> > >
> >
> > I'm not even sure how to quantify this; it just sounds like your
> > subjective opinion (which is fine, but it's not factual).
> >
> >
> > >
> > >
> > > I think the key points of the new format should be:
> > >
> > > 1. It should reuse common file formats as much as possible, inventing
> > > as little custom code as possible.
> > >
> > > 2. It should allow for easy introspection and editing by users
> > > without dedicated tools.
> > >
> >
> > I'm less confident in the editing use cases; do users edit their
> > binpkgs on a regular basis?
>
> It's useful for debugging. I had to use hexedit on xpak
> in the past. Believe me, it's nowhere close to pleasant.
>

> > > 3. The metadata should allow for lookup without fetching the whole
> > > binary package.
> > >
> > > 4. The format should allow for some extensions without having to
> > > reinvent the wheel every time.
> > >
> > > 5. It would be nice to preserve the existing advantages.
> > >
> > >
> > > My proposal
> > > ===========
> > >
> > > Basic format
> > > ------------
> > > The base of the format is a regular compressed tarball. There's no
> > > junk appended to it; instead, the metadata is stored inside it as
> > > /var/db/pkg/${PF}. The contents are as compatible with the actual vdb
> > > format as possible.
> > >
> >
> > Just to clarify: you are suggesting we store the metadata inside the
> > contents of the binary package itself (e.g. where the other files that
> > get merged to the liveFS are)? What about collisions?
> >
> > E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine
> > that already has 'machine-images/gentoo-disk-image-1.2.3' installed;
> > won't it overwrite files in the VDB at qmerge time?
>
> Portage will obviously move the files out and process them as metadata.
> The idea is precisely to use a directory that can't normally be part
> of binary packages, so it can't cause collisions with real files (even
> if they're very unlikely to ever happen).
>
> > > This has the following advantages:
> > >
> > > + The binary package is still stored as a single file.
> > >
> > > + It uses a standard compressed .tar format, with minimal
> > > customization.
> > >
> > > + The user can easily inspect and modify the packages with standard
> > > tools (tar and the compressor).
> > >
> > > + If we can maintain a reasonable level of vdb compatibility, the
> > > user can even emergency-install a package without causing too much
> > > hassle (as it will be recorded in vdb); ideally Portage would detect
> > > this vdb entry and support fixing the install afterwards.
> > >
> >
> > I'm not certain this is really desired.
>
> Are you saying it's better that the user emergency-installs a package
> without recording it in vdb, and ends up with a mess of collisions
> and untracked files?
>

> Just because you don't like some use case doesn't mean it's not gonna
> happen. Either you prepare for it and make the best of it, or you
> pretend it's not gonna happen and cause extra pain to users.
>
I would argue for splitting the requirements into 3 bands:

1) Must do
2) Should do
3) Nice to have

To me, manually unpacking a tarball and having it recorded in the VDB is
a 'nice to have' feature. If we can make it work, great.
I tend to lean with Rich here that recording the data in-band is risky.
I think there is also this premise that binpkgs can 'maintain VDB
compatibility': I could make a binpkg, wait 2 years, then install it,
and we have to make sure that everything still works.

IMHO it's a pretty high cost to pay (tight coupling) for what, to me, is
a nice-to-have feature.


>
> > > Optimizing for easy recognition
> > > -------------------------------
> > > In order to make it possible for magic-based tools such as file(1)
> > > to easily distinguish Gentoo binary packages from regular tarballs,
> > > we could (ab)use the volume label field, e.g.:
> > >
> > > $ tar -V 'gpkg: app-foo/bar-1' -c ...
> > >
> > > This will add a volume label as the first file entry inside the
> > > tarball, which does not affect extracting but can be trivially
> > > matched via magic rules.
> > >
> > > Note: this is meant as a method for fast binary package recognition;
> > > I don't think we should reject (hand-modified) binary packages that
> > > lack this label.
> > >
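The label trick is easy to try with stock GNU tar; a small sketch (the paths and label are made up):

```shell
# Create a tarball with a GNU volume label; the label rides along as the
# first entry and is ignored on extraction.
mkdir -p image/usr/bin && echo demo > image/usr/bin/bar
tar -czf bar-1.tar.gz -V 'gpkg: app-foo/bar-1' -C image .

# The verbose listing shows the label as a "V" (volume header) record:
tar -tvf bar-1.tar.gz | head -n 1
```

Since the label occupies the name field of the very first 512-byte tar header, a file(1) magic rule only has to look at a fixed offset after decompression to match the `gpkg: ` prefix.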
> > >
> > > Optimizing for metadata reading/manipulation performance
> > > --------------------------------------------------------
> > > The main problem with using a single tarball for both metadata and
> > > data is that normally you'd have to decompress everything to reliably
> > > unpack the metadata, and recompress everything to update it. This
> > > problem can be addressed by a few optimization tricks.
> > >
> >
> > These performance goals seem a little ill-defined.
> >
> > 1) Where are users reporting slowness in binpkg operations?
> > 2) What is the cause of the slowness?
>
> Those are optimizations to avoid slowness compared to the current
> format. The main use case is recreating the package index, which
> requires rereading the metadata of all binary packages.


> > Like I could easily see a potential user with many large binpkgs, and
> > the current implementation causing them issues because they have to
> > decompress and seek a bunch to read the metadata out of their 1.2 GB
> > binpkg. But I'm pretty sure this isn't most users.
> >
> >
> > >
> > > Firstly, all metadata files are packed into the archive before data
> > > files. With a slightly customized unpacker, we can stop decompressing
> > > as soon as we're past the metadata and avoid decompressing the whole
> > > archive. This will also make it possible to read metadata from remote
> > > files without fetching far past the compressed metadata block.
> > >
> >
> > So this seems to basically go against your goal of simple common
> > tooling?
>
> No. My goal is to make it compatible with simple common tooling. You
> can still use the simple tooling to read/write them. The optimized
> tools are only needed to efficiently handle special use cases.
>
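For what it's worth, even stock GNU tar can exploit metadata-first ordering for a single-file lookup: with --occurrence it stops reading the archive at the first match, so roughly only the leading part of the stream gets decompressed. A sketch with hypothetical paths:

```shell
# Build a demo package with the metadata entries packed first.
mkdir -p pkg/var/db/pkg/app-foo/bar-1 pkg/usr/bin
echo app-foo > pkg/var/db/pkg/app-foo/bar-1/CATEGORY
echo big-binary > pkg/usr/bin/bar
tar -C pkg -czf bar-1.tar.gz var/db/pkg usr

# --occurrence makes GNU tar stop after the first match, so a metadata
# file near the front is read without scanning the rest of the archive.
tar -xOf bar-1.tar.gz --occurrence=1 var/db/pkg/app-foo/bar-1/CATEGORY
```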
> > > Secondly, if we're up for some more tricks, we could technically
> > > split the tarball into metadata and data blocks compressed
> > > separately. This will need a bit of archiver customization, but it
> > > will make it possible to decompress the metadata part without even
> > > touching the compressed data, and to replace it without recompressing
> > > the data.
> > >
> > > What's important is that both proposed tricks maintain backwards
> > > compatibility with regular compressed tarballs. That is, the user
> > > will still be able to extract it with regular archiving tools.
> >
> >
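The second trick works because concatenated xz (and gzip) streams decompress back into one contiguous byte stream, so a single tar archive can be cut at any offset and the halves compressed independently. A sketch; a real packer would cut exactly at the end of the metadata members rather than at the arbitrary offset used here:

```shell
mkdir -p pkg/var/db/pkg/app-foo/bar-1 pkg/usr/bin
echo app-foo > pkg/var/db/pkg/app-foo/bar-1/CATEGORY
echo payload > pkg/usr/bin/bar
tar -C pkg -cf pkg.tar .          # one uncompressed tar stream

# Split the raw tar stream and compress each half as its own xz stream.
off=2048
head -c "$off" pkg.tar | xz > part1.xz
tail -c +"$((off + 1))" pkg.tar | xz > part2.xz
cat part1.xz part2.xz > pkg.tar.xz

# Stock tools still read the whole archive across the stream boundary.
tar -tJf pkg.tar.xz
```

Replacing the metadata then only means recompressing part1, and a metadata-only reader never has to decompress part2 at all.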
> > So my recollection is that Debian uses common-format ar files for the
> > main deb.
> > Then they have 2 compressed tarballs: one for metadata and one for
> > data.
> >
> > This format seems to jibe with many of your requirements:
> >
> > - 'ar' can retrieve individual files from the archive.
> > - The deb file itself is not compressed, but the tarballs inside *are*
> > compressed.
> > - The metadata and data are compressed separately.
> > - Anyone can edit this with normal tooling (ar, tar).
> >
> > In short: why should we invent a new format?
>
> Because nobody knows how to use 'ar', compared to how almost every
> Gentoo user can use 'tar' immediately? Of course we could alternatively
> just use a nested tarball, but I wanted to keep the possibility of
> actually being able to 'tar -xf' it without having to extract nested
> archives.
>

I think 'man ar' could help them pretty easily. That being said, I'm not
wedded to 'ar'; I'm just trying to show how this problem was solved in a
similar domain.

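To make the comparison concrete, here is roughly what the deb layout looks like when built by hand with ar and tar. The member names mirror a real deb per deb(5); modern debs compress the members with xz, .gz is used here for brevity:

```shell
# A deb is a plain ar archive holding a version marker and two tarballs.
echo '2.0' > debian-binary
mkdir -p control data/usr/bin
echo 'Package: bar' > control/control
echo payload > data/usr/bin/bar
tar -C control -czf control.tar.gz .
tar -C data -czf data.tar.gz .
ar rc bar_1.0_amd64.deb debian-binary control.tar.gz data.tar.gz

ar t bar_1.0_amd64.deb                              # list the members
ar p bar_1.0_amd64.deb control.tar.gz | tar -tzf -  # read metadata only
```

Note how the metadata lookup touches only the small control tarball, which is the same property the metadata-first tricks above are trying to get out of a single tar stream.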
>
> --
> Best regards,
> Michał Górny
>