Gentoo Archives: gentoo-portage-dev

From: Zac Medico <zmedico@g.o>
To: gentoo-portage-dev@l.g.o, Alec Warner <antarus@g.o>, "Michał Górny" <mgorny@g.o>
Subject: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format
Date: Sun, 11 Nov 2018 05:46:47
Message-Id: f9fc2771-0916-2cae-7336-bcb5e338c103@gentoo.org
In Reply to: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format by Alec Warner
1 On 11/10/2018 06:37 AM, Alec Warner wrote:
2 >
3 > On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <mgorny@g.o
4 > <mailto:mgorny@g.o>> wrote:
5 >
6 > Hi, everyone.
7 >
8 > Gentoo's tbz2/xpak package format is quite old.  We've made a few
9 > incompatible changes in the past (most notably, allowing non-bzip2
10 > compression and multi-instance naming) but the core design stayed
11 > the same.  I think we should consider changing it, for the reasons
12 > outlined below.
13 >
14 > The rough format description can be found in xpak(5).  Basically, it's
15 > a regular compressed tarball with a binary metadata blob appended
16 > to the end.  As such, it looks like a regular compressed tarball
17 > to the compression tools (with some ignored junk at the end).
18 > The metadata is an entirely custom format and needs dedicated tools
19 > to manipulate.
20 >
21 >
22 > The current format has a few advantages that would probably be worth
23 > preserving:
24 >
25 > + The binary package is a single flat file.
26 >
27 > + It is reasonably compatible with a regular compressed tarball,
28 > so the users can unpack it using standard tools (except for metadata).
29 >
30 > + The metadata is uncompressed and can be quickly found without touching
31 > the compressed data.
32 >
33 > + The metadata can be updated (e.g. as a result of pkgmove) without
34 > touching the compressed data.
35 >
36 >
37 > However, it has a few disadvantages as well:
38 >
39 > - The metadata is an entirely custom binary format, requiring dedicated
40 > tools to read or edit.
41 >
42 > - The metadata format relies on the customary behavior of compression
43 > tools that ignore junk following the compressed data.
44 >
45 >
46 > I agree this is a problem in theory, but I haven't seen it as a problem
47 > in practice. Have you observed any problems around this setup?
48
49 In portage we use head -c to select the compressed data, since zstd
50 doesn't handle the xpak trailer well.
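
The head -c approach depends on knowing how many leading bytes belong to the compressed tarball. Below is a minimal sketch of that calculation. It assumes the trailer layout described in xpak(5): an XPAKPACK...XPAKSTOP segment, followed by a 4-byte big-endian length of that segment and the literal STOP. The function names are illustrative, and this is not Portage's actual code.

```python
import struct


def make_xpak_segment(index: bytes, data: bytes) -> bytes:
    """Build an xpak segment: XPAKPACK, index length, data length, index, data, XPAKSTOP."""
    return (b"XPAKPACK" + struct.pack(">II", len(index), len(data))
            + index + data + b"XPAKSTOP")


def append_xpak(tarball: bytes, segment: bytes) -> bytes:
    """Append the segment plus the 8-byte tail (segment length and STOP)."""
    return tarball + segment + struct.pack(">I", len(segment)) + b"STOP"


def xpak_payload_size(blob: bytes) -> int:
    """Return how many leading bytes of blob are the compressed tarball."""
    if len(blob) < 16 or blob[-4:] != b"STOP" or blob[-16:-8] != b"XPAKSTOP":
        raise ValueError("no xpak trailer found")
    (segment_len,) = struct.unpack(">I", blob[-8:-4])
    # The segment is followed by 8 more bytes: its length and b"STOP".
    payload = len(blob) - (segment_len + 8)
    if payload < 0 or blob[payload:payload + 8] != b"XPAKPACK":
        raise ValueError("corrupt xpak trailer")
    return payload
```

The returned count is the value one would hand to head -c so that the decompressor never sees the trailer.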
51
52 >
53 > - By placing the metadata at the end of file, we make it rather hard to
54 > read the metadata from remote location (via FTP, HTTP) without fetching
55 > the whole file.  [NB: it's technically possible but probably not worth
56 > the effort] 
57 >
58 >
59 > - By requiring the custom format to be at the end of file, we make it
60 > impossible to trivially cover it with an OpenPGP signature without
61 > introducing another custom format.
62 >
63 >
64 > It's trivial to cover with a detached sig, no?
65 >  
66 >
67 >
68 > - While the format might allow for some extensibility, it's rather an
69 > evolutionary dead end.
70 >
71 >
72 > I'm not even sure how to quantify this, it just sounds like your
73 > subjective opinion (which is fine, but it's not factual.)
74
75 Yeah the xpak trailer is flexible enough, but I'm not opposed to
76 supporting a different format.
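
On the earlier point that reading trailing metadata from a remote location is "technically possible": over HTTP it could be done with two suffix Range requests, one for the 16-byte tail and one for the whole xpak segment. The sketch below only builds the request objects. It assumes a server that honours Range headers and the trailer layout from xpak(5); the function names are hypothetical and FTP is not covered.

```python
import struct
import urllib.request


def trailer_request(url: str) -> urllib.request.Request:
    """Request only the last 16 bytes: XPAKSTOP, 4-byte length, STOP."""
    return urllib.request.Request(url, headers={"Range": "bytes=-16"})


def xpak_request(url: str, trailer: bytes) -> urllib.request.Request:
    """Given the 16-byte tail, request the xpak segment plus its 8-byte tail."""
    if len(trailer) != 16 or trailer[:8] != b"XPAKSTOP" or trailer[-4:] != b"STOP":
        raise ValueError("not an xpak trailer")
    (segment_len,) = struct.unpack(">I", trailer[8:12])
    return urllib.request.Request(
        url, headers={"Range": f"bytes=-{segment_len + 8}"}
    )
```

Each request would then be passed to urllib.request.urlopen; a server that ignores Range would return the whole file, so a real client would need to check for a 206 response.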
77
78 >
79 > I think the key points of the new format should be:
80 >
81 > 1. It should reuse common file formats as much as possible, inventing
82 > as little custom code as possible.
83 >
84 > 2. It should allow for easy introspection and editing by users without
85 > dedicated tools.
86 >
87 >
88 > So I'm less confident in the editing use cases; do users edit their
89 > binpkgs on a regular basis?
90
91 Yes, gentoo/profiles/updates package renames and slot moves are a form of
92 this.
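
To illustrate what such an edit involves, here is a rough sketch that parses "move" and "slotmove" lines as they appear in profiles/updates and applies them to a package's metadata. The metadata dict layout ("cp" and "SLOT" keys) and the package names are hypothetical, and only plain category/package atoms are matched; Portage's real logic also handles versioned atoms and other details.

```python
def parse_update_line(line: str):
    """Parse one profiles/updates line into a tuple, or None for blanks and comments."""
    fields = line.split()
    if not fields or fields[0].startswith("#"):
        return None
    if fields[0] == "move" and len(fields) == 3:
        return ("move", fields[1], fields[2])
    if fields[0] == "slotmove" and len(fields) == 4:
        return ("slotmove", fields[1], fields[2], fields[3])
    raise ValueError(f"unrecognized update line: {line!r}")


def apply_update(metadata: dict, update) -> dict:
    """Return a copy of metadata with the update applied if it matches."""
    result = dict(metadata)
    if update is None:
        return result
    if update[0] == "move" and result["cp"] == update[1]:
        result["cp"] = update[2]
    elif (update[0] == "slotmove" and result["cp"] == update[1]
          and result["SLOT"] == update[2]):
        result["SLOT"] = update[3]
    return result
```

With the current format, a change like this only rewrites the xpak trailer; the compressed payload is left alone.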
93
94 >
95 > 3. The metadata should allow for lookup without fetching the whole
96 > binary package.
97 >
98 > 4. The format should allow for some extensions without having to
99 > reinvent the wheel every time.
100 >
101 > 5. It would be nice to preserve the existing advantages.
102 >
103 >
104 > My proposal
105 > ===========
106 >
107 > Basic format
108 > ------------
109 > The base of the format is a regular compressed tarball.  There's no junk
110 > appended to it but the metadata is stored inside it as
111 > /var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
112 > format as possible.
113 >
114 >
115 > Just to clarify, you are suggesting we store the metadata inside the
116 > contents of the binary package itself (e.g. where the other files that
117 > get merged to the liveFS are?) What about collisions?
118 >
119 > E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine
120 > that already has 'machine-images/gentoo-disk-image-1.2.3' installed,
121 > won't it overwrite files in the VDB at qmerge time?
122
123 I haven't looked into it but maybe we can use "nil control directory
124 names" to embed things, like http://savannah.gnu.org/projects/swbis
125 claims to use.
126
127 > This has the following advantages:
128 >
129 > + Binary package is still stored as a single file.
130 >
131 > + It uses a standard compressed .tar format, with minimal customization.
132 >
133 > + The user can easily inspect and modify the packages with standard
134 > tools (tar and the compressor).
135 >
136 > + If we can maintain a reasonable level of vdb compatibility, the user can
137 > even emergency-install a package without causing too much hassle (as it
138 > will be recorded in vdb); ideally Portage would detect this vdb entry
139 > and support fixing the install afterwards.
140 >
141 >
142 > I'm not certain this is really desired.
143
144 Yeah I don't like it either, I'd prefer to keep the metadata someplace
145 where it can't overwrite files in the installed package database.
146
147 >
148 > Optimizing for easy recognition
149 > -------------------------------
150 > In order to make it possible for magic-based tools such as file(1) to
151 > easily distinguish Gentoo binary packages from regular tarballs, we
152 > could (ab)use the volume label field, e.g. use:
153 >
154 >   $ tar -V 'gpkg: app-foo/bar-1' -c ...
155 >
156 > This will add a volume label as the first file entry inside the tarball,
157 > which does not affect extracting but can be trivially matched via magic
158 > rules.
159 >
160 > Note: this is meant to be used as a method for fast binary package
161 > recognition; I don't think we should reject (hand-modified) binary
162 > packages that lack this label.
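
For reference, the volume label written by tar -V is an ordinary 512-byte header whose typeflag is 'V' and whose name field holds the label, which is why a magic rule can match it. The sketch below writes and detects such a label on an uncompressed archive using Python's tarfile module. The helper names are illustrative, and the "gpkg: " label text is just the example from the proposal, not a settled convention.

```python
import io
import tarfile

GNU_VOLHDR = b"V"  # GNU tar typeflag for a volume label entry


def write_labelled_tar(label: str, files: dict) -> bytes:
    """Create an uncompressed tar whose first entry is a volume label."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tf:
        vol = tarfile.TarInfo(label)
        vol.type = GNU_VOLHDR
        tf.addfile(vol)
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_volume_label(tar_bytes: bytes):
    """Return the label if the first header is a volume label, else None."""
    header = tar_bytes[:512]
    # The typeflag lives at offset 156; the name occupies the first 100 bytes.
    if len(header) < 512 or header[156:157] != GNU_VOLHDR:
        return None
    return header[:100].rstrip(b"\0").decode("utf-8")
```

For a compressed package the same check would have to run on the first decompressed block rather than on the raw file.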
163 >
164 >
165 > Optimizing for metadata reading/manipulation performance
166 > --------------------------------------------------------
167 > The main problem with using a single tarball for both metadata and data
168 > is that normally you'd have to decompress everything to reliably unpack
169 > metadata, and recompress everything to update it.  This problem can be
170 > addressed by a few optimization tricks.
171 >
172 >
173 > These performance goals seem a little bit ill-defined.
174 >
175 > 1) Where are users reporting slowness in binpkg operations?
176 > 2) What is the cause of the slowness?
177
178 Yeah I'd like more information here too.
179
180 > Like I could easily see a potential user with many large binpkgs, and
181 > the current implementation causing them issues because
182 > they have to decompress and seek a bunch to read the metadata out of
183 > their 1.2GB binpkg. But I'm pretty sure this isn't most users.
184 >  
185 >
186 >
187 > Firstly, all metadata files are packed to the archive before data files.
188 >  With a slightly customized unpacker, we can stop decompressing as soon
189 > as we're past metadata and avoid decompressing the whole archive.  This
190 > will also make it possible to read metadata from remote files without
191 > fetching far past the compressed metadata block.
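
The "stop as soon as we're past metadata" idea can be approximated with stock tooling when the archive is read as a stream. This is a minimal sketch using Python's tarfile in streaming mode; it assumes metadata members are stored first under var/db/pkg/ as in the proposal, and the helper names are made up for the example.

```python
import io
import tarfile

METADATA_PREFIX = "var/db/pkg/"  # location suggested in the proposal


def build_demo_package(metadata: dict, data: dict) -> bytes:
    """Build a gzip-compressed tar with metadata members before data members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, payload in [*metadata.items(), *data.items()]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_leading_metadata(fileobj) -> dict:
    """Read members in stream order and stop at the first non-metadata member."""
    found = {}
    # "r|*" reads sequentially and detects the compression automatically.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tf:
        for member in tf:
            if not member.name.startswith(METADATA_PREFIX):
                break  # everything after this point is package data
            if member.isfile():
                found[member.name] = tf.extractfile(member).read()
    return found
```

Because the reader only pulls as much of the stream as it has consumed, the same loop could sit on top of a network response and stop shortly after the metadata.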
192 >
193 >
194 > So this seems to basically go against your goals of simple common tooling?
195 >  
196 >
197 >
198 > Secondly, if we're up for some more tricks, we could technically split
199 > the tarball into metadata and data blocks compressed separately.  This
200 > will need a bit of archiver customization but it will make it possible
201 > to decompress the metadata part without even touching compressed data,
202 > and to replace it without recompressing data.
203 >
204 > What's important is that both tricks proposed maintain backwards
205 > compatibility with regular compressed tarballs.  That is, the user will
206 > still be able to extract it with regular archiving tools.
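
One way to picture the second trick: compress the metadata part and the data part of one tar stream as separate gzip members and concatenate them. gzip tools treat concatenated members as a single stream, so regular extraction still works, while a custom reader can stop after the first member. The sketch below assumes gzip as the compressor, at least one data file, and hypothetical helper names; it is not a proposed on-disk layout.

```python
import gzip
import io
import tarfile
import zlib


def build_split_archive(metadata: dict, data: dict) -> bytes:
    """Write one tar stream, then compress its metadata and data parts separately."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.GNU_FORMAT) as tf:
        for name, payload in [*metadata.items(), *data.items()]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    tar_bytes = raw.getvalue()
    # Split at the header of the first data member (assumes data is non-empty).
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        split = next(m.offset for m in tf if m.name in data)
    return gzip.compress(tar_bytes[:split]) + gzip.compress(tar_bytes[split:])


def read_metadata_only(blob: bytes) -> dict:
    """Decompress just the first gzip member and list the files in it."""
    inflater = zlib.decompressobj(zlib.MAX_WBITS | 16)  # gzip container
    # Decompression stops at the end of the first gzip member; the
    # remaining bytes are left untouched in inflater.unused_data.
    head = inflater.decompress(blob)
    with tarfile.open(fileobj=io.BytesIO(head), mode="r:") as tf:
        return {m.name: tf.extractfile(m).read() for m in tf if m.isfile()}
```

Replacing the metadata would then mean rewriting only the first member and leaving the second member's bytes as they are.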
207 >
208 >
209 > So my recollection is that debian uses common format AR files for the
210 > main deb.
211 > Then they have 2 compressed tarballs, one for metadata, and one for data.
212 >
213 > This format seems to jibe with many of your requirements:
214 >
215 >  - 'ar' can retrieve individual files from the archive.
216 >  - The deb file itself is not compressed, but the tarballs inside *are*
217 > compressed.
218 >  - The metadata and data are compressed separately.
219 >  - Anyone can edit this with normal tooling (ar, tar)
220 >
221 > In short; why should we invent a new format?
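
For comparison, the ar container mentioned above is simple enough to read without dedicated tools: an 8-byte magic string, then for each member a 60-byte text header followed by the member's bytes, padded to an even length. The following is a small sketch of a writer and reader for the common (short-name) variant. The member names mirror the ones used in .deb files, and the code ignores GNU/BSD long-name extensions.

```python
import io

AR_MAGIC = b"!<arch>\n"


def ar_pack(members) -> bytes:
    """Pack (name, payload) pairs into an ar archive with short names."""
    out = io.BytesIO()
    out.write(AR_MAGIC)
    for name, payload in members:
        # Fields: name(16) mtime(12) uid(6) gid(6) mode(8) size(10) magic(2)
        header = "{:<16}{:<12}{:<6}{:<6}{:<8}{:<10}`\n".format(
            name, 0, 0, 0, "100644", len(payload))
        out.write(header.encode("ascii"))
        out.write(payload)
        if len(payload) % 2:
            out.write(b"\n")  # members are aligned to even offsets
    return out.getvalue()


def ar_members(blob: bytes):
    """Yield (name, payload) for each member of an ar archive."""
    if not blob.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    pos = len(AR_MAGIC)
    while pos + 60 <= len(blob):
        header = blob[pos:pos + 60]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header")
        name = header[:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        pos += 60
        yield name, blob[pos:pos + size]
        pos += size + (size % 2)
```

Since every member's size is in its header, a reader can skip straight past a large data member to reach any other one.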
222
223 Maybe we can borrow some ideas from
224 http://savannah.gnu.org/projects/swbis which claims to be capable of
225 creating and verifying a tarball with GPG signatures embedded in the
226 tarball.
227
228 >
229 > Adding OpenPGP signatures
230 > -------------------------
231 > This is the main XXX here.
232 >
233 > Technically, the most obvious solution is to cover the entire tarball
234 > with an OpenPGP signature.  However, this has the disadvantage that
235 > the verification requires fetching the whole file.
236 >
237 > I will look into the possibility of having partial signatures.
238 >
239 >
240 > --
241 > Best regards,
242 > Michał Górny
243 >
244
245
246 --
247 Thanks,
248 Zac
