Gentoo Archives: gentoo-portage-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format
Date: Sun, 11 Nov 2018 08:29:23
Message-Id: 1541924954.1150.8.camel@gentoo.org
In Reply to: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format by Alec Warner
1 On Sat, 2018-11-10 at 09:37 -0500, Alec Warner wrote:
2 > On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <mgorny@g.o> wrote:
3 >
4 > > Hi, everyone.
5 > >
6 > > Gentoo's tbz2/xpak package format is quite old. We've made a few
7 > > incompatible changes in the past (most notably, allowing non-bzip2
8 > > compression and multi-instance naming) but the core design stayed
9 > > the same. I think we should consider changing it, for the reasons
10 > > outlined below.
11 > >
12 > > The rough format description can be found in xpak(5). Basically, it's
13 > > a regular compressed tarball with a binary metadata blob appended
14 > > to the end. As such, it looks like a regular compressed tarball
15 > > to the compression tools (with some ignored junk at the end).
16 > > The metadata uses an entirely custom format and needs dedicated tools
17 > > to manipulate.
18 > >
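
A minimal sketch of the "ignored junk at the end" behaviour described above, using Python's bz2 module. The trailer here is a made-up stand-in, not the real xpak layout (see xpak(5) for that); the point is only that the decompressor stops at the end of the compressed stream and leaves whatever follows untouched.

```python
import bz2
import io
import tarfile

# Build a tiny tar archive in memory and compress it with bzip2.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tar:
    data = b"#!/bin/sh\n"
    info = tarfile.TarInfo("usr/bin/bar")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
compressed = bz2.compress(raw.getvalue())

# Hypothetical stand-in for the appended metadata; NOT the real xpak format.
trailer = b"FAKE-METADATA-BLOB"
package = compressed + trailer

# The decompressor stops at the end of the bzip2 stream and exposes the rest.
decomp = bz2.BZ2Decompressor()
tar_bytes = decomp.decompress(package)
leftover = decomp.unused_data

# The decompressed part is still an ordinary tar archive.
names = tarfile.open(fileobj=io.BytesIO(tar_bytes)).getnames()
```

A dedicated tool would go the other way: find the trailer from the end of the file and never touch the compressed part.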
19 > >
20 > > The current format has a few advantages that would probably be worth
21 > > preserving:
22 > >
23 > > + The binary package is a single flat file.
24 > >
25 > > + It is reasonably compatible with a regular compressed tarball,
26 > > so users can unpack it using standard tools (except for metadata).
27 > >
28 > > + The metadata is uncompressed and can be quickly found without touching
29 > > the compressed data.
30 > >
31 > > + The metadata can be updated (e.g. as a result of pkgmove) without
32 > > touching the compressed data.
33 > >
34 > >
35 > > However, it has a few disadvantages as well:
36 > >
37 > > - The metadata uses an entirely custom binary format, requiring dedicated
38 > > tools to read or edit.
39 > >
40 > > - The metadata format relies on the customary behavior of compression
41 > > tools that ignore junk following the compressed data.
42 > >
43 >
44 > I agree this is a problem in theory, but I haven't seen it as a problem in
45 > practice. Have you observed any problems around this setup?
46
47 Historically one of the parallel compressor variants did not support
48 this.
49
50 > >
51 > > - By placing the metadata at the end of the file, we make it rather hard to
52 > > read the metadata from a remote location (via FTP, HTTP) without fetching
53 > > the whole file. [NB: it's technically possible but probably not worth
54 > > the effort]
55 >
56 >
57 > > - By requiring the custom format to be at the end of the file, we make it
58 > > impossible to trivially cover it with an OpenPGP signature without
59 > > introducing another custom format.
60 > >
61 >
62 > It's trivial to cover with a detached sig, no?
63 >
64 >
65 > >
66 > > - While the format might allow for some extensibility, it's rather
67 > > an evolutionary dead end.
68 > >
69 >
70 > I'm not even sure how to quantify this; it just sounds like your subjective
71 > opinion (which is fine, but it's not factual.)
72 >
73 >
74 > >
75 > >
76 > > I think the key points of the new format should be:
77 > >
78 > > 1. It should reuse common file formats as much as possible,
79 > > inventing as little custom code as possible.
80 > >
81 > > 2. It should allow for easy introspection and editing by users without
82 > > dedicated tools.
83 > >
84 >
85 > So I'm less confident in the editing use cases; do users edit their binpkgs
86 > on a regular basis?
87
88 It's useful for debugging stuff. I had to use hexedit on xpak
89 in the past. Believe me, it's nowhere close to pleasant.
90
91 > > 3. The metadata should allow for lookup without fetching the whole
92 > > binary package.
93 > >
94 > > 4. The format should allow for some extensions without having to
95 > > reinvent the wheel every time.
96 > >
97 > > 5. It would be nice to preserve the existing advantages.
98 > >
99 > >
100 > > My proposal
101 > > ===========
102 > >
103 > > Basic format
104 > > ------------
105 > > The base of the format is a regular compressed tarball. There's no junk
106 > > appended to it but the metadata is stored inside it as
107 > > /var/db/pkg/${PF}. The contents are as compatible with the actual vdb
108 > > format as possible.
109 > >
110 >
111 > Just to clarify, you are suggesting we store the metadata inside the
112 > contents of the binary package itself (e.g. where the other files that get
113 > merged to the liveFS are?) What about collisions?
114 >
115 > E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
116 > already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
117 > overwrite files in the VDB at qmerge time?
118
119 Portage will obviously move the files out, and process them as metadata.
120 The idea is to precisely use a directory that can't be normally part
121 of binary packages, so can't cause collisions with real files (even if
122 they're very unlikely to ever happen).
123
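
A rough sketch of the "move the files out" step, assuming metadata members can be recognized purely by a reserved path prefix. The prefix, the category/PF layout below it, and the helper names are illustrative assumptions, not anything Portage implements today.

```python
import io
import tarfile

# Assumed reserved prefix, following the /var/db/pkg/${PF} idea in the proposal.
VDB_PREFIX = "var/db/pkg/"


def add_bytes(tar, name, payload):
    """Add an in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def split_members(tar):
    """Partition archive members into (metadata, payload) by the reserved prefix."""
    meta, payload = [], []
    for member in tar.getmembers():
        (meta if member.name.startswith(VDB_PREFIX) else payload).append(member)
    return meta, payload


# Write a toy package: one metadata file, one real file.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    add_bytes(tar, "var/db/pkg/app-foo/bar-1/SLOT", b"0\n")
    add_bytes(tar, "usr/bin/bar", b"#!/bin/sh\n")

# Read it back and separate what would be merged from what is metadata.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    meta, payload = split_members(tar)
    meta_names = [m.name for m in meta]
    payload_names = [m.name for m in payload]
```

Only the payload members would be merged to the live filesystem; the metadata members would be consumed by the package manager instead.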
124 > > This has the following advantages:
125 > >
126 > > + Binary package is still stored as a single file.
127 > >
128 > > + It uses a standard compressed .tar format, with minimal customization.
129 > >
130 > > + The user can easily inspect and modify the packages with standard
131 > > tools (tar and the compressor).
132 > >
133 > > + If we can maintain a reasonable level of vdb compatibility, the user can
134 > > even emergency-install a package without causing too much hassle (as it
135 > > will be recorded in vdb); ideally Portage would detect this vdb entry
136 > > and support fixing the install afterwards.
137 > >
138 >
139 > I'm not certain this is really desired.
140
141 Are you saying it's better that a user emergency-installs a package
142 without recording it in vdb, and ends up with a mess of collisions
143 and untracked files?
144
145 Just because you don't like some use case doesn't mean it's not gonna
146 happen. Either you prepare for it and make the best of it, or you
147 pretend it's not gonna happen and cause extra pain to users.
148
149 > > Optimizing for easy recognition
150 > > -------------------------------
151 > > In order to make it possible for magic-based tools such as file(1) to
152 > > easily distinguish Gentoo binary packages from regular tarballs, we
153 > > could (ab)use the volume label field, e.g. use:
154 > >
155 > > $ tar -V 'gpkg: app-foo/bar-1' -c ...
156 > >
157 > > This will add a volume label as the first file entry inside the tarball,
158 > > which does not affect extracting but can be trivially matched via magic
159 > > rules.
160 > >
161 > > Note: this is meant to be used as a method for fast binary package
162 > > recognition; I don't think we should reject (hand-modified) binary
163 > > packages that lack this label.
164 > >
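
To illustrate what a magic rule would be matching: GNU tar stores the volume label as a header block with typeflag 'V' (byte offset 156), with the label text in the name field (the first 100 bytes). The sketch below checks the first 512-byte block of the uncompressed stream. The function names are made up for illustration, and the test block is built with Python's tarfile rather than GNU tar itself.

```python
import tarfile

BLOCK = 512  # size of one tar header block


def volume_label(header: bytes):
    """Return the GNU volume label from the first tar header block, or None."""
    if len(header) < BLOCK or header[156:157] != b"V":
        return None
    # The name field is NUL-terminated within its 100 bytes.
    return header[:100].split(b"\0", 1)[0].decode("utf-8", "replace")


def make_label_block(label: str) -> bytes:
    """Build a header block shaped like the one 'tar -V' puts first."""
    info = tarfile.TarInfo(label)
    info.type = b"V"
    return info.tobuf(format=tarfile.GNU_FORMAT)


labelled = make_label_block("gpkg: app-foo/bar-1")
plain = tarfile.TarInfo("usr/bin/bar").tobuf(format=tarfile.GNU_FORMAT)

label = volume_label(labelled)  # label text for a labelled archive
no_label = volume_label(plain)  # None for an ordinary first member
```

Since the label is optional in the proposal, a None result here would only mean "not recognized quickly", not "not a binary package".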
165 > >
166 > > Optimizing for metadata reading/manipulation performance
167 > > --------------------------------------------------------
168 > > The main problem with using a single tarball for both metadata and data
169 > > is that normally you'd have to decompress everything to reliably unpack
170 > > metadata, and recompress everything to update it. This problem can be
171 > > addressed by a few optimization tricks.
172 > >
173 >
174 > These performance goals seem a little bit ill-defined.
175 >
176 > 1) Where are users reporting slowness in binpkg operations?
177 > 2) What is the cause of the slowness?
178
179 Those are optimizations to not cause slowness compared to the current
180 format. The main use case is recreating the package index, which would require
181 rereading the metadata of all binary packages.
182
183 > Like I could easily see a potential user with many large binpkgs, and the
184 > current implementation causing them issues because
185 > they have to decompress and seek a bunch to read the metadata out of their
186 > 1.2GB binpkg. But I'm pretty sure this isn't most users.
187 >
188 >
189 > >
190 > > Firstly, all metadata files are packed to the archive before data files.
191 > > With a slightly customized unpacker, we can stop decompressing as soon
192 > > as we're past metadata and avoid decompressing the whole archive. This
193 > > will also make it possible to read metadata from remote files without
194 > > fetching far past the compressed metadata block.
195 > >
196 >
197 > So this seems to basically go against your goals of simple common tooling?
198
199 No. My goal is to make it compatible with simple common tooling. You
200 can still use the simple tooling to read/write them. The optimized
201 tools are only needed to efficiently handle special use cases.
202
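
As a sketch of what the "slightly customized unpacker" for the first trick could look like: Python's tarfile in streaming mode reads members in order, so a reader can stop at the first member outside the metadata directory and never decompress the rest. The prefix and function name are assumptions made for illustration.

```python
import io
import tarfile

VDB_PREFIX = "var/db/pkg/"  # assumed reserved metadata prefix


def read_leading_metadata(fileobj):
    """Read metadata members from the front of a compressed tarball.

    Stops at the first member that is not metadata, so the remaining
    compressed data is never decompressed.
    """
    meta = {}
    # "r|*" is streaming mode: sequential reads only, compression auto-detected.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.type == b"V":
                continue  # optional volume label entry, carries no data
            if not member.name.startswith(VDB_PREFIX):
                break  # past the metadata block; stop here
            if member.isfile():
                meta[member.name] = tar.extractfile(member).read()
    return meta


# Toy package: volume label, then metadata, then payload.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    label = tarfile.TarInfo("gpkg: app-foo/bar-1")
    label.type = b"V"
    tar.addfile(label)
    for name, payload in [
        ("var/db/pkg/app-foo/bar-1/SLOT", b"0\n"),
        ("usr/bin/bar", b"#!/bin/sh\n"),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buf.seek(0)
metadata = read_leading_metadata(buf)
```

The same file still unpacks with plain tar; the streaming reader is only a shortcut for tools that need nothing but the metadata.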
203 > > Secondly, if we're up for some more tricks, we could technically split
204 > > the tarball into metadata and data blocks compressed separately. This
205 > > will need a bit of archiver customization but it will make it possible
206 > > to decompress the metadata part without even touching compressed data,
207 > > and to replace it without recompressing data.
208 > >
209 > > What's important is that both tricks proposed maintain backwards
210 > > compatibility with regular compressed tarballs. That is, the user will
211 > > still be able to extract it with regular archiving tools.
212 >
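
A rough sketch of the second trick, assuming bzip2 and relying on the fact that concatenated bzip2 streams decompress as one. The metadata and the data are compressed as two separate streams; only the second carries the tar end-of-archive marker, so the concatenation is still one valid tarball. All helper names are made up for illustration.

```python
import bz2
import io
import tarfile

END_OF_ARCHIVE = b"\0" * 1024  # two zero blocks terminate a tar archive


def member_blocks(name, payload):
    """Raw tar header plus padded data for one file, with no end-of-archive marker."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    padding = b"\0" * (-len(payload) % 512)
    return info.tobuf(format=tarfile.GNU_FORMAT) + payload + padding


def build(meta_files, data_files):
    """Compress metadata and data as two independent streams in one file."""
    meta = b"".join(member_blocks(n, p) for n, p in meta_files)
    data = b"".join(member_blocks(n, p) for n, p in data_files) + END_OF_ARCHIVE
    return bz2.compress(meta) + bz2.compress(data)


def split_streams(package):
    """Decompress only the first stream; return (metadata tar bytes, compressed rest)."""
    decomp = bz2.BZ2Decompressor()
    meta_tar = decomp.decompress(package)
    return meta_tar, decomp.unused_data


def replace_metadata(package, meta_files):
    """Swap the metadata stream while reusing the compressed data stream as-is."""
    _, compressed_data = split_streams(package)
    new_meta = b"".join(member_blocks(n, p) for n, p in meta_files)
    return bz2.compress(new_meta) + compressed_data


slot = "var/db/pkg/app-foo/bar-1/SLOT"
pkg = build([(slot, b"0\n")], [("usr/bin/bar", b"#!/bin/sh\n")])

# A regular reader sees a single ordinary tarball.
names = tarfile.open(fileobj=io.BytesIO(pkg), mode="r:bz2").getnames()

# Updating metadata leaves the compressed data bytes untouched.
updated = replace_metadata(pkg, [(slot, b"1\n")])
with tarfile.open(fileobj=io.BytesIO(updated), mode="r:bz2") as tar:
    new_slot = tar.extractfile(slot).read()
```

Whether every compressor and every tar implementation copes with this layout would need checking per tool; the sketch only shows that the idea works with bzip2 and Python's tarfile.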
213 >
214 > So my recollection is that Debian uses common-format ar files for the main
215 > deb.
216 > Then they have 2 compressed tarballs, one for metadata, and one for data.
217 >
218 > This format seems to jibe with many of your requirements:
219 >
220 > - 'ar' can retrieve individual files from the archive.
221 > - The deb file itself is not compressed, but the tarballs inside *are*
222 > compressed.
223 > - The metadata and data are compressed separately.
224 > - Anyone can edit this with normal tooling (ar, tar)
225 >
226 > In short: why should we invent a new format?
227
228 Because nobody knows how to use 'ar', compared to how almost every
229 Gentoo user can use 'tar' immediately? Of course we could alternatively
230 just use a nested tarball but I wanted to keep the possibility
231 of actually being able to 'tar -xf' it without having to extract nested
232 archives.
233
234 --
235 Best regards,
236 Michał Górny

Attachments

File name MIME type
signature.asc application/pgp-signature
