Gentoo Archives: gentoo-portage-dev

From: Alec Warner <antarus@g.o>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format
Date: Sat, 10 Nov 2018 14:37:54
Message-Id: CAAr7Pr-7qO_k7_Xj6kC35+mUSG0-rpRm0kdm+hN=nm5gvoTwMg@mail.gmail.com
In Reply to: [gentoo-portage-dev] [RFC] Improving Gentoo package format by "Michał Górny"
1 On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <mgorny@g.o> wrote:
2
3 > Hi, everyone.
4 >
5 > The Gentoo's tbz2/xpak package format is quite old. We've made a few
6 > incompatible changes in the past (most notably, allowing non-bzip2
7 > compression and multi-instance naming) but the core design stayed
8 > the same. I think we should consider changing it, for the reasons
9 > outlined below.
10 >
11 > The rough format description can be found in xpak(5). Basically, it's
12 > a regular compressed tarball with binary metadata blob appended
13 > to the end. As such, it looks like a regular compressed tarball
14 > to the compression tools (with some ignored junk at the end).
15 > The metadata is entirely custom format and needs dedicated tools
16 > to manipulate.
17 >
18 >
19 > The current format has a few advantages whose preserving would probably
20 > be worthwhile:
21 >
22 > + The binary package is a single flat file.
23 >
24 > + It is reasonably compatible with regular compressed tarball,
25 > so the users can unpack it using standard tools (except for metadata).
26 >
27 > + The metadata is uncompressed and can be quickly found without touching
28 > the compressed data.
29 >
30 > + The metadata can be updated (e.g. as result of pkgmove) without
31 > touching the compressed data.
32 >
33 >
34 > However, it has a few disadvantages as well:
35 >
36 > - The metadata is entirely custom binary format, requiring dedicated
37 > tools to read or edit.
38 >
39 > - The metadata format is relying on customary behavior of compression
40 > tools that ignore junk following the compressed data.
41 >
42
43 I agree this is a problem in theory, but I haven't seen it as a problem in
44 practice. Have you observed any problems around this setup?
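For reference, here is a minimal Python sketch of how the trailing metadata is located, based on my reading of xpak(5). The layout assumed here is: the xpak blob, then a 4-byte big-endian length, then "STOP". The helper names (build_xpak, append_xpak, read_xpak) are made up for illustration and are not Portage APIs, and the "tarball" is just placeholder bytes.

```python
import struct

def build_xpak(entries):
    """Build an xpak blob from a dict of name -> bytes (layout as I read xpak(5))."""
    index = b""
    data = b""
    for name, value in entries.items():
        encoded = name.encode()
        # Index entry: name length, name, offset into data, length of data.
        index += struct.pack(">I", len(encoded)) + encoded
        index += struct.pack(">II", len(data), len(value))
        data += value
    return (b"XPAKPACK" + struct.pack(">II", len(index), len(data))
            + index + data + b"XPAKSTOP")

def append_xpak(tarball, xpak):
    """Append the xpak blob, its length, and the STOP marker to the tarball bytes."""
    return tarball + xpak + struct.pack(">I", len(xpak)) + b"STOP"

def read_xpak(binpkg):
    """Locate and decode the metadata by looking only at the tail of the file."""
    if binpkg[-4:] != b"STOP":
        raise ValueError("no xpak trailer")
    (xpak_len,) = struct.unpack(">I", binpkg[-8:-4])
    xpak = binpkg[-8 - xpak_len:-8]
    if xpak[:8] != b"XPAKPACK" or xpak[-8:] != b"XPAKSTOP":
        raise ValueError("corrupt xpak blob")
    index_len, data_len = struct.unpack(">II", xpak[8:16])
    index = xpak[16:16 + index_len]
    data = xpak[16 + index_len:16 + index_len + data_len]
    entries = {}
    pos = 0
    while pos < index_len:
        (name_len,) = struct.unpack(">I", index[pos:pos + 4])
        name = index[pos + 4:pos + 4 + name_len].decode()
        pos += 4 + name_len
        offset, length = struct.unpack(">II", index[pos:pos + 8])
        pos += 8
        entries[name] = data[offset:offset + length]
    return entries

# Placeholder bytes stand in for the real compressed tarball.
fake_tarball = b"pretend this is a bzip2-compressed tarball"
binpkg = append_xpak(fake_tarball,
                     build_xpak({"CATEGORY": b"app-foo\n", "PF": b"bar-1\n"}))
metadata = read_xpak(binpkg)
```

The reader never touches the compressed part, which is the property the current format relies on; the flip side is that the decompressor has to tolerate the trailing bytes.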
45
46
47 >
48 > - By placing the metadata at the end of file, we make it rather hard to
49 > read the metadata from remote location (via FTP, HTTP) without fetching
50 > the whole file. [NB: it's technically possible but probably not worth
51 > the effort]
52
53
54 > - By requiring the custom format to be at the end of file, we make it
55 > impossible to trivially cover it with a OpenPGP signature without
56 > introducing another custom format.
57 >
58
59 It's trivial to cover with a detached sig, no?
60
61
62 >
63 > - While the format might allow for some extensibility, it's rather
64 > evolutionary dead end.
65 >
66
67 I'm not even sure how to quantify this; it just sounds like your subjective
68 opinion (which is fine, but it's not factual).
69
70
71 >
72 >
73 > I think the key points of the new format should be:
74 >
75 > 1. It should reuse common file formats as much as possible, with
76 > inventing as little custom code as possible.
77 >
78 > 2. It should allow for easy introspection and editing by users without
79 > dedicated tools.
80 >
81
82 So I'm less confident in the editing use cases; do users edit their binpkgs
83 on a regular basis?
84
85
86 >
87 > 3. The metadata should allow for lookup without fetching the whole
88 > binary package.
89 >
90 > 4. The format should allow for some extensions without having to
91 > reinvent the wheel every time.
92 >
93 > 5. It would be nice to preserve the existing advantages.
94 >
95 >
96 > My proposal
97 > ===========
98 >
99 > Basic format
100 > ------------
101 > The base of the format is a regular compressed tarball. There's no junk
102 > appended to it but the metadata is stored inside it as
103 > /var/db/pkg/${PF}. The contents are as compatible with the actual vdb
104 > format as possible.
105 >
106
107 Just to clarify, you are suggesting we store the metadata inside the
108 contents of the binary package itself (i.e. alongside the other files that
109 get merged to the live filesystem)? What about collisions?
110
111 E.g. if I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
112 already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
113 overwrite files in the VDB at qmerge time?
114
115
116 >
117 > This has the following advantages:
118 >
119 > + Binary package is still stored as a single file.
120 >
121 > + It uses a standard compressed .tar format, with minimal customization.
122 >
123 > + The user can easily inspect and modify the packages with standard
124 > tools (tar and the compressor).
125 >
126 > + If we can maintain reasonable level of vdb compatibility, the user can
127 > even emergency-install a package without causing too much hassle (as it
128 > will be recorded in vdb); ideally Portage would detect this vdb entry
129 > and support fixing the install afterwards.
130 >
131
132 I'm not certain this is really desired.
133
134
135 >
136 >
137 > Optimizing for easy recognition
138 > -------------------------------
139 > In order to make it possible for magic-based tools such as file(1) to
140 > easily distinguish Gentoo binary packages from regular tarballs, we
141 > could (ab)use the volume label field, e.g. use:
142 >
143 > $ tar -V 'gpkg: app-foo/bar-1' -c ...
144 >
145 > This will add a volume label as the first file entry inside the tarball,
146 > which does not affect extracting but can be trivially matched via magic
147 > rules.
148 >
149 > Note: this is meant to be used as a method for fast binary package
150 > recognition; I don't think we should reject (hand-modified) binary
151 > packages that lack this label.
152 >
153 >
154 > Optimizing for metadata reading/manipulation performance
155 > --------------------------------------------------------
156 > The main problem with using a single tarball for both metadata and data
157 > is that normally you'd have to decompress everything to reliably unpack
158 > metadata, and recompress everything to update it. This problem can be
159 > addressed by a few optimization tricks.
160 >
161
162 These performance goals seem a little ill-defined.
163
164 1) Where are users reporting slowness in binpkg operations?
165 2) What is the cause of the slowness?
166
167 I could easily see a user with many large binpkgs for whom the current
168 implementation causes issues, because they have to decompress and seek
169 a bunch to read the metadata out of their 1.2GB binpkg. But I'm pretty
170 sure this isn't most users.
171
172
173 >
174 > Firstly, all metadata files are packed to the archive before data files.
175 > With a slightly customized unpacker, we can stop decompressing as soon
176 > as we're past metadata and avoid decompressing the whole archive. This
177 > will also make it possible to read metadata from remote files without
178 > fetching far past the compressed metadata block.
179 >
180
181 So this seems to basically go against your goals of simple common tooling?
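To make the discussion concrete, here is a rough sketch of the "stop once we're past the metadata" reader using Python's stock tarfile module in streaming mode. The var/db/pkg/ prefix follows the proposal; the package contents and the helper names (make_package, read_metadata) are invented for illustration.

```python
import io
import tarfile

METADATA_PREFIX = "var/db/pkg/"  # in-archive metadata location, per the proposal

def make_package():
    """Build a small .tar.gz in memory with metadata members first, then data."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        members = [
            ("var/db/pkg/app-foo/bar-1/SLOT", b"0\n"),
            ("var/db/pkg/app-foo/bar-1/KEYWORDS", b"amd64\n"),
            ("usr/bin/bar", b"\0" * 4096),  # stand-in for real package data
        ]
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

def read_metadata(fileobj):
    """Read members sequentially and stop at the first non-metadata entry."""
    metadata = {}
    # "r|*" is a forward-only stream with transparent compression detection.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.name.startswith(METADATA_PREFIX):
                break  # past the metadata block; do not read any further
            if member.isfile():
                key = member.name[len(METADATA_PREFIX):]
                metadata[key] = tar.extractfile(member).read()
    return metadata

metadata = read_metadata(io.BytesIO(make_package()))
```

The archive itself stays an ordinary tarball; the only non-standard part is the convention that metadata comes first and the reader that knows when to stop. The stream is read in blocks, so a few bytes past the metadata may still be fetched.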
182
183
184 >
185 > Secondly, if we're up for some more tricks, we could technically split
186 > the tarball into metadata and data blocks compressed separately. This
187 > will need a bit of archiver customization but it will make it possible
188 > to decompress the metadata part without even touching compressed data,
189 > and to replace it without recompressing data.
190 >
191 > What's important is that both tricks proposed maintain backwards
192 > compatibility with regular compressed tarballs. That is, the user will
193 > still be able to extract it with regular archiving tools.
194
195
196 So my recollection is that Debian uses the common ar format for the main
197 .deb file.
198 Inside it are 2 compressed tarballs, one for metadata and one for data.
199
200 This format seems to jibe with many of your requirements:
201
202 - 'ar' can retrieve individual files from the archive.
203 - The deb file itself is not compressed, but the tarballs inside *are*
204 compressed.
205 - The metadata and data are compressed separately.
206 - Anyone can edit this with normal tooling (ar, tar)
207
208 In short: why should we invent a new format?
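As a rough illustration of how little code the ar container needs, here is a Python sketch that packs and reads the common ar layout (an 8-byte magic followed by 60-byte member headers). The member names mirror what I recall of a .deb (debian-binary, control.tar.gz, data.tar.gz); the payloads are placeholders rather than real tarballs, and the helper names are made up.

```python
import io

AR_MAGIC = b"!<arch>\n"

def ar_pack(members):
    """Pack (name, bytes) pairs into a common-format ar archive."""
    out = io.BytesIO()
    out.write(AR_MAGIC)
    for name, payload in members:
        # Fields: name(16) mtime(12) uid(6) gid(6) mode(8) size(10) end(2)
        header = "{:<16}{:<12}{:<6}{:<6}{:<8}{:<10}`\n".format(
            name, 0, 0, 0, "100644", len(payload))
        out.write(header.encode("ascii"))
        out.write(payload)
        if len(payload) % 2:
            out.write(b"\n")  # member data is padded to an even offset
    return out.getvalue()

def ar_members(archive):
    """Yield (name, data_offset, size) for each member without reading its data."""
    if archive[:8] != AR_MAGIC:
        raise ValueError("not an ar archive")
    pos = 8
    while pos < len(archive):
        header = archive[pos:pos + 60]
        if len(header) < 60 or header[58:60] != b"`\n":
            raise ValueError("bad member header")
        name = header[:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        yield name, pos + 60, size
        pos += 60 + size + (size % 2)

def ar_extract(archive, wanted):
    """Return the raw bytes of one member, e.g. only the metadata tarball."""
    for name, start, size in ar_members(archive):
        if name == wanted:
            return archive[start:start + size]
    raise KeyError(wanted)

deb = ar_pack([
    ("debian-binary", b"2.0\n"),
    ("control.tar.gz", b"metadata tarball!"),  # placeholder bytes
    ("data.tar.gz", b"data tarball"),          # placeholder bytes
])
control = ar_extract(deb, "control.tar.gz")
```

Because the outer container is uncompressed and every header carries the member size, a reader can skip straight to the metadata member, and replacing it does not require recompressing the data member.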
209
210
211 >
212 >
213 > Adding OpenPGP signatures
214 > -------------------------
215 > This is the main XXX here.
216 >
217 > Technically, the most obvious solution is to cover the entire tarball
218 > with OpenPGP signature. However, this has the disadvantage that
219 > the verification requires fetching the whole file.
220 >
221 > I will look into possibility of having partial signatures.
222 >
223 >
224 > --
225 > Best regards,
226 > Michał Górny
227 >
