Hi,

Here's a pre-GLEP draft based on the earlier discussion on the
gentoo-portage-dev mailing list. The specification uses the GLEP form
as it provides for cleanly specifying the motivation and rationale.

(Note: the number assignment is not official; I just took the next
number to satisfy the GLEP converter script.)

Also available via HTTPS:

rst: https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html

---
GLEP: 78
Title: Gentoo binary package container format
Author: Michał Górny <mgorny@g.o>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-16
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are listed. Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed. The rationale for
various design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: a header-oriented compressed
.tar format (used to hold package files) and a trailer-oriented custom
XPAK format (used to hold metadata) [#MAN-XPAK]_. The format has
already been extended incompatibly twice.

The first time, support for storing multiple successive builds of
a binary package for a single ebuild version was added. This feature
relies on appending a hyphen followed by an integer to the package
filename. It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was added.
When a format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used. For backwards compatibility, Portage still
defaults to using bzip2; the compression program can be switched using
the ``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies. In particular, the behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


Problems with the current binary package format
-----------------------------------------------

The following problems were identified with the package format currently
in use:

1. **The packages rely on a custom binary archive format to store
   metadata.** It is entirely Gentoo-invented, and requires dedicated
   tooling to work with it. In fact, the reference implementation
   in Portage does not even include a CLI tool to work with tbz2
   packages; an unofficial implementation is provided as part
   of the portage-utils toolkit [#PORTAGE-UTILS]_.

2. **The format relies on an obscure compressor feature of ignoring
   trailing garbage.** While this behavior is traditionally implemented
   by many compressors, the original reasons for it have long become
   irrelevant and it is not surprising that new compressors do not
   support it. In particular, Portage already hit this problem twice:
   once when users replaced bzip2 with the parallel-capable pbzip2
   implementation [#PBZIP2]_, and the second time when support for
   the zstd compressor was added [#ZSTD]_.

3. **Placing metadata at the end of the file makes partial fetches
   complex.** While it is technically possible to obtain package
   metadata remotely without fetching the whole package, it usually
   requires e.g. 2-3 HTTP requests with a rather complex driver. For
   comparison, if metadata were placed at the beginning of the file,
   an early-terminated pipeline with a single fetch request would
   suffice.

4. **Extending the format with OpenPGP signatures is non-trivial.**
   Depending on the implementation details, it either requires fetching
   an additional detached signature, breaking backwards compatibility,
   or introducing more custom logic to reassemble OpenPGP packets.

5. **Metadata is not compressed.** This is not a significant problem;
   it is listed only for completeness.


Goals for a new container format
--------------------------------

The following goals have been set for a replacement format:

1. **The packages must remain contained in a single file.** As a matter
   of user convenience, it should be possible to transfer binary
   packages without having to use multiple files, and to install them
   from any location.

2. **The file format must be entirely based on common file formats,
   respecting best practices, with as little customization as necessary
   to satisfy the requirements.** In particular, it is unacceptable
   to create new binary formats.

3. **The file format should provide for partial fetching of binary
   packages.** It should be possible to easily fetch and read
   the package metadata without having to download the whole package.

4. **The file format must provide support for OpenPGP signatures.**
   Preferably, it should use standard OpenPGP message formats.

5. **The file format must allow for efficient metadata updates.**
   In particular, it should be possible to update the metadata without
   having to recompress package files.

6. **The file format should account for easy recognition both through
   filename and through contents.** Preferably, it should have distinct
   features making it possible to detect it via file(1).

7. **The file format should allow for metadata compression.**

8. **The file format should make future extensions easily possible
   without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose
filename uses the ``.gpkg.tar`` suffix. This archive contains
the following members, in order:

1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
   (optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
   (required).

4. A signature for the filesystem image archive:
   ``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
   compressed (required).

It is recommended that the relative order of the archive members is
preserved. However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
Implementations should ignore unrecognized members and preserve
them across package updates.

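As a rough illustration of the layout above, a minimal Python sketch
using only the standard ``tarfile`` module might assemble a container
as follows. The helper names (``make_inner_tar``, ``write_gpkg``),
the choice of gzip, and the use of the GNU tar volume header type flag
(``V``) to encode the volume label are assumptions made for this
example; the specification does not prescribe them, and signatures are
left out.

```python
import io
import tarfile


def make_inner_tar(top_dir, files, mode="w:gz"):
    """Build an inner archive in memory; `files` maps relative paths to bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for rel_path, data in sorted(files.items()):
            info = tarfile.TarInfo(f"{top_dir}/{rel_path}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def write_gpkg(out, package_id, metadata, image):
    """Write an uncompressed outer .tar with members in the recommended order."""
    with tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as outer:
        # Assumption: the volume label is a GNU volume header entry (type 'V').
        label = tarfile.TarInfo(f"gpkg: {package_id}")
        label.type = b"V"
        outer.addfile(label)
        for name, payload in (("metadata.tar.gz", metadata),
                              ("image.tar.gz", image)):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            outer.addfile(info, io.BytesIO(payload))
```

Because the outer archive is a plain .tar, the result can be listed
with any ordinary tar implementation.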

The volume label
----------------

The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.

Implementations should include a volume label consisting of the fixed
string ``gpkg:``, followed by a single space, followed by the full
package identifier. However, implementations must not rely on
the volume label being present or attempt to parse its value when it
is.

Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.

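A file(1)-style check of that magic string could be sketched as
follows. This assumes that the label is stored as a GNU tar volume
header (type flag ``V`` at offset 156 of the first 512-byte header
block, with the member name at offset 0); the function name
``looks_like_gpkg`` is hypothetical. Since the label is optional,
a negative result does not mean the file is not a gpkg package.

```python
def looks_like_gpkg(first_block: bytes) -> bool:
    """Heuristic check on the first 512-byte tar header block."""
    if len(first_block) < 512:
        return False
    name = first_block[0:100]        # tar header: member name field
    typeflag = first_block[156:157]  # tar header: type flag field
    return name.startswith(b"gpkg: ") and typeflag == b"V"
```

A check like this is only a hint for identification tools; a package
manager should still locate the members by name.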

The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it. The archive should be included at the beginning
of the binary package in order to make it possible to read it out of
a partially fetched binary package, and to avoid fetching the remaining
part of the package if not necessary.

The archive contains a single directory called ``metadata``. In this
directory, the individual metadata keys are stored as files. The exact
keys and metadata format are outside the scope of this specification.

The package manager may need to modify the package metadata. In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.

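To illustrate the one-file-per-key layout, a minimal sketch that loads
the keys from an already extracted metadata member could look like
this. The function name ``read_metadata`` is hypothetical, and
the sketch only handles the compression types that Python's ``tarfile``
detects on its own (none, gzip, bzip2, xz).

```python
import io
import tarfile


def read_metadata(member_bytes: bytes) -> dict:
    """Return {key: raw bytes} for regular files under ``metadata/``."""
    keys = {}
    # "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(fileobj=io.BytesIO(member_bytes), mode="r:*") as tar:
        for member in tar:
            if member.isfile() and member.name.startswith("metadata/"):
                key = member.name[len("metadata/"):]
                keys[key] = tar.extractfile(member).read()
    return keys
```

When signatures are expected, the member would have to be verified
before it is passed to a function like this one.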

The image archive
-----------------

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``. Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).

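The naming rules can be made concrete with a small sketch that
classifies container member names. ``MemberName`` and
``parse_member_name`` are hypothetical names introduced for this
example; whether a given compression suffix is actually supported is
left to the package manager.

```python
from typing import NamedTuple, Optional


class MemberName(NamedTuple):
    role: str          # "metadata" or "image"
    compression: str   # e.g. ".bz2"; empty string when uncompressed
    is_signature: bool


def parse_member_name(name: str) -> Optional[MemberName]:
    """Classify a container member by name; return None if unrecognized."""
    is_signature = name.endswith(".sig")
    if is_signature:
        name = name[: -len(".sig")]
    for role in ("metadata", "image"):
        prefix = role + ".tar"
        if name == prefix:
            return MemberName(role, "", is_signature)
        if name.startswith(prefix + ".") and len(name) > len(prefix) + 1:
            return MemberName(role, name[len(prefix):], is_signature)
    return None
```

Returning ``None`` for anything else matches the requirement that
unrecognized members are ignored rather than treated as errors.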

OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned, the
package manager must reject processing it. If the signature does not
verify, the package manager must reject processing the corresponding
archive member. In particular, it must not attempt decompressing
compressed members in those circumstances.

If the implementation needs to manipulate archive members, it must
either create a new signature or discard the existing signature.

The signatures are created as binary detached OpenPGP signature files,
with filename corresponding to the member filename with ``.sig`` suffix
appended.

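The acceptance rules above can be summarized in a short sketch.
The actual OpenPGP verification is deliberately abstracted into
a caller-supplied ``verify(data, signature)`` callback, which is
an assumption of this example, as are the names ``checked_member``
and ``SignatureError``.

```python
from typing import Callable, Mapping


class SignatureError(Exception):
    """Raised when a member must not be processed."""


def checked_member(
    members: Mapping[str, bytes],
    name: str,
    require_signatures: bool,
    verify: Callable[[bytes, bytes], bool],
) -> bytes:
    """Return the raw (still compressed) member only if policy allows it."""
    data = members[name]
    signature = members.get(name + ".sig")
    if signature is None:
        if require_signatures:
            raise SignatureError(f"{name}: signature expected but missing")
        return data
    if not verify(data, signature):
        raise SignatureError(f"{name}: signature does not verify")
    return data
```

The member is handed back, still compressed, only after the checks
pass, so no decompressor ever sees unverified input.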

Rationale
=========

Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file. Traditionally, this
has been done by using two non-conflicting file formats. However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file. Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and to store metadata as a special directory
in that structure, has been discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container. Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package as a whole should be compressed, or whether individual members
should be. The first option may seem like an obvious choice, especially
given that with a larger data set, the compression may be more
effective. However, it has a single strong disadvantage: compression
prevents random access and manipulation of the binary package members.

While for the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package, metadata updates would either require
recompressing the whole package (which could be really time-consuming
with large packages) or applying complex techniques such as splitting
the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression to
the individual package members, while leaving the container format
uncompressed. It provides fast random access to the individual members,
as well as the capability of updating them without the necessity of
recompressing other files in the container.

This also makes it possible to easily protect compressed files using
the standard OpenPGP detached signature format. All this combined,
the package manager may perform a partial fetch of a binary package,
verify the signature of its metadata member and process it without
having to fetch the potentially large image part.

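To show what this buys in practice, here is a minimal sketch of
a metadata update that copies the other members verbatim. The function
name ``replace_metadata`` is hypothetical, and the sketch takes
the simple route of discarding the old metadata signature rather than
creating a new one.

```python
import io
import tarfile


def replace_metadata(src, dst, new_name: str, new_payload: bytes) -> None:
    """Copy the outer archive from src to dst, swapping the metadata member."""
    with tarfile.open(fileobj=src, mode="r:") as old, \
            tarfile.open(fileobj=dst, mode="w", format=tarfile.GNU_FORMAT) as new:
        for member in old:
            if member.name.startswith("metadata.tar"):
                if member.name.endswith(".sig"):
                    continue  # stale signature: discard rather than keep
                info = tarfile.TarInfo(new_name)
                info.size = len(new_payload)
                new.addfile(info, io.BytesIO(new_payload))
            elif member.isfile():
                # Stored bytes are copied as-is, so nothing is recompressed.
                new.addfile(member, old.extractfile(member))
            else:
                new.addfile(member)  # header-only entries, e.g. a volume label
```

The image member and its signature pass through untouched, which would
not be possible if the container as a whole were compressed.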

Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.

Firstly, the ZIP format has been proposed as the only commonly supported
format that supports adding files from stdin (i.e. making it possible
to pipe the inner archives straight into the container without using
temporary files). However, this format has been clearly rejected
as both not being present in the system set, and being trailer-based
and therefore unusable without having to fetch the whole file.

Secondly, the ar and cpio formats were considered. The former is used
by binary packages of Debian and its derivatives; the latter is used
by Red Hat derivatives. Both formats have the advantage of having less
historical baggage than .tar, and having less overhead. However, both
are also rather obscure (especially given that ar is actually provided
by GNU binutils rather than as a stand-alone archiver), considered
obsolete by POSIX, and both have file size limitations smaller than
.tar.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
a significant advantage over .tar. Therefore, .tar has also been used
as the outer package format, even though it has larger overhead than
other formats (mostly due to padding).


Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By placing the metadata archive before the image archive, the package
manager may stop fetching after reading it and save bandwidth and/or
space.

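A sequential reader that stops early could be sketched as follows,
using the forward-only stream mode (``"r|"``) of Python's ``tarfile``.
The function name ``read_metadata_member`` is hypothetical, and
``stream`` stands for any object with a ``read()`` method, such as
an HTTP response body; signature handling is omitted.

```python
import tarfile


def read_metadata_member(stream):
    """Read sequentially from stream; stop once the metadata member is seen."""
    # "r|" opens a non-seekable, forward-only tar stream.
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            name = member.name
            if name.startswith("metadata.tar") and not name.endswith(".sig"):
                return name, outer.extractfile(member).read()
    return None
```

With the recommended ordering, the function returns after consuming
only the start of the stream, and the caller can then close
the connection without fetching the image archive.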

Detached OpenPGP signatures
---------------------------

The use of detached OpenPGP signatures is to provide authenticity checks
for binary packages. Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them. Covering the compressed archives helps to prevent zip bomb
attacks. Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages. It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause
them to be installed to the filesystem by old Portage versions. Working
around this would require some kind of awful hack that would run
counter to the goal of using a simple and transparent package format.


Reference Implementation
========================

A proof-of-concept implementation of a binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
   packages
   (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
   written in C
   (https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#PBZIP2] PBZIP2 - a parallel implementation of the bzip2
   block-sorting file compressor
   (https://launchpad.net/pbzip2)

.. [#ZSTD] Zstandard - Real-time data compression algorithm
   (https://facebook.github.io/zstd/)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
   to gpkg binpkg format
   (https://github.com/mgorny/xpak2gpkg)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.

--
Best regards,
Michał Górny