Hi,

On Sat, 2018-11-17 at 12:21 +0100, Michał Górny wrote:
> Here's a pre-GLEP draft based on the earlier discussion on gentoo-
> portage-dev mailing list. The specification uses GLEP form as it
> provides for cleanly specifying the motivation and rationale.

Here's the third iteration. Changes since r1:
- removed unnecessary OpenPGP details, made them out of scope,
- added explicit section on (lack of) versioning and how to recognize
  packages and their compatibility,
- explained why squashfs is a no-go.


---
GLEP: 9999
Title: Gentoo binary package container format
Author: Michał Górny <mgorny@g.o>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-20
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained. Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed. The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: a header-oriented compressed
.tar format (used to hold package files) and a trailer-oriented custom
XPAK format (used to hold metadata) [#MAN-XPAK]_. The format has already
been extended incompatibly twice.

The first time, support for storing multiple successive builds of
a binary package for a single ebuild version was added. This feature
relies on appending an additional hyphen followed by an integer to
the package filename. It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was added.
When a format other than bzip2 is used, the ``.tbz2`` suffix is replaced
by ``.xpak`` and Portage relies on magic bytes to detect the compression
used. For backwards compatibility, Portage still defaults to bzip2;
the compression program can be switched using the ``BINPKG_COMPRESS``
configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies. In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
   While this might seem unnecessary, it makes it easier for the user
   to transfer binary packages without having to be concerned about
   finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
   tarballs, most of the time.** With the notable exceptions of
   historical versions of pbzip2 and the recent zstd compressor,
   tbz2/XPAK packages can be extracted using a regular tar utility with
   a compressor implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
   without decompressing package contents.** This includes
   the possibility of rewriting it (e.g. as a result of package moves)
   without the necessity of repacking the files.


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
   of binary-encoded file offsets and field lengths.** As such, it is
   non-trivial to read or edit without specialized tools. Such tools
   are currently implemented separately from the package manager,
   as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature of
   ignoring trailing garbage in compressed files.** While this is
   implemented consistently in most of the compressors, this feature
   is not really a part of any specification but rather traditional
   behavior. Given that the original reasons for this no longer apply,
   new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave. This impacts the following scenarios:

A. **Using binary packages for system recovery.** In case of serious
   breakage, it is really preferable that the format depends on as few
   tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
   manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
   manager authors.** A real-life example of this is working around
   broken ``pkg_*`` phases which prevent the package from being
   installed.


OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in which the current format could
be extended to support OpenPGP signatures, and each of them has its own
distinct problem:

1. **Adding a detached signature.** This option is non-intrusive but
   causes the format to no longer be contained in a single file.

2. **Wrapping the package in OpenPGP message format.** This would use
   a standard format and make verification and unpacking relatively
   easy. However, it would break backwards compatibility and add
   an explicit dependency on an OpenPGP implementation in order to
   unpack the package.

3. **Adding an OpenPGP signature as an extra XPAK member.** This is
   the clever solution. It implies strengthening the dependency
   on custom tooling, which would now additionally be necessary to
   extract the signature and reconstruct the original file for
   verification.


Goals for a new container format
--------------------------------

All of the above considered, the new format should combine
the advantages of the existing format and at the same time address its
deficiencies whenever possible. Furthermore, since a format replacement
is taking place, it is worthwhile to consider additional goals that
could be satisfied with little change.

The following obligatory goals have been set for a replacement format:

1. **The packages must remain contained in a single file.** As a matter
   of user convenience, it should be possible to transfer binary
   packages without having to use multiple files, and to install them
   from any location.

2. **The file format must be entirely based on common file formats,
   respecting best practices, with as little customization as necessary
   to satisfy the requirements.** The format should be transparent
   enough to let the user inspect and manipulate it without special
   tooling or detailed knowledge.

3. **The file format must provide support for OpenPGP signatures.**
   Preferably, it should use standard OpenPGP message formats.

4. **The file format must allow for efficient metadata updates.**
   In particular, it should be possible to update the metadata without
   having to recompress package files.

Additionally, the following optional goals have been noted:

A. **The file format should allow for easy recognition both through
   filename and through contents.** Preferably, it should have distinct
   features making it possible to detect it via file(1).

B. **The file format should provide for partial fetching of binary
   packages.** It should be possible to easily fetch and read
   the package metadata without having to download the whole package.

C. **The file format should allow for metadata compression.**

D. **The file format should make future extensions easily possible
   without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose
filename should use the ``.gpkg.tar`` suffix. This archive contains
the following members, in order:

1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
   (optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
   (required).

4. A signature for the filesystem image archive:
   ``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
   compressed (required).

It is recommended that the relative order of the archive members is
preserved. However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
The implementations should ignore unrecognized members and preserve
them across package updates.
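
To illustrate the layout above, a container can be assembled with nothing
but a stock tar library. The following is a minimal sketch in Python and
not part of the specification: the helper names (``make_inner_tar``,
``make_gpkg``), the package identifier and the sample file contents are
made up for the example, and both inner archives are left uncompressed
and unsigned for brevity. It assumes the volume label is written as a GNU
tar volume header (type flag ``V``).

```python
import io
import tarfile


def make_inner_tar(top_dir: str, files: dict) -> bytes:
    """Build an uncompressed inner .tar holding `files` under `top_dir`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(f"{top_dir}/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def make_gpkg(package_id: str, metadata: bytes, image: bytes) -> bytes:
    """Assemble the outer container with members in the recommended order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Optional volume label: a header-only member with type flag "V".
        label = tarfile.TarInfo(f"gpkg: {package_id}")
        label.type = b"V"
        tar.addfile(label)
        # Metadata first, image last; signatures are omitted in this sketch.
        for name, data in (("metadata.tar", metadata), ("image.tar", image)):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical sample package, for illustration only.
metadata = make_inner_tar("metadata", {"CATEGORY": b"app-misc\n"})
image = make_inner_tar("image", {"usr/bin/hello": b"#!/bin/sh\necho hello\n"})
gpkg = make_gpkg("app-misc/hello-1.0", metadata, image)

with tarfile.open(fileobj=io.BytesIO(gpkg), mode="r:") as tar:
    print(tar.getnames())
```

The resulting bytes can be written to a ``.gpkg.tar`` file and listed
with a regular tar utility.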


The volume label
----------------

The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.

The implementations should include a volume label consisting of
the fixed string ``gpkg:``, followed by a single space, followed by
the full package identifier. However, the implementations must not rely
on the volume label being present or attempt to parse its value when it
is.

Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.
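
The fixed location mentioned above can be checked by looking at the first
512-byte header block only. The sketch below shows what a file(1)-like
tool might do; it is an illustration, not something a package manager
should rely on, since the label is optional. It assumes the label is
stored as a GNU tar volume header (type flag ``V`` at byte offset 156,
member name in the first 100 bytes); the function names are hypothetical.

```python
import io
import tarfile

BLOCK = 512             # size of a tar header block
TYPEFLAG_OFFSET = 156   # position of the type flag byte within the header


def read_volume_label(header: bytes):
    """Return the label text if `header` is a GNU volume label, else None."""
    if len(header) < BLOCK:
        return None
    if header[TYPEFLAG_OFFSET:TYPEFLAG_OFFSET + 1] != b"V":
        return None
    # The member name occupies the first 100 bytes, NUL-padded.
    return header[:100].split(b"\0", 1)[0].decode("utf-8", "replace")


def looks_like_labelled_gpkg(header: bytes) -> bool:
    """Check for the 'gpkg: ' magic string in the first header block."""
    label = read_volume_label(header)
    return label is not None and label.startswith("gpkg: ")


# Build a tiny labelled archive in memory to exercise the check.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
    label = tarfile.TarInfo("gpkg: app-misc/hello-1.0")  # made-up identifier
    label.type = b"V"
    tar.addfile(label)

first_block = buf.getvalue()[:BLOCK]
print(read_volume_label(first_block), looks_like_labelled_gpkg(first_block))
```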


The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it. The archive should be included at the beginning
of the binary package in order to make it possible to read it out of
a partially fetched binary package, and to avoid fetching the remaining
part of the package if not necessary.

The archive contains a single directory called ``metadata``. In this
directory, the individual metadata keys are stored as files. The exact
keys and metadata format are outside the scope of this specification.

The package manager may need to modify the package metadata. In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.
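
A metadata update that leaves the other members alone could look roughly
like the sketch below. It rewrites the outer container and copies every
other member byte for byte, so a compressed image archive is never
decompressed. The function names are hypothetical, and dropping the old
metadata signature is an assumption of this example (a stale signature
would no longer verify); the specification does not prescribe it.

```python
import io
import tarfile


def replace_metadata(gpkg: bytes, new_name: str, new_data: bytes) -> bytes:
    """Return a new container in which only the metadata archive differs."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(gpkg), mode="r:") as src, \
            tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as dst:
        for member in src:
            if member.name.startswith("metadata.tar"):
                if member.name.endswith(".sig"):
                    continue  # assumption: drop the now-stale signature
                info = tarfile.TarInfo(new_name)
                info.size = len(new_data)
                dst.addfile(info, io.BytesIO(new_data))
            else:
                # Copy the member's stored bytes without interpreting them.
                dst.addfile(member, src.extractfile(member))
    return out.getvalue()


def pack(members: dict) -> bytes:
    """Helper for the demo: pack name -> bytes into an uncompressed tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Placeholder payloads; real members would be actual (compressed) tarballs.
old = pack({"metadata.tar": b"old metadata", "image.tar.xz": b"opaque image bytes"})
new = replace_metadata(old, "metadata.tar", b"new metadata")
```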


The image archive
-----------------

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``. Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).
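
The naming rule above means the member kind and its compression can be
read off the filename, independently for each member. A small sketch of
such a lookup follows; the function name and the suffix table are
assumptions made for the example, since the list of supported compression
types is deliberately left to the package manager.

```python
# Example suffix table only; the real list is implementation-defined.
KNOWN_SUFFIXES = {
    "": "none",
    ".gz": "gzip",
    ".bz2": "bzip2",
    ".xz": "xz",
    ".zst": "zstd",
}


def parse_member_name(name: str):
    """Split a member name into (kind, compression, is_signature).

    Returns None for members this sketch does not recognize, which
    a real implementation should ignore and preserve.
    """
    is_signature = name.endswith(".sig")
    if is_signature:
        name = name[:-len(".sig")]
    for kind in ("metadata", "image"):
        prefix = kind + ".tar"
        if name.startswith(prefix):
            suffix = name[len(prefix):]
            if suffix in KNOWN_SUFFIXES:
                return kind, KNOWN_SUFFIXES[suffix], is_signature
    return None


for member in ("metadata.tar", "image.tar.bz2", "image.tar.bz2.sig", "notes.txt"):
    print(member, "->", parse_member_name(member))
```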


OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned, the
package manager must reject processing it. If the signature does not
verify, the package manager must reject processing the corresponding
archive member. In particular, it must not attempt decompressing
compressed members in those circumstances.

The signatures are created as binary detached OpenPGP signature files,
with a filename corresponding to the member filename with the ``.sig``
suffix appended.

The exact details regarding creating and verifying signatures, as well
as maintaining and distributing keys are outside the scope of this
specification.
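
The acceptance rules above can be expressed as a small gate that runs
before any decompression. This is only a sketch: ``check_member``,
``SignatureError`` and the ``verify`` callback are hypothetical names, and
the actual OpenPGP verification is out of scope here, so the demo plugs in
a toy stand-in that has no cryptographic meaning.

```python
from typing import Callable, Optional


class SignatureError(Exception):
    """Raised when a member must not be processed."""


def check_member(data: bytes, signature: Optional[bytes], *,
                 require_signatures: bool,
                 verify: Callable[[bytes, bytes], bool]) -> bytes:
    """Return `data` only if it may be processed (e.g. decompressed)."""
    if signature is None:
        if require_signatures:
            raise SignatureError("member is unsigned but signatures are expected")
        return data
    # A present signature is always checked, even when not required.
    if not verify(data, signature):
        raise SignatureError("signature does not verify")
    return data


def toy_verify(data: bytes, signature: bytes) -> bool:
    # Stand-in for a real OpenPGP check; NOT cryptographically meaningful.
    return signature == b"signed:" + data


ok = check_member(b"payload", b"signed:payload",
                  require_signatures=True, verify=toy_verify)
print(ok)
```

Only the bytes returned by such a gate would then be handed to
the decompressor.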


Versioning and format recognition
---------------------------------

The container format does not provide an explicit magic identifier
or version number. The implementations should recognize binary packages
through recognizing the uncompressed .tar archive format,
and investigating its contents. Generally, the presence of the metadata
archive should be sufficient to assume that the package conforms to this
specification.

If the package format needs to be changed in an incompatible way, it
should be done in such a way as to make the above check fail.
For example, the metadata archive can be renamed to
``metadata-r1.tar*``.
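
The recognition check described above might be sketched as follows. The
function name ``is_gpkg`` is made up for the example; the check opens the
file strictly as an uncompressed tar and looks for a ``metadata.tar*``
member, so a hypothetical future ``metadata-r1.tar*`` package fails it as
intended.

```python
import io
import tarfile


def is_gpkg(fileobj) -> bool:
    """Return True if `fileobj` is an uncompressed tar with a metadata archive."""
    try:
        # "r:" disables transparent decompression on purpose.
        with tarfile.open(fileobj=fileobj, mode="r:") as tar:
            return any(m.name.startswith("metadata.tar") for m in tar)
    except tarfile.ReadError:
        return False


def pack(members: dict) -> bytes:
    """Helper for the demo: pack name -> bytes into an uncompressed tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


current = pack({"metadata.tar.gz": b"...", "image.tar.gz": b"..."})
future = pack({"metadata-r1.tar.gz": b"...", "image.tar.gz": b"..."})
print(is_gpkg(io.BytesIO(current)), is_gpkg(io.BytesIO(future)))
```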


Rationale
=========

Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file. Traditionally, this
has been done by using two non-conflicting file formats. However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file. Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and store metadata as a special directory in that
structure has been discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container. Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package as a whole should be compressed vs. compressing individual
members. The first option may seem an obvious choice, especially
given that with a larger data set, the compression may proceed more
effectively. However, it has a single strong disadvantage: compression
prevents random access and manipulation of the binary package members.

While for the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package, metadata updates would either require
recompressing the whole package (which could be really time-consuming
with large packages) or applying complex techniques such as splitting
the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression to
the individual package members, while leaving the container format
uncompressed. It provides fast random access to the individual members,
as well as the capability of updating them without the necessity of
recompressing other files in the container.

This also makes it possible to easily protect compressed files using
the standard OpenPGP detached signature format. All this combined,
the package manager may perform a partial fetch of a binary package,
verify the signature of its metadata member and process it without
having to fetch the potentially large image part.


Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.

Firstly, the ZIP format has been proposed as the only commonly supported
format supporting adding files from stdin (i.e. making it possible to
pipe the inner archives straight into the container without using
temporary files). However, this format has been clearly rejected
both for not being present in the system set and for being
trailer-based, and therefore unusable without fetching the whole file.

Secondly, the ar and cpio formats were considered. The former is used
by Debian and its derivative binary packages; the latter is used by Red
Hat derivatives. Both formats have the advantage of having less
historical baggage than .tar, and having less overhead. However, both
are also rather obscure (especially given that ar is actually provided
by GNU binutils rather than as a stand-alone archiver), are considered
obsolete by POSIX, and have smaller file size limits than .tar.

Thirdly, SquashFS was another interesting option. Its main advantage is
transparent compression support and the ability to mount as
a filesystem. However, it carries significant implementation complexity,
including mount management and the necessity of a fallback to
unsquashfs. Since the image needs to be writable for
the pre-installation manipulations, using it via a mount would
additionally require some kind of overlay filesystem. Using it as
the top-level format has no real gain over a pipeline with tar, and is
certainly less portable. Therefore, there does not seem to be a benefit
in using SquashFS.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
a significant advantage over .tar. Therefore, .tar has also been used
as the outer package format, even though it has larger overhead than
other formats (mostly due to padding).


Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By storing the metadata archive before the image archive, the package
manager may stop fetching after reading it and save bandwidth and/or
space.
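
The early stop can be demonstrated by reading the container as a forward
only stream and returning as soon as the metadata archive has been
processed. In the sketch below, ``CountingReader`` stands in for a network
download and merely counts the bytes pulled from it; all names, the sample
key and the 1 MiB dummy image are assumptions made for the illustration.

```python
import io
import tarfile


class CountingReader:
    """Wrap a file object and count how many bytes were actually read."""

    def __init__(self, raw):
        self.raw = raw
        self.consumed = 0

    def read(self, size=-1):
        data = self.raw.read(size)
        self.consumed += len(data)
        return data


def read_metadata(stream) -> dict:
    """Return {key: value} from the first metadata archive in the stream."""
    # "r|" reads strictly sequentially, as a partial fetch would.
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            name = member.name
            if name.startswith("metadata.tar") and not name.endswith(".sig"):
                blob = outer.extractfile(member).read()
                with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as inner:
                    return {
                        m.name[len("metadata/"):]: inner.extractfile(m).read()
                        for m in inner
                        if m.isfile() and m.name.startswith("metadata/")
                    }
    raise ValueError("no metadata archive found")


def pack(members: dict) -> bytes:
    """Helper for the demo: pack name -> bytes into an uncompressed tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


meta = pack({"metadata/CATEGORY": b"app-misc\n"})
gpkg = pack({"metadata.tar": meta, "image.tar": bytes(1 << 20)})  # dummy image

reader = CountingReader(io.BytesIO(gpkg))
keys = read_metadata(reader)
print(keys, reader.consumed, "of", len(gpkg), "bytes read")
```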


Detached OpenPGP signatures
---------------------------

The use of detached OpenPGP signatures is to provide authenticity checks
for binary packages. Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them. Covering the compressed archives helps to prevent zip bomb
attacks. Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.


Format versioning
-----------------

It has been requested that an explicit version identifier be added
to the binary package containers in order to account for possible
incompatible changes in the format. However, such an explicit notion
does not seem necessary.

Firstly, the format is meant to be extensible while preserving backwards
compatibility. If a backwards-incompatible change needs to be done,
and that change does not make the packages implicitly incompatible
by design, the incompatibility can be easily forced e.g. via renaming
the metadata archive to ``metadata-r1.tar*``.

Secondly, the only really clean place for such a version would be
an additional file which would unnecessarily grow the uncompressed
tarball. The label is non-obligatory and user-oriented, and as such
cannot be used to carry information significant to the package manager.

Finally, such a version number can be added to the metadata archive,
which needs to be processed by the package manager to extract all
significant binary package information.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages. It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause
them to be installed to the filesystem by old Portage versions. Working
around this would require some kind of awful hacks that would run
counter to the goal of using a simple and transparent package format.


Reference Implementation
========================

The proof-of-concept implementation of a binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
   packages
   (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
   written in C
   (https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
   to gpkg binpkg format
   (https://github.com/mgorny/xpak2gpkg)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.

--
Best regards,
Michał Górny