Hi,

On Sat, 2018-11-17 at 12:21 +0100, Michał Górny wrote:
> Here's a pre-GLEP draft based on the earlier discussion on the
> gentoo-portage-dev mailing list. The specification uses GLEP form as it
> provides for cleanly specifying the motivation and rationale.

Changes in -r1: took into account the feedback and restructured
the motivation into pointing out advantages of the existing format,
and focusing on the two real issues of non-transparency and OpenPGP
implementation deficiencies. Also added a section on why there's no
explicit version number.

> Also available via HTTPS:
>
> rst: https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
> html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html
>

---
GLEP: 9999
Title: Gentoo binary package container format
Author: Michał Górny <mgorny@g.o>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-16
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained. Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed. The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: a header-oriented compressed
.tar format (used to hold package files) and a trailer-oriented custom
XPAK format (used to hold metadata) [#MAN-XPAK]_. The format has already
been extended incompatibly twice.

The first time, support for storing multiple successive builds of
a binary package for a single ebuild version was added. This feature
relies on appending a hyphen followed by an integer to the package
filename. It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was
added. When a format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used. For backwards compatibility, Portage still
defaults to using bzip2; the compression program can be switched using
the ``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies. In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
   While this might seem unnecessary, it makes it easier for the user
   to transfer binary packages without having to be concerned about
   finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
   tarballs, most of the time.** With notable exceptions of historical
   versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
   can be extracted using regular tar utility with a compressor
   implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
   without decompressing package contents.** This includes
   the possibility of rewriting it (e.g. as a result of package moves)
   without the necessity of repacking the files.


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
   of binary-encoded file offsets and field lengths.** As such, it is
   non-trivial to read or edit without specialized tools. Such tools
   are currently implemented separately from the package manager,
   as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature of
   ignoring trailing garbage in compressed files.** While this is
   implemented consistently in most of the compressors, this feature
   is not really a part of the specification but rather traditional
   behavior. Given that the original reasons for this no longer apply,
   new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave. This impacts the following scenarios:

A. **Using binary packages for system recovery.** In case of serious
   breakage, it is really preferable that the format depends on as few
   tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
   manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
   manager authors.** A real-life example of this is working around
   broken ``pkg_*`` phases which prevent the package from being
   installed.


OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in which the current format could
be extended to support OpenPGP signatures, and each of them has its own
distinct problem:

1. **Adding a detached signature.** This option is non-intrusive but
   causes the format to no longer be contained in a single file.

2. **Wrapping the package in OpenPGP message format.** This would use
   a standard format and make verification and unpacking relatively
   easy. However, it would break backwards compatibility and add
   an explicit dependency on an OpenPGP implementation in order to
   unpack the package.

3. **Adding OpenPGP signature as extra XPAK member.** This is
   the clever solution. However, it strengthens the dependency
   on custom tooling, which becomes additionally necessary to extract
   the signature and reconstruct the original file for verification.


Goals for a new container format
--------------------------------

All of the above considered, the new format should combine
the advantages of the existing format and at the same time address its
deficiencies whenever possible. Furthermore, since a format replacement
is taking place, it is worthwhile to consider additional goals that
could be satisfied with little change.

The following obligatory goals have been set for a replacement format:

1. **The packages must remain contained in a single file.** As a matter
   of user convenience, it should be possible to transfer binary
   packages without having to use multiple files, and to install them
   from any location.

2. **The file format must be entirely based on common file formats,
   respecting best practices, with as little customization as necessary
   to satisfy the requirements.** The format should be transparent
   enough to let the user inspect and manipulate it without special
   tooling or detailed knowledge.

3. **The file format must provide support for OpenPGP signatures.**
   Preferably, it should use standard OpenPGP message formats.

4. **The file format must allow for efficient metadata updates.**
   In particular, it should be possible to update the metadata without
   having to recompress package files.

Additionally, the following optional goals have been noted:

A. **The file format should account for easy recognition both through
   filename and through contents.** Preferably, it should have distinct
   features making it possible to detect it via file(1).

B. **The file format should provide for partial fetching of binary
   packages.** It should be possible to easily fetch and read
   the package metadata without having to download the whole package.

C. **The file format should allow for metadata compression.**

D. **The file format should make future extensions easily possible
   without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose
filename uses the ``.gpkg.tar`` suffix. This archive contains
the following members, in order:

1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
   (optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
   (required).

4. A signature for the filesystem image archive:
   ``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
   compressed (required).

It is recommended that the relative order of the archive members is
preserved. However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
The implementations should ignore unrecognized members and preserve
them across package updates.

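The member layout above can be sketched with Python's standard
``tarfile`` module. This is only an illustration, not the reference
implementation: it assumes that the volume label is a GNU tar volume
header (type flag ``V``, as written by ``tar --label``), and the helper
names ``add_bytes``, ``inner_tar`` and ``build_gpkg`` as well as
the sample package are hypothetical. Signatures are left out.

```python
import io
import tarfile


def add_bytes(tar, name, data):
    """Append one regular-file member holding `data` to an open archive."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def inner_tar(top_dir, files, comp=""):
    """Build a nested archive whose members live under `top_dir`/.

    `comp` is a tarfile compression name: "", "gz", "bz2" or "xz".
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:" + comp) as tar:
        for path, data in files.items():
            add_bytes(tar, f"{top_dir}/{path}", data)
    return buf.getvalue()


def build_gpkg(package_id, metadata, image, comp="bz2"):
    """Assemble the uncompressed outer container in the recommended order."""
    suffix = f".{comp}" if comp else ""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Assumption: the label is a header-only member with type flag "V".
        label = tarfile.TarInfo(f"gpkg: {package_id}")
        label.type = b"V"
        tar.addfile(label)
        add_bytes(tar, f"metadata.tar{suffix}",
                  inner_tar("metadata", metadata, comp))
        add_bytes(tar, f"image.tar{suffix}",
                  inner_tar("image", image, comp))
    return out.getvalue()


package = build_gpkg(
    "app-misc/example-1.0",
    metadata={"SLOT": b"0\n"},
    image={"usr/bin/example": b"#!/bin/sh\n"},
)
with tarfile.open(fileobj=io.BytesIO(package)) as outer:
    member_names = outer.getnames()
print(member_names)
```

Only the nested archives are compressed here; the outer archive is
written with plain ``"w"`` mode so that its members stay individually
addressable.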

The volume label
----------------

The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.

The implementations should include a volume label consisting of
the fixed string ``gpkg:``, followed by a single space, followed by
the full package identifier. However, the implementations must not rely
on the volume label being present or attempt to parse its value when it
is.

Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.

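A file(1)-style check for that magic string might look as follows.
The sketch assumes the label is stored as a GNU tar volume header, so
that the label text sits at offset 0 of the first 512-byte header block
and the type flag ``V`` at offset 156; ``looks_like_gpkg`` and
``first_block`` are hypothetical names. Since the label is optional,
a negative result only means "not recognizable by label".

```python
import io
import tarfile

BLOCK = 512            # size of one tar header block
TYPEFLAG_OFFSET = 156  # position of the type flag inside a header block


def looks_like_gpkg(head):
    """Return True if `head` starts with a ``gpkg: `` volume label header."""
    if len(head) < BLOCK:
        return False
    is_volume_header = head[TYPEFLAG_OFFSET:TYPEFLAG_OFFSET + 1] == b"V"
    return is_volume_header and head.startswith(b"gpkg: ")


def first_block(name, typeflag):
    """Write a one-member archive and return its first header block."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.type = typeflag
        tar.addfile(info)
    return buf.getvalue()[:BLOCK]


labelled = first_block("gpkg: app-misc/example-1.0", b"V")
plain = first_block("README", tarfile.REGTYPE)
print(looks_like_gpkg(labelled), looks_like_gpkg(plain))
```

This kind of check suits identification tools only; a package manager
following the specification would not depend on it.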

The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it. The archive should be included at the beginning
of the binary package in order to make it possible to read it out of
a partially fetched binary package, and to avoid fetching the remaining
part of the package if not necessary.

The archive contains a single directory called ``metadata``. In this
directory, the individual metadata keys are stored as files. The exact
keys and metadata format are outside the scope of this specification.

The package manager may need to modify the package metadata. In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.

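Reading the metadata out of a partially fetched package can be sketched
as below. The outer container is opened in ``tarfile``'s sequential
``"r|"`` mode, so nothing beyond the metadata member is requested from
the input. The function ``read_metadata``, the helpers and the sample
keys are hypothetical, and signature verification is omitted.

```python
import io
import tarfile


def add_bytes(tar, name, data):
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def make_tar(files, mode="w"):
    """Pack {name: bytes} into a tar archive and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for name, data in files.items():
            add_bytes(tar, name, data)
    return buf.getvalue()


def read_metadata(stream):
    """Return {key: bytes} from the first metadata archive in `stream`."""
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            name = member.name
            if name.startswith("metadata.tar") and not name.endswith(".sig"):
                blob = outer.extractfile(member).read()
                break
        else:
            raise ValueError("no metadata archive found")
    keys = {}
    # "r:*" lets tarfile detect the nested archive's compression itself.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as inner:
        for entry in inner:
            if entry.isfile() and entry.name.startswith("metadata/"):
                key = entry.name[len("metadata/"):]
                keys[key] = inner.extractfile(entry).read()
    return keys


meta_blob = make_tar(
    {"metadata/SLOT": b"0\n", "metadata/CATEGORY": b"app-misc\n"},
    mode="w:gz",
)
image_blob = make_tar({"image/usr/bin/example": b"#!/bin/sh\n"}, mode="w:gz")
package = make_tar({"metadata.tar.gz": meta_blob, "image.tar.gz": image_blob})

# Simulate a partial fetch: one header block plus the padded metadata member.
prefix_len = 512 + -(-len(meta_blob) // 512) * 512
keys = read_metadata(io.BytesIO(package[:prefix_len]))
print(sorted(keys))
```

The truncated input never contains the image archive, which is the case
the ordering recommendation is meant to enable.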

The image archive
-----------------

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``. Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).

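The suffix convention can be sketched as a small lookup. The table of
suffixes below is purely illustrative, since the specification leaves
the supported compression types to the package manager, and
``split_member_name`` is a hypothetical helper that handles archive
members only (not their ``.sig`` files).

```python
# Illustrative only: the real list is chosen by the package manager.
KNOWN_SUFFIXES = {
    ".gz": "gzip",
    ".bz2": "bzip2",
    ".xz": "xz",
    ".zst": "zstd",
}


def split_member_name(name):
    """Split e.g. "image.tar.bz2" into ("image", "bzip2").

    The second item is None for an uncompressed member.
    """
    base, sep, rest = name.partition(".tar")
    if not base or not sep:
        raise ValueError(f"not an archive member: {name!r}")
    if rest == "":
        return base, None
    if rest in KNOWN_SUFFIXES:
        return base, KNOWN_SUFFIXES[rest]
    raise ValueError(f"unsupported compression suffix: {rest!r}")


# One package may mix compression types across its members.
members = ["metadata.tar", "image.tar.bz2"]
parsed = [split_member_name(m) for m in members]
print(parsed)
```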

OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned, the
package manager must reject processing it. If the signature does not
verify, the package manager must reject processing the corresponding
archive member. In particular, it must not attempt decompressing
compressed members in those circumstances.

If the implementation needs to manipulate archive members, it must
either create a new signature or discard the existing signature.

The signatures are created as binary detached OpenPGP signature files,
with filename corresponding to the member filename with ``.sig`` suffix
appended.

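The acceptance rules above can be sketched as a pure decision function.
The actual OpenPGP check is deliberately left as a caller-supplied
``verify`` callable, because the specification does not name a tool or
API for it; ``check_members``, the verdict strings and the stand-in
verifier are all hypothetical.

```python
def check_members(names, require_signed, verify):
    """Map each archive member to "ok", "unsigned" or "bad-signature".

    names          -- member names found in the container
    require_signed -- user policy: are signatures expected?
    verify         -- callable(member_name, signature_name) -> bool,
                      a stand-in for a real OpenPGP verification
    Only members marked "ok" may be decompressed or processed.
    """
    present = set(names)
    verdicts = {}
    for name in names:
        if name.endswith(".sig") or ".tar" not in name:
            continue  # skip signatures and non-archive members (the label)
        signature = name + ".sig"
        if signature in present:
            # A signature that is present must verify, whatever the policy.
            verdicts[name] = "ok" if verify(name, signature) else "bad-signature"
        elif require_signed:
            verdicts[name] = "unsigned"
        else:
            verdicts[name] = "ok"
    return verdicts


def fake_verify(member, signature):
    """Pretend that only the metadata signature is valid."""
    return member.startswith("metadata")


names = [
    "gpkg: app-misc/example-1.0",
    "metadata.tar.bz2.sig",
    "metadata.tar.bz2",
    "image.tar.bz2",
]
strict = check_members(names, True, fake_verify)
lenient = check_members(names, False, fake_verify)
print(strict, lenient)
```

Whether an unsigned member should cause the whole package to be
rejected, rather than only that member, is a policy choice this sketch
leaves to the caller.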

Rationale
=========

Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file. Traditionally, this
has been done by using two non-conflicting file formats. However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file. Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and store metadata as a special directory in that
structure has been discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container. Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package as a whole should be compressed vs. compressing individual
members. The first option may seem like an obvious choice, especially
given that with a larger data set, the compression may proceed more
effectively. However, it has a single strong disadvantage: compression
prevents random access and manipulation of the binary package members.

While for the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package, metadata updates would either require
recompressing the whole package (which could be really time consuming
with large packages) or applying complex techniques such as splitting
the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression to
the individual package members, while leaving the container format
uncompressed. It provides fast random access to the individual members,
as well as the capability of updating them without the necessity of
recompressing other files in the container.

This also makes it possible to easily protect compressed files using
the standard OpenPGP detached signature format. All this combined,
the package manager may perform a partial fetch of a binary package,
verify the signature of its metadata member and process it without
having to fetch the potentially large image part.

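The benefit for metadata updates can be sketched as follows: the outer
container is rewritten, the metadata member is swapped and every other
member is copied byte for byte, so the image archive is never
decompressed. ``replace_metadata`` and the helpers are hypothetical
names, the member payloads are opaque placeholder bytes standing in for
real nested archives, and the stale signature is simply discarded
rather than recreated.

```python
import io
import tarfile


def add_bytes(tar, name, data):
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def make_package(members):
    """Pack {name: bytes} into an uncompressed outer container."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name, data in members.items():
            add_bytes(tar, name, data)
    return buf.getvalue()


def replace_metadata(package, new_name, new_blob):
    """Return a new container with the metadata archive swapped out."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(package)) as src, \
         tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as dst:
        for member in src:
            if member.name.startswith("metadata.tar"):
                if not member.name.endswith(".sig"):
                    add_bytes(dst, new_name, new_blob)
                continue  # the old archive and its signature are dropped
            # Header-only members (e.g. a volume label) carry no payload.
            payload = src.extractfile(member) if member.isreg() else None
            dst.addfile(member, payload)
    return out.getvalue()


def member_bytes(package, name):
    with tarfile.open(fileobj=io.BytesIO(package)) as tar:
        return tar.extractfile(name).read()


old = make_package({
    "metadata.tar.gz": b"old metadata archive",
    "metadata.tar.gz.sig": b"signature of the old archive",
    "image.tar.gz": b"opaque compressed image",
})
new = replace_metadata(old, "metadata.tar.gz", b"new metadata archive")
with tarfile.open(fileobj=io.BytesIO(new)) as tar:
    new_names = tar.getnames()
print(new_names)
```

With whole-package compression the same update would require
decompressing and recompressing the image data as well.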

Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.

Firstly, the ZIP format has been proposed as the only commonly supported
format supporting adding files from stdin (i.e. making it possible to
pipe the inner archives straight into the container without using
temporary files). However, this format has been clearly rejected
as both not being present in the system set, and being trailer-based
and therefore unusable without having to fetch the whole file.

Secondly, the ar and cpio formats were considered. The former is used
by Debian and its derivative binary packages; the latter is used by Red
Hat derivatives. Both formats have the advantage of having less
historical baggage than .tar, and having less overhead. However, both
are also rather obscure (especially given that ar is actually provided
by GNU binutils rather than as a stand-alone archiver), considered
obsolete by POSIX and both have file size limitations smaller than .tar.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
a significant advantage over .tar. Therefore, .tar has also been used
as the outer package format, even though it has larger overhead than
other formats (mostly due to padding).

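The padding overhead mentioned above can be estimated with a little
arithmetic: each member costs one 512-byte header and has its data
padded to a multiple of 512 bytes, the archive ends with two zero
blocks, and both GNU tar and Python's ``tarfile`` pad the result to
a 10240-byte record by default. The sketch below assumes short member
names (no extension headers) and checks the estimate against
``tarfile``; the member sizes are made up.

```python
import io
import tarfile

BLOCK = 512     # header size and data alignment
RECORD = 10240  # default record size (20 blocks)


def round_up(n, multiple):
    return -(-n // multiple) * multiple


def tar_size(member_sizes):
    """Predicted size of a .tar holding members of the given sizes."""
    body = sum(BLOCK + round_up(size, BLOCK) for size in member_sizes)
    return round_up(body + 2 * BLOCK, RECORD)  # end marker, then record pad


sizes = [300, 5000]  # e.g. a tiny metadata archive and a small image archive
predicted = tar_size(sizes)

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for index, size in enumerate(sizes):
        info = tarfile.TarInfo(f"member{index}")
        info.size = size
        tar.addfile(info, io.BytesIO(bytes(size)))
actual = len(buf.getvalue())
overhead = actual - sum(sizes)
print(predicted, actual, overhead)
```

For realistically sized image archives the fixed cost of a few
kilobytes is negligible, which is consistent with accepting .tar as
the outer format.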

Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By requiring the metadata archive to be stored before the image archive,
the package manager may stop fetching after reading it and save
bandwidth and/or space.


Detached OpenPGP signatures
---------------------------

The use of detached OpenPGP signatures is to provide authenticity checks
for binary packages. Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them. Covering the compressed archives helps to prevent zipbomb
attacks. Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.


Format versioning
-----------------

It has been requested that an explicit version identifier be added
into the binary package containers in order to account for possible
incompatible changes in the format. However, such an explicit notion
does not seem necessary.

Firstly, the format is meant to be extensible while preserving backwards
compatibility. If a backwards-incompatible change needs to be done,
and that change does not make the packages implicitly incompatible
by design, the incompatibility can be easily forced e.g. via renaming
the metadata archive to ``metadata-v2.tar*``.

Secondly, the only really clean place for such a version would be
an additional file which would unnecessarily grow the uncompressed
tarball. The label is non-obligatory and user-oriented, and as such
cannot be used to carry information significant to the package manager.

Finally, such a version number can be added into the metadata archive
which needs to be processed by the package manager to extract all
significant binary package information.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages. It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause
them to be installed to the filesystem by old Portage versions. Working
around this would require awkward hacks that would run against
the goal of using a simple and transparent package format.


Reference Implementation
========================

A proof-of-concept implementation of a binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
   packages
   (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
   written in C
   (https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
   to gpkg binpkg format
   (https://github.com/mgorny/xpak2gpkg)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.

--
Best regards,
Michał Górny