Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP r1] Gentoo binary package container format
Date: Mon, 19 Nov 2018 18:35:27
Message-Id: 1542652504.26086.4.camel@gentoo.org
In Reply to: [gentoo-dev] [pre-GLEP] Gentoo binary package container format by "Michał Górny"
1 Hi,
2
3 On Sat, 2018-11-17 at 12:21 +0100, Michał Górny wrote:
4 > Here's a pre-GLEP draft based on the earlier discussion on gentoo-
5 > portage-dev mailing list. The specification uses GLEP form as it
6 > provides for cleanly specifying the motivation and rationale.
7
8 Changes in -r1: took into account the feedback and restructured
9 the motivation to point out the advantages of the existing format
10 and to focus on the two real issues: non-transparency and OpenPGP
11 implementation deficiencies. Also added a section on why there's no
12 explicit version number.
13
14 > Also available via HTTPS:
15 >
16 > rst: https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
17 > html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html
18 >
19
20 ---
21 GLEP: 9999
22 Title: Gentoo binary package container format
23 Author: Michał Górny <mgorny@g.o>
24 Type: Standards Track
25 Status: Draft
26 Version: 1
27 Created: 2018-11-15
28 Last-Modified: 2018-11-16
29 Post-History: 2018-11-17
30 Content-Type: text/x-rst
31 ---
32
33 Abstract
34 ========
35
36 This GLEP proposes a new binary package container format for Gentoo.
37 The current tbz2/XPAK format is briefly described, and its deficiencies
38 are explained. Accordingly, the requirements for a new format are set
39 and a gpkg format satisfying them is proposed. The rationale for
40 the design decisions is provided.
41
42
43 Motivation
44 ==========
45
46 The current Portage binary package format
47 -----------------------------------------
48
49 The historical ``.tbz2`` binary package format used by Portage is
50 a concatenation of two distinct formats: header-oriented compressed .tar
51 format (used to hold package files) and trailer-oriented custom XPAK
52 format (used to hold metadata) [#MAN-XPAK]_. The format has already
53 been extended incompatibly twice.
54
55 The first time, support for storing multiple successive builds of a binary
56 package for a single ebuild version was added. This feature relies
57 on appending an additional hyphen, followed by an integer, to the package
58 filename. It is disabled by default (preserving backwards
59 compatibility) and controlled by the ``binpkg-multi-instance`` feature.
60
61 The second time, support for additional compression formats was
62 added. When a format other than bzip2 is used, the ``.tbz2`` suffix
63 is replaced by ``.xpak`` and Portage relies on magic bytes to detect
64 the compression used. For backwards compatibility, Portage still defaults
65 to using bzip2; the compression program can be switched using
66 the ``BINPKG_COMPRESS`` configuration variable.
67
68 Additionally, there have been minor changes to the stored metadata
69 and file storage policies. In particular, behavior regarding
70 ``INSTALL_MASK``, controllable file compression and stripping has
71 changed over time.
72
73
74 The advantages of tbz2/XPAK format
75 ----------------------------------
76
77 The tbz2/XPAK format used by Portage has three interesting features:
78
79 1. **Each binary package is fully contained within a single file.**
80 While this might seem unnecessary, it makes it easier for the user
81 to transfer binary packages without having to be concerned about
82 finding all the necessary files to transfer.
83
84 2. **The binary packages are compatible with regular compressed
85 tarballs, most of the time.** With the notable exceptions of historical
86 versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
87 can be extracted using the regular tar utility with a compressor
88 implementation that discards trailing garbage.
89
90 3. **The metadata is uncompressed, and can be efficiently accessed
91 without decompressing package contents.** This includes
92 the possibility of rewriting it (e.g. as a result of package moves)
93 without the necessity of repacking the files.
94
95
96 Transparency problem with the current binary package format
97 -----------------------------------------------------------
98
99 Notwithstanding its advantages, the tbz2/XPAK format has a significant
100 design fault that consists of two issues:
101
102 1. **The XPAK format is a custom binary format with explicit use
103 of binary-encoded file offsets and field lengths.** As such, it is
104 non-trivial to read or edit without specialized tools. Such tools
105 are currently implemented separately from the package manager,
106 as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.
107
108 2. **The tarball compatibility feature relies on the obscure feature of
109 ignoring trailing garbage in compressed files.** While this is
110 implemented consistently in most of the compressors, this feature
111 is not really a part of any specification but rather traditional
112 behavior. Given that the original reasons for this no longer apply,
113 new compressor implementations are likely to miss support for this.
114
115 Both of the issues make the format hard to use without dedicated tools,
116 or when the tools misbehave. This impacts the following scenarios:
117
118 A. **Using binary packages for system recovery.** In case of serious
119 breakage, it is really preferable that the format depends on as few
120 tools as possible, and especially not on Gentoo-specific tools.
121
122 B. **Inspecting binary packages in detail exceeding standard package
123 manager facilities.**
124
125 C. **Modifying binary packages in ways not predicted by the package
126 manager authors.** A real-life example of this is working around
127 broken ``pkg_*`` phases which prevent the package from being
128 installed.
129
130
131 OpenPGP extensibility problem
132 -----------------------------
133
134 There are at least three obvious ways in which the current format could
135 be extended to support OpenPGP signatures, and each of them has its own
136 distinct problem:
137
138 1. **Adding a detached signature.** This option is non-intrusive but
139 causes the format to no longer be contained in a single file.
140
141 2. **Wrapping the package in OpenPGP message format.** This would use
142 a standard format and make verification and unpacking relatively
143 easy. However, it would break backwards compatibility and add
144 explicit dependency on OpenPGP implementation in order to unpack
145 the package.
146
147 3. **Adding OpenPGP signature as extra XPAK member.** This is
148 the clever solution. It implies strengthening the dependency
149 on custom tooling, now additionally necessary to extract
150 the signature and reconstruct the original file to accommodate
151 verification.
152
153
154 Goals for a new container format
155 --------------------------------
156
157 All of the above considered, the new format should combine
158 the advantages of the existing format and at the same time address its
159 deficiencies whenever possible. Furthermore, since a format replacement
160 is taking place it is worthwhile to consider additional goals that could
161 be satisfied with little change.
162
163 The following obligatory goals have been set for a replacement format:
164
165 1. **The packages must remain contained in a single file.** As a matter
166 of user convenience, it should be possible to transfer binary
167 packages without having to use multiple files, and to install them
168 from any location.
169
170 2. **The file format must be entirely based on common file formats,
171 respecting best practices, with as little customization as necessary
172 to satisfy the requirements.** The format should be transparent
173 enough to let user inspect and manipulate it without special tooling
174 or detailed knowledge.
175
176 3. **The file format must provide support for OpenPGP signatures.**
177 Preferably, it should use standard OpenPGP message formats.
178
179 4. **The file format must allow for efficient metadata updates.**
180 In particular, it should be possible to update the metadata without
181 having to recompress package files.
182
183 Additionally, the following optional goals have been noted:
184
185 A. **The file format should account for easy recognition both through
186 filename and through contents.** Preferably, it should have distinct
187 features making it possible to detect it via file(1).
188
189 B. **The file format should provide for partial fetching of binary
190 packages.** It should be possible to easily fetch and read
191 the package metadata without having to download the whole package.
192
193 C. **The file format should allow for metadata compression.**
194
195 D. **The file format should make future extensions easily possible
196 without breaking backwards compatibility.**
197
198
199 Specification
200 =============
201
202 The container format
203 --------------------
204
205 The gpkg package container is an uncompressed .tar archive whose filename
206 uses the ``.gpkg.tar`` suffix. This archive contains the following members,
207 in order:
208
209 1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).
210
211 2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
212 (optional).
213
214 3. The metadata archive ``metadata.tar${comp}``, optionally compressed
215 (required).
216
217 4. A signature for the filesystem image archive:
218 ``image.tar${comp}.sig`` (optional).
219
220 5. The filesystem image archive ``image.tar${comp}``, optionally
221 compressed (required).
222
223 It is recommended that relative order of the archive members is
224 preserved. However, implementations must support archives with members
225 out of order.
226
227 The container may be extended with additional members in the future.
228 The implementations should ignore unrecognized members and preserve
229 them across package updates.
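
The member layout above can be sketched with Python's standard ``tarfile``
module. This is only an illustration, not the reference implementation:
``build_gpkg``, its parameters and the sample package identifier are
hypothetical names, signatures are left out, and the volume label is written
by setting the GNU volume header type flag (``V``) by hand, since ``tarfile``
has no dedicated API for it.

```python
import io
import tarfile


def _tar_bytes(files, compression=""):
    """Pack {path: bytes} into an in-memory tar ("", "gz", "bz2" or "xz")."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:" + compression,
                      format=tarfile.GNU_FORMAT) as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_gpkg(package_id, metadata, image, compression="bz2"):
    """Return the bytes of an unsigned .gpkg.tar container (hypothetical helper)."""
    suffix = "." + compression if compression else ""
    members = [
        # The metadata archive comes first so it can be read from a partial fetch.
        ("metadata.tar" + suffix,
         _tar_bytes({"metadata/" + k: v for k, v in metadata.items()}, compression)),
        # The image archive is the last member.
        ("image.tar" + suffix,
         _tar_bytes({"image/" + k: v for k, v in image.items()}, compression)),
    ]
    out = io.BytesIO()
    # The outer container itself is never compressed.
    with tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as outer:
        label = tarfile.TarInfo("gpkg: " + package_id)
        label.type = b"V"  # GNU tar volume header type flag (set by hand)
        outer.addfile(label)
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            outer.addfile(info, io.BytesIO(data))
    return out.getvalue()
```

Calling ``build_gpkg("app-misc/foo-1.0", {"CATEGORY": b"app-misc\n"},
{"usr/bin/foo": b"#!/bin/sh\n"})`` yields a container whose members are the
label, ``metadata.tar.bz2`` and ``image.tar.bz2``, in that order.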
230
231
232 The volume label
233 ----------------
234
235 The volume label provides an easy way for users to identify the binary
236 package without dedicated tooling or specific format knowledge.
237
238 The implementations should include a volume label consisting of the fixed
239 string ``gpkg:``, followed by a single space, followed by the full package
240 identifier. However, the implementations must not rely on the volume
241 label being present or attempt to parse its value when it is.
242
243 Furthermore, since the volume label is included in the .tar archive
244 as the first member, it provides a magic string at a fixed location
245 that can be used by tools such as file(1) to easily distinguish Gentoo
246 binary packages from regular .tar archives.
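
As a rough sketch of such a check: in a .tar header block the member name
occupies the first 100 bytes and the type flag sits at offset 156, so a
labelled package can be recognized from its first 512 bytes alone. The
function names below are hypothetical, and the sketch assumes the label is
written as a GNU volume header (type flag ``V``). It only answers "does this
look like a gpkg?" and deliberately does not parse the identifier.

```python
BLOCK = 512  # size of one tar header block


def looks_like_gpkg(first_block: bytes) -> bool:
    """Heuristic check of the first header block; False does not prove anything,
    because the volume label is optional."""
    if len(first_block) < BLOCK:
        return False
    name = first_block[:100]          # NUL-padded member name field
    typeflag = first_block[156:157]   # single-byte type flag
    return typeflag == b"V" and name.startswith(b"gpkg: ")


def is_gpkg_file(path: str) -> bool:
    """Convenience wrapper reading only the first block of a file."""
    with open(path, "rb") as f:
        return looks_like_gpkg(f.read(BLOCK))
```

A package manager would still have to open the archive and look for the
metadata member, since the label may be absent.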
247
248
249 The metadata archive
250 --------------------
251
252 The metadata archive stores the package metadata needed for the package
253 manager to process it. The archive should be included at the beginning
254 of the binary package in order to make it possible to read it out of
255 a partially fetched binary package, and to avoid fetching the remaining
256 part of the package if not necessary.
257
258 The archive contains a single directory called ``metadata``. In this
259 directory, the individual metadata keys are stored as files. The exact
260 keys and metadata format are outside the scope of this specification.
261
262 The package manager may need to modify the package metadata. In this
263 case, it should replace the metadata archive without having to alter
264 other package members.
265
266 The metadata archive can optionally be compressed. It can also be
267 supplemented with a detached OpenPGP signature.
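
Reading the metadata out of a partially fetched package could look roughly
like the sketch below. It assumes the recommended member order and reads the
container as a forward-only stream, so nothing past the metadata member is
touched. ``read_metadata`` is a hypothetical name, signature checking is
omitted, and the inner compression is auto-detected by ``tarfile`` (which
covers only the formats that module supports).

```python
import io
import tarfile


def read_metadata(stream):
    """Return {key: bytes} from the metadata archive of a (possibly truncated)
    gpkg container read from a non-seekable stream."""
    # "r|" reads an uncompressed tar strictly front to back.
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            name = member.name
            if name.startswith("metadata.tar") and not name.endswith(".sig"):
                data = outer.extractfile(member).read()
                break  # stop here: the image archive is never read
        else:
            raise ValueError("no metadata archive found")

    keys = {}
    # "r:*" lets tarfile detect the compression of the inner archive.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as inner:
        for item in inner:
            if item.isfile() and item.name.startswith("metadata/"):
                keys[item.name[len("metadata/"):]] = inner.extractfile(item).read()
    return keys
```

The same function works on a complete file; with a network stream, the
caller can simply close the connection once it returns.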
268
269
270 The image archive
271 -----------------
272
273 The image archive stores all the files to be installed by the binary
274 package. It should be included as the last of the files in the binary
275 package container.
276
277 The archive contains a single directory called ``image``. Inside this
278 directory, all package files are stored in filesystem layout, relative
279 to the root directory.
280
281 The image archive can optionally be compressed. It can also be
282 supplemented with a detached OpenPGP signature.
283
284
285 Archive member compression
286 --------------------------
287
288 The archive members outlined above support optional compression using
289 one of the compressed file formats supported by the package manager.
290 The exact list of compression types is outside the scope of this
291 specification.
292
293 The implementations must support archive members being uncompressed,
294 and must support using different compression types for different files.
295
296 When compressing an archive member, the member filename should be
297 suffixed using the standard suffix for the particular compressed file
298 type (e.g. ``.bz2`` for bzip2 format).
299
300
301 OpenPGP member signatures
302 -------------------------
303
304 The archive members support optional OpenPGP signatures.
305 The implementations must allow the user to specify whether OpenPGP
306 signatures are to be expected in remotely fetched packages.
307
308 If the signatures are expected and the archive member is unsigned, the
309 package manager must reject processing it. If the signature does not
310 verify, the package manager must reject processing the corresponding
311 archive member. In particular, it must not attempt decompressing
312 compressed members in those circumstances.
313
314 If the implementation needs to manipulate archive members, it must
315 either create a new signature or discard the existing signature.
316
317 The signatures are created as binary detached OpenPGP signature files,
318 with filename corresponding to the member filename with ``.sig`` suffix
319 appended.
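
The acceptance rules above can be summarized as a small policy function.
This is a sketch of one reading of the text: a signature that is present is
always verified, even when signatures are not required. ``check_member``,
``SignatureError`` and the ``verify`` callback are hypothetical; the
callback stands in for an actual OpenPGP implementation and is expected to
return ``True`` only for a valid detached signature.

```python
from typing import Callable, Optional


class SignatureError(Exception):
    """Raised when a member must not be processed."""


def check_member(data: bytes,
                 signature: Optional[bytes],
                 require_signatures: bool,
                 verify: Callable[[bytes, bytes], bool]) -> bytes:
    """Return the still-compressed member bytes if they may be processed.

    The check runs on the compressed bytes, so no decompression happens
    before the signature has been accepted.
    """
    if signature is None:
        if require_signatures:
            raise SignatureError("member is unsigned but signatures are expected")
        return data
    if not verify(data, signature):
        raise SignatureError("signature does not verify")
    return data
```

Only the value returned by ``check_member`` would then be handed to the
decompressor.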
320
321
322 Rationale
323 =========
324
325 Nested archive format
326 ---------------------
327
328 The basic problem in designing the new format was how to embed multiple
329 data streams (metadata, image) into a single file. Traditionally, this
330 has been done by using two non-conflicting file formats. However,
331 while such a solution is clever, it suffers in terms of transparency.
332
333 Therefore, it has been established that the new format should really
334 consist of a single archive format, with all necessary data
335 transparently accessible inside the file. Consequently, it has been
336 debated how different parts of binary package data should be stored
337 inside that archive.
338
339 The proposal to continue storing image data as top-level data
340 in the package format, and store metadata as a special directory in that
341 structure has been discarded as a case of in-band signalling.
342
343 Finally, the proposal has been shaped to store different kinds of data
344 as nested archives in the outer binary package container. Besides
345 providing a clean way of accessing different kinds of information, it
346 makes it possible to add separate OpenPGP signatures to them.
347
348
349 Inner vs. outer compression
350 ---------------------------
351
352 One of the points in the new format debate was whether the binary
353 package as a whole should be compressed vs. compressing individual
354 members. The first option may seem like an obvious choice, especially
355 given that with a larger data set, the compression may proceed more
356 effectively. However, it has a single strong disadvantage: compression
357 prevents random access and manipulation of the binary package members.
358
359 While for the purpose of reading binary packages, the problem could be
360 circumvented through convenient member ordering and avoiding disjoint
361 reads of the binary package, metadata updates would either require
362 recompressing the whole package (which could be really time consuming
363 with large packages) or applying complex techniques such as splitting
364 the compressed archive into multiple compressed streams.
365
366 This considered, the simplest solution is to apply compression to
367 the individual package members, while leaving the container format
368 uncompressed. It provides fast random access to the individual members,
369 as well as capability of updating them without the necessity of
370 recompressing other files in the container.
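
A metadata update under this design might be sketched as follows: the
container is rewritten member by member, the image archive is copied byte
for byte without being decompressed, and a stale metadata signature is
dropped (creating a new signature would be the alternative).
``replace_metadata`` is a hypothetical helper operating on in-memory bytes
for brevity.

```python
import io
import tarfile


def replace_metadata(container: bytes, new_name: str, new_data: bytes) -> bytes:
    """Return a new container with the metadata archive replaced.

    new_name is the member name to use, e.g. "metadata.tar.bz2";
    new_data is the already packed (and optionally compressed) archive.
    """
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(container), mode="r:") as src, \
            tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as dst:
        for member in src:
            if member.name.startswith("metadata.tar"):
                if member.name.endswith(".sig"):
                    continue  # the old signature no longer matches; discard it
                info = tarfile.TarInfo(new_name)
                info.size = len(new_data)
                dst.addfile(info, io.BytesIO(new_data))
            else:
                # Label, image archive, image signature and unrecognized
                # members are copied verbatim, in their original order.
                data = src.extractfile(member) if member.isreg() else None
                dst.addfile(member, data)
    return out.getvalue()
```

The cost of the update is therefore a plain copy of the image member, with
no compressor involved.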
371
372 This also makes it possible to easily protect compressed files using
373 the standard OpenPGP detached signature format. All this combined,
374 the package manager may perform a partial fetch of a binary package, verify
375 the signature of its metadata member and process it without having to
376 fetch the potentially large image part.
377
378
379 Container and archive formats
380 -----------------------------
381
382 During the debate, the actual archive formats to use were considered.
383 The .tar format seemed an obvious choice for the image archive since
384 it is the only widely deployed archive format that stores all kinds
385 of file metadata on POSIX systems. However, multiple options for
386 the outer format have been debated.
387
388 Firstly, the ZIP format has been proposed as the only commonly supported
389 format supporting adding files from stdin (i.e. making it possible to
390 pipe the inner archives straight into the container without using
391 temporary files). However, this format has been clearly rejected
392 as both not being present in the system set, and being trailer-based
393 and therefore unusable without having to fetch the whole file.
394
395 Secondly, the ar and cpio formats were considered. The former is used
396 by Debian and its derivative binary packages; the latter is used by Red
397 Hat derivatives. Both formats have the advantage of having less
398 historical baggage than .tar, and having less overhead. However, both
399 are also rather obscure (especially given that ar is actually provided
400 by GNU binutils rather than as a stand-alone archiver), considered
401 obsolete by POSIX and both have file size limitations smaller than .tar.
402
403 All that considered, it has been decided that there is no purpose
404 in using a second archive format in the specification unless it has
405 a significant advantage over .tar. Therefore, .tar has also been used
406 as the outer package format, even though it has larger overhead than other
407 formats (mostly due to padding).
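
To put a number on the padding overhead, the size of the outer container can
be estimated as in the sketch below. It assumes one 512-byte header per
member (short names, no extended headers), data padded to a multiple of 512
bytes, two zero blocks as the end-of-archive marker, and a final round-up to
a 20-block record, which is the default blocking used by GNU tar and by
Python's ``tarfile``. ``tar_size`` is a hypothetical helper.

```python
BLOCK = 512          # tar block size
RECORD = 20 * BLOCK  # default record size (blocking factor 20)


def tar_size(member_sizes, record=RECORD):
    """Estimate the size in bytes of an uncompressed tar with the given members."""
    total = 0
    for size in member_sizes:
        padded = -(-size // BLOCK) * BLOCK  # round data up to a whole block
        total += BLOCK + padded             # one header block plus padded data
    total += 2 * BLOCK                      # end-of-archive marker
    return -(-total // record) * record     # round up to a whole record
```

For a container holding a 3000-byte metadata archive and a 1,000,000-byte
image archive, the estimate is 1,013,760 bytes, i.e. 10,760 bytes of
overhead on top of the two members.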
408
409
410 Member ordering
411 ---------------
412
413 The member ordering is explicitly specified in order to provide for
414 trivially reading metadata from partially fetched archives.
415 By requiring the metadata archive to be stored before the image archive,
416 the package manager may stop fetching after reading it and save
417 bandwidth and/or space.
418
419
420 Detached OpenPGP signatures
421 ---------------------------
422
423 The use of detached OpenPGP signatures is to provide authenticity checks
424 for binary packages. Covering the complete members with signatures
425 provides for trivial verification of all metadata and image contents
426 respectively, without having to invent custom mechanisms for combining
427 them. Covering the compressed archives helps to prevent zipbomb
428 attacks. Covering the individual members rather than the whole package
429 provides for verification of partially fetched binary packages.
430
431
432 Format versioning
433 -----------------
434
435 It has been requested that an explicit version identifier be added
436 into the binary package containers in order to account for possible
437 incompatible changes in the format. However, such an explicit notion
438 does not seem necessary.
439
440 Firstly, the format is meant to be extensible while preserving backwards
441 compatibility. If a backwards-incompatible change needs to be done,
442 and that change does not make the packages implicitly incompatible
443 by design, the incompatibility can be easily forced e.g. via renaming
444 the metadata archive to ``metadata-v2.tar*``.
445
446 Secondly, the only really clean place for such a version would be
447 an additional file which would unnecessarily grow the uncompressed
448 tarball. The label is non-obligatory and user-oriented, and as such
449 cannot be used to carry information significant to the package manager.
450
451 Finally, such a version number can be added into the metadata archive
452 which needs to be processed by the package manager to extract all
453 significant binary package information.
454
455
456 Backwards Compatibility
457 =======================
458
459 The format does not preserve backwards compatibility with the tbz2
460 packages. It has been established that preserving compatibility with
461 the old format was impossible without making the new format even worse
462 than the old one was.
463
464 For example, adding any visible members to the tarball would cause
465 them to be installed to the filesystem by old Portage versions. Working
466 around this would require some kind of awful hacks that would oppose
467 the goal of using a simple and transparent package format.
468
469
470 Reference Implementation
471 ========================
472
473 A proof-of-concept implementation of a binary package format converter
474 is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
475 create packages in the new format for early inspection.
476
477
478 References
479 ==========
480
481 .. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
482 packages
483 (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)
484
485 .. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
486 written in C
487 (https://packages.gentoo.org/packages/app-portage/portage-utils)
488
489 .. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
490 to gpkg binpkg format
491 (https://github.com/mgorny/xpak2gpkg)
492
493
494 Copyright
495 =========
496 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
497 Unported License. To view a copy of this license, visit
498 http://creativecommons.org/licenses/by-sa/3.0/.
499
500 --
501 Best regards,
502 Michał Górny
