Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP r2] Gentoo binary package container format
Date: Tue, 20 Nov 2018 20:33:32
Message-Id: 1542745997.18030.3.camel@gentoo.org
In Reply to: [gentoo-dev] [pre-GLEP] Gentoo binary package container format by "Michał Górny"
1 Hi,
2
3 On Sat, 2018-11-17 at 12:21 +0100, Michał Górny wrote:
4 > Here's a pre-GLEP draft based on the earlier discussion on gentoo-
5 > portage-dev mailing list. The specification uses GLEP form as it
6 > provides for cleanly specifying the motivation and rationale.
7
8 Here's the third iteration. Changes since r1:
9 - removed unnecessary OpenPGP details, made them out of scope,
10 - added explicit section on (lack of) versioning and how to recognize
11 packages and their compatibility,
12 - explained why squashfs is a no-go.
13
14
15 ---
16 GLEP: 9999
17 Title: Gentoo binary package container format
18 Author: Michał Górny <mgorny@g.o>
19 Type: Standards Track
20 Status: Draft
21 Version: 1
22 Created: 2018-11-15
23 Last-Modified: 2018-11-20
24 Post-History: 2018-11-17
25 Content-Type: text/x-rst
26 ---
27
28 Abstract
29 ========
30
31 This GLEP proposes a new binary package container format for Gentoo.
32 The current tbz2/XPAK format is briefly described, and its deficiencies
33 are explained. Accordingly, the requirements for a new format are set
34 and a gpkg format satisfying them is proposed. The rationale for
35 the design decisions is provided.
36
37
38 Motivation
39 ==========
40
41 The current Portage binary package format
42 -----------------------------------------
43
44 The historical ``.tbz2`` binary package format used by Portage is
45 a concatenation of two distinct formats: header-oriented compressed .tar
46 format (used to hold package files) and trailer-oriented custom XPAK
47 format (used to hold metadata) [#MAN-XPAK]_. The format has already
48 been extended incompatibly twice.
49
50 The first time, support for storing multiple successive builds of a binary
51 package for a single ebuild version was added. This feature relies
52 on appending an additional hyphen, followed by an integer, to the package
53 filename. It is disabled by default (preserving backwards
54 compatibility) and controlled by the ``binpkg-multi-instance`` feature.
55
56 The second time, support for additional compression formats was
57 added. When a format other than bzip2 is used, the ``.tbz2`` suffix
58 is replaced by ``.xpak`` and Portage relies on magic bytes to detect
59 the compression used. For backwards compatibility, Portage still defaults
60 to using bzip2; the compression program can be switched using the
61 ``BINPKG_COMPRESS`` configuration variable.
62
63 Additionally, there have been minor changes to the stored metadata
64 and file storage policies. In particular, behavior regarding
65 ``INSTALL_MASK``, controllable file compression and stripping has
66 changed over time.
67
68
69 The advantages of tbz2/XPAK format
70 ----------------------------------
71
72 The tbz2/XPAK format used by Portage has three interesting features:
73
74 1. **Each binary package is fully contained within a single file.**
75 While this might seem unnecessary, it makes it easier for the user
76 to transfer binary packages without having to be concerned about
77 finding all the necessary files to transfer.
78
79 2. **The binary packages are compatible with regular compressed
80 tarballs, most of the time.** With the notable exceptions of historical
81 versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
82 can be extracted using a regular tar utility with a compressor
83 implementation that discards trailing garbage.
84
85 3. **The metadata is uncompressed, and can be efficiently accessed
86 without decompressing package contents.** This includes
87 the possibility of rewriting it (e.g. as a result of package moves)
88 without the necessity of repacking the files.
89
90
91 Transparency problem with the current binary package format
92 -----------------------------------------------------------
93
94 Notwithstanding its advantages, the tbz2/XPAK format has a significant
95 design fault that consists of two issues:
96
97 1. **The XPAK format is a custom binary format with explicit use
98 of binary-encoded file offsets and field lengths.** As such, it is
99 non-trivial to read or edit without specialized tools. Such tools
100 are currently implemented separately from the package manager,
101 as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.
102
103 2. **The tarball compatibility feature relies on the obscure feature of
104 ignoring trailing garbage in compressed files**. While this is
105 implemented consistently in most of the compressors, this feature
106 is not really a part of the format specifications but rather traditional
107 behavior. Given that the original reasons for this no longer apply,
108 new compressor implementations are likely to lack support for this.
109
110 Both of the issues make the format hard to use without dedicated tools,
111 or when the tools misbehave. This impacts the following scenarios:
112
113 A. **Using binary packages for system recovery.** In case of serious
114 breakage, it is really preferable that the format depends on as few
115 tools as possible, and especially not on Gentoo-specific tools.
116
117 B. **Inspecting binary packages in detail exceeding standard package
118 manager facilities.**
119
120 C. **Modifying binary packages in ways not predicted by the package
121 manager authors.** A real-life example of this is working around
122 broken ``pkg_*`` phases which prevent the package from being
123 installed.
124
125
126 OpenPGP extensibility problem
127 -----------------------------
128
129 There are at least three obvious ways in which the current format could
130 be extended to support OpenPGP signatures, and each of them has its own
131 distinct problem:
132
133 1. **Adding a detached signature.** This option is non-intrusive but
134 causes the format to no longer be contained in a single file.
135
136 2. **Wrapping the package in OpenPGP message format.** This would use
137 a standard format and make verification and unpacking relatively
138 easy. However, it would break backwards compatibility and add
139 explicit dependency on OpenPGP implementation in order to unpack
140 the package.
141
142 3. **Adding OpenPGP signature as extra XPAK member.** This is
143 the clever solution. It implies strengthening the dependency
144 on custom tooling, now additionally necessary to extract
145 the signature and reconstruct the original file to accommodate
146 verification.
147
148
149 Goals for a new container format
150 --------------------------------
151
152 All of the above considered, the new format should combine
153 the advantages of the existing format and at the same time address its
154 deficiencies whenever possible. Furthermore, since a format replacement
155 is taking place it is worthwhile to consider additional goals that could
156 be satisfied with little change.
157
158 The following obligatory goals have been set for a replacement format:
159
160 1. **The packages must remain contained in a single file.** As a matter
161 of user convenience, it should be possible to transfer binary
162 packages without having to use multiple files, and to install them
163 from any location.
164
165 2. **The file format must be entirely based on common file formats,
166 respecting best practices, with as little customization as necessary
167 to satisfy the requirements.** The format should be transparent
168 enough to let user inspect and manipulate it without special tooling
169 or detailed knowledge.
170
171 3. **The file format must provide support for OpenPGP signatures.**
172 Preferably, it should use standard OpenPGP message formats.
173
174 4. **The file format must allow for efficient metadata updates.**
175 In particular, it should be possible to update the metadata without
176 having to recompress package files.
177
178 Additionally, the following optional goals have been noted:
179
180 A. **The file format should account for easy recognition both through
181 filename and through contents.** Preferably, it should have distinct
182 features making it possible to detect it via file(1).
183
184 B. **The file format should provide for partial fetching of binary
185 packages.** It should be possible to easily fetch and read
186 the package metadata without having to download the whole package.
187
188 C. **The file format should allow for metadata compression.**
189
190 D. **The file format should make future extensions easily possible
191 without breaking backwards compatibility.**
192
193
194 Specification
195 =============
196
197 The container format
198 --------------------
199
200 The gpkg package container is an uncompressed .tar archive whose filename
201 should use the ``.gpkg.tar`` suffix. This archive contains the following
202 members, in order:
203
204 1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).
205
206 2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
207 (optional).
208
209 3. The metadata archive ``metadata.tar${comp}``, optionally compressed
210 (required).
211
212 4. A signature for the filesystem image archive:
213 ``image.tar${comp}.sig`` (optional).
214
215 5. The filesystem image archive ``image.tar${comp}``, optionally
216 compressed (required).
217
218 It is recommended that the relative order of the archive members is
219 preserved. However, implementations must support archives with members
220 out of order.
221
222 The container may be extended with additional members in the future.
223 The implementations should ignore unrecognized members and preserve
224 them across package updates.
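
The member layout above can be sketched with Python's standard ``tarfile`` module. This is only an illustration, not the reference implementation: the helper names ``add_bytes`` and ``build_gpkg`` are hypothetical, the inner archives and signatures are treated as opaque bytes, and encoding the label as a GNU tar volume header (typeflag ``V``) is an assumption, since the text does not name the header type.

```python
import io
import tarfile

# Assumption: the volume label is a GNU tar volume header (typeflag "V").
# tarfile has no named constant for it, so the raw typeflag is used.
GNUTYPE_VOLHDR = b"V"


def add_bytes(tar, name, data):
    """Append one regular-file member holding `data` (hypothetical helper)."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def build_gpkg(package_id, metadata, image, comp="",
               metadata_sig=None, image_sig=None):
    """Return an uncompressed container with members in the recommended order.

    `metadata` and `image` are the already-built (and possibly compressed)
    inner archives; `comp` is their suffix, e.g. ".gz" or "".
    """
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # 1. optional volume label
        label = tarfile.TarInfo("gpkg: " + package_id)
        label.type = GNUTYPE_VOLHDR
        tar.addfile(label)
        # 2. optional metadata signature, 3. required metadata archive
        if metadata_sig is not None:
            add_bytes(tar, "metadata.tar" + comp + ".sig", metadata_sig)
        add_bytes(tar, "metadata.tar" + comp, metadata)
        # 4. optional image signature, 5. required image archive
        if image_sig is not None:
            add_bytes(tar, "image.tar" + comp + ".sig", image_sig)
        add_bytes(tar, "image.tar" + comp, image)
    return out.getvalue()
```

Because the label is the first header, the resulting bytes begin with the string ``gpkg: ``, which is what makes the container recognizable by simple tools.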
225
226
227 The volume label
228 ----------------
229
230 The volume label provides an easy way for users to identify the binary
231 package without dedicated tooling or specific format knowledge.
232
233 The implementations should include a volume label consisting of the fixed
234 string ``gpkg:``, followed by a single space, followed by the full package
235 identifier. However, the implementations must not rely on the volume
236 label being present or attempt to parse its value when it is.
237
238 Furthermore, since the volume label is included in the .tar archive
239 as the first member, it provides a magic string at a fixed location
240 that can be used by tools such as file(1) to easily distinguish Gentoo
241 binary packages from regular .tar archives.
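
A file(1)-style check for the label might look like the following sketch. It assumes the label is written as a GNU tar volume header (typeflag ``V`` at byte offset 156 of the first 512-byte block); the function names are hypothetical. Such a check suits identification tools only, since the package manager itself must not rely on the label.

```python
import io
import tarfile

BLOCK = 512        # tar header size
NAME_END = 100     # the member name occupies bytes 0..99
TYPEFLAG_AT = 156  # offset of the one-byte typeflag field


def has_gpkg_label(header):
    """True if the first tar block looks like a 'gpkg: ...' volume label."""
    if len(header) < BLOCK:
        return False
    name = header[:NAME_END].split(b"\0", 1)[0]
    # Assumption: the label uses the GNU volume-header typeflag "V".
    return header[TYPEFLAG_AT:TYPEFLAG_AT + 1] == b"V" and name.startswith(b"gpkg: ")


def demo_first_block(package_id):
    """Build a label-only archive and return its first block (illustration)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        label = tarfile.TarInfo("gpkg: " + package_id)
        label.type = b"V"
        tar.addfile(label)
    return buf.getvalue()[:BLOCK]
```

Only the first 512 bytes of the file are inspected, so the check works equally well on a partially fetched package.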
242
243
244 The metadata archive
245 --------------------
246
247 The metadata archive stores the package metadata needed for the package
248 manager to process it. The archive should be included at the beginning
249 of the binary package in order to make it possible to read it out of a
250 partially fetched binary package, and to avoid fetching the remaining
251 part of the package if not necessary.
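
Reading the metadata from a partially fetched package can be sketched with ``tarfile`` in stream mode, which only reads forward. The function names are hypothetical, and ``stream`` stands for any file-like object, such as the body of an HTTP response that is abandoned once the metadata has been read.

```python
import io
import tarfile


def is_metadata_member(name):
    """Match metadata.tar or metadata.tar.<suffix>, but not its signature."""
    if name.endswith(".sig"):
        return False
    return name == "metadata.tar" or name.startswith("metadata.tar.")


def read_metadata_from_stream(stream):
    """Return (member name, raw bytes) of the metadata archive, or None.

    Mode "r|" treats the input as a non-seekable, uncompressed tar stream,
    so nothing located after the metadata member is requested from `stream`
    beyond tarfile's own read-ahead buffer.
    """
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if is_metadata_member(member.name):
                return member.name, tar.extractfile(member).read()
    return None
```

If the metadata archive is stored after the image archive, the same loop still finds it, but only after the whole image has been read; this is why the ordering is recommended.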
252
253 The archive contains a single directory called ``metadata``. In this
254 directory, the individual metadata keys are stored as files. The exact
255 keys and metadata format are outside the scope of this specification.
256
257 The package manager may need to modify the package metadata. In this
258 case, it should replace the metadata archive without having to alter
259 other package members.
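
A metadata update along these lines could be sketched as below. The function names are hypothetical. Dropping the old metadata signature is an assumption made here, on the grounds that it can no longer verify against the new contents; the text does not prescribe it.

```python
import io
import tarfile


def is_metadata_name(name):
    """Match the metadata archive and its detached signature."""
    return name == "metadata.tar" or name.startswith("metadata.tar.")


def replace_metadata(src, dst, new_name, new_data):
    """Copy container `src` to `dst`, swapping in a new metadata archive.

    Other members are copied byte for byte; the image archive is never
    decompressed or recompressed.
    """
    with tarfile.open(fileobj=src, mode="r:") as old:
        with tarfile.open(fileobj=dst, mode="w", format=tarfile.GNU_FORMAT) as new:
            replaced = False
            for member in old:
                if is_metadata_name(member.name):
                    if not replaced:
                        info = tarfile.TarInfo(new_name)
                        info.size = len(new_data)
                        new.addfile(info, io.BytesIO(new_data))
                        replaced = True
                    continue  # skip the old archive and its stale signature
                new.addfile(member, old.extractfile(member))
```

A real implementation would re-sign the new metadata archive and could patch the container in place when the new member fits into the same number of 512-byte blocks; this sketch simply rewrites the container.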
260
261 The metadata archive can optionally be compressed. It can also be
262 supplemented with a detached OpenPGP signature.
263
264
265 The image archive
266 -----------------
267
268 The image archive stores all the files to be installed by the binary
269 package. It should be included as the last of the files in the binary
270 package container.
271
272 The archive contains a single directory called ``image``. Inside this
273 directory, all package files are stored in filesystem layout, relative
274 to the root directory.
275
276 The image archive can optionally be compressed. It can also be
277 supplemented with a detached OpenPGP signature.
278
279
280 Archive member compression
281 --------------------------
282
283 The archive members outlined above support optional compression using
284 one of the compressed file formats supported by the package manager.
285 The exact list of compression types is outside the scope of this
286 specification.
287
288 The implementations must support archive members being uncompressed,
289 and must support using different compression types for different files.
290
291 When compressing an archive member, the member filename should be
292 suffixed using the standard suffix for the particular compressed file
293 type (e.g. ``.bz2`` for bzip2 format).
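
The suffix convention can be illustrated with a small lookup. The table below is hypothetical: the list of supported compression types is left to the package manager, so these entries are examples only.

```python
import os

# Hypothetical table; the specification does not fix the supported types.
COMPRESSION_SUFFIXES = {
    ".gz": "gzip",
    ".bz2": "bzip2",
    ".xz": "xz",
    ".zst": "zstd",
}


def member_compression(name):
    """Split e.g. 'image.tar.bz2' into ('image', 'bzip2').

    An uncompressed member such as 'metadata.tar' yields ('metadata', None).
    """
    stem, suffix = os.path.splitext(name)
    if suffix == ".tar":
        return stem, None
    if stem.endswith(".tar") and suffix in COMPRESSION_SUFFIXES:
        return stem[:-len(".tar")], COMPRESSION_SUFFIXES[suffix]
    raise ValueError("unrecognized member name: " + name)
```

Since each member is resolved on its own, a package with ``metadata.tar`` stored uncompressed and ``image.tar.xz`` stored compressed is handled naturally.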
294
295
296 OpenPGP member signatures
297 -------------------------
298
299 The archive members support optional OpenPGP signatures.
300 The implementations must allow the user to specify whether OpenPGP
301 signatures are to be expected in remotely fetched packages.
302
303 If the signatures are expected and the archive member is unsigned, the
304 package manager must reject processing it. If the signature does not
305 verify, the package manager must reject processing the corresponding
306 archive member. In particular, it must not attempt decompressing
307 compressed members in those circumstances.
308
309 The signatures are created as binary detached OpenPGP signature files,
310 with filename corresponding to the member filename with ``.sig`` suffix
311 appended.
312
313 The exact details regarding creating and verifying signatures, as well
314 as maintaining and distributing keys are outside the scope of this
315 specification.
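
The acceptance rules above can be sketched as a small gate that runs before any decompression. Everything named here is hypothetical: ``verify`` stands in for whatever OpenPGP backend the package manager uses, and checking a signature whenever one is present, even if signatures are not expected, is one reading of the text rather than something it states outright.

```python
class SignatureError(Exception):
    """Raised when a member must not be processed (hypothetical name)."""


def signature_name(member_name):
    """Detached signature filename for a member, e.g. image.tar.xz.sig."""
    return member_name + ".sig"


def check_member(name, data, signature, signatures_expected, verify):
    """Return `data` if the member may be processed, raise otherwise.

    `verify(data, signature) -> bool` is a caller-supplied callback.
    The check runs on the still-compressed bytes, so a member that fails
    is never handed to a decompressor.
    """
    if signature is None:
        if signatures_expected:
            raise SignatureError(name + ": signature required but missing")
        return data
    if not verify(data, signature):
        raise SignatureError(name + ": signature does not verify")
    return data
```

The caller would look up ``signature_name(name)`` in the container, pass ``None`` when that member is absent, and decompress only what ``check_member`` returns.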
316
317
318 Versioning and format recognition
319 ---------------------------------
320
321 The container format does not provide an explicit magic identifier
322 or version number. The implementations should recognize binary packages
323 through recognizing the uncompressed .tar archive format,
324 and investigating its contents. Generally, the presence of metadata
325 archive should be sufficient to assume that the package conforms to this
326 specification.
327
328 If the package format needs to be changed in an incompatible way, it should
329 be done in such a way as to make the above check fail. For example,
330 the metadata archive can be renamed to ``metadata-r1.tar*``.
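
The recognition check described above might be sketched as follows. The function name is hypothetical, and matching on ``metadata.tar`` with an optional suffix is an interpretation of "presence of metadata archive".

```python
import io
import tarfile


def is_gpkg(fileobj):
    """True if `fileobj` is an uncompressed tar holding a metadata archive."""
    try:
        # Mode "r:" accepts only uncompressed tar input.
        with tarfile.open(fileobj=fileobj, mode="r:") as tar:
            for member in tar:
                name = member.name
                if name.endswith(".sig"):
                    continue
                if name == "metadata.tar" or name.startswith("metadata.tar."):
                    return True
    except tarfile.TarError:
        return False
    return False
```

A container whose metadata member has been renamed to ``metadata-r1.tar.gz`` fails this check, which is how an incompatible future revision would keep older implementations from misreading it.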
331
332
333 Rationale
334 =========
335
336 Nested archive format
337 ---------------------
338
339 The basic problem in designing the new format was how to embed multiple
340 data streams (metadata, image) into a single file. Traditionally, this
341 has been done via using two non-conflicting file formats. However,
342 while such a solution is clever, it suffers in terms of transparency.
343
344 Therefore, it has been established that the new format should really
345 consist of a single archive format, with all necessary data
346 transparently accessible inside the file. Consequently, it has been
347 debated how different parts of binary package data should be stored
348 inside that archive.
349
350 The proposal to continue storing image data as top-level data
351 in the package format, and store metadata as special directory in that
352 structure has been discarded as a case of in-band signalling.
353
354 Finally, the proposal has been shaped to store different kinds of data
355 as nested archives in the outer binary package container. Besides
356 providing a clean way of accessing different kinds of information, it
357 makes it possible to add separate OpenPGP signatures to them.
358
359
360 Inner vs. outer compression
361 ---------------------------
362
363 One of the points in the new format debate was whether the binary
364 package as a whole should be compressed vs. compressing individual
365 members. The first option may seem like an obvious choice, especially
366 given that with a larger data set, the compression may proceed more
367 effectively. However, it has a single strong disadvantage: compression
368 prevents random access and manipulation of the binary package members.
369
370 While for the purpose of reading binary packages, the problem could be
371 circumvented through convenient member ordering and avoiding disjoint
372 reads of the binary package, metadata updates would either require
373 recompressing the whole package (which could be really time consuming
374 with large packages) or applying complex techniques such as splitting
375 the compressed archive into multiple compressed streams.
376
377 This considered, the simplest solution is to apply compression to
378 the individual package members, while leaving the container format
379 uncompressed. It provides fast random access to the individual members,
380 as well as capability of updating them without the necessity of
381 recompressing other files in the container.
382
383 This also makes it possible to easily protect compressed files using
384 standard OpenPGP detached signature format. All this combined,
385 the package manager may perform partial fetch of binary package, verify
386 the signature of its metadata member and process it without having to
387 fetch the potentially-large image part.
388
389
390 Container and archive formats
391 -----------------------------
392
393 During the debate, the actual archive formats to use were considered.
394 The .tar format seemed an obvious choice for the image archive since
395 it is the only widely deployed archive format that stores all kinds
396 of file metadata on POSIX systems. However, multiple options for
397 the outer format have been debated.
398
399 Firstly, the ZIP format has been proposed as the only commonly supported
400 format supporting adding files from stdin (i.e. making it possible to
401 pipe the inner archives straight into the container without using
402 temporary files). However, this format has been clearly rejected
403 as both not being present in the system set, and being trailer-based
404 and therefore unusable without having to fetch the whole file.
405
406 Secondly, the ar and cpio formats were considered. The former is used
407 by Debian and its derivative binary packages; the latter is used by Red
408 Hat derivatives. Both formats have the advantage of having less
409 historical baggage than .tar, and having less overhead. However, both
410 are also rather obscure (especially given that ar is actually provided
411 by GNU binutils rather than as a stand-alone archiver), considered
412 obsolete by POSIX and both have file size limitations smaller than .tar.
413
414 Thirdly, SquashFS was another interesting option. Its main advantage is
415 transparent compression support and ability to mount as a filesystem.
416 However, it has a significant implementation complexity, including mount
417 management and necessity of fallback to unsquashfs. Since the image
418 needs to be writable for the pre-installation manipulations, using it
419 via a mount would additionally require some kind of overlay filesystem.
420 Using it as top-level format has no real gain over a pipeline with tar,
421 and is certainly less portable. Therefore, there does not seem to be
422 a benefit in using SquashFS.
423
424 All that considered, it has been decided that there is no purpose
425 in using a second archive format in the specification unless it has
426 a significant advantage over .tar. Therefore, .tar has also been used
427 as outer package format, even though it has larger overhead than other
428 formats (mostly due to padding).
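
The padding overhead can be estimated with a short calculation. The sketch assumes plain ustar headers (one 512-byte header per member, data padded to a multiple of 512 bytes, two zero blocks at the end) and the common 10240-byte record size used by Python's ``tarfile`` and by GNU tar's default blocking factor. The function names are hypothetical.

```python
import io
import tarfile

BLOCK = 512          # tar block size
RECORD = 20 * BLOCK  # default record size (blocking factor 20)


def container_size(member_sizes, record_size=RECORD):
    """Size in bytes of a tar archive holding members of the given sizes."""
    total = 0
    for size in member_sizes:
        total += BLOCK                       # member header
        total += -(-size // BLOCK) * BLOCK   # data rounded up to a block
    total += 2 * BLOCK                       # end-of-archive marker
    return -(-total // record_size) * record_size  # pad to a full record


def measured_size(member_sizes):
    """Build the same archive with tarfile and report its real size."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for index, size in enumerate(member_sizes):
            info = tarfile.TarInfo("member%d" % index)
            info.size = size
            tar.addfile(info, io.BytesIO(b"x" * size))
    return len(buf.getvalue())
```

For two members of 10 and 600 bytes, ``container_size([10, 600])`` gives 10240 bytes: 3584 bytes of headers, padded data and end marker, rounded up to one record. The overhead is a fixed cost per member and per archive, so it becomes negligible for packages of realistic size.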
429
430
431 Member ordering
432 ---------------
433
434 The member ordering is explicitly specified in order to provide for
435 trivially reading metadata from partially fetched archives.
436 By storing the metadata archive before the image archive,
437 the package manager may stop fetching after reading it and save
438 bandwidth and/or space.
439
440
441 Detached OpenPGP signatures
442 ---------------------------
443
444 The use of detached OpenPGP signatures is to provide authenticity checks
445 for binary packages. Covering the complete members with signatures
446 provides for trivial verification of all metadata and image contents
447 respectively, without having to invent custom mechanisms for combining
448 them. Covering the compressed archives helps to prevent zipbomb
449 attacks. Covering the individual members rather than the whole package
450 provides for verification of partially fetched binary packages.
451
452
453 Format versioning
454 -----------------
455
456 It has been requested that an explicit version identifier is added
457 into the binary package containers in order to account for possible
458 incompatible changes in the format. However, such an explicit notion
459 does not seem necessary.
460
461 Firstly, the format is meant to be extensible while preserving backwards
462 compatibility. If a backwards-incompatible change needs to be made,
463 and that change does not make the packages implicitly incompatible
464 by design, the incompatibility can be easily forced e.g. via renaming
465 the metadata archive to ``metadata-r1.tar*``.
466
467 Secondly, the only really clean place for such a version would be
468 an additional file which would unnecessarily grow the uncompressed
469 tarball. The label is non-obligatory and user-oriented, and as such
470 cannot be used to carry information significant to the package manager.
471
472 Finally, such a version number can be added into the metadata archive
473 which needs to be processed by the package manager to extract all
474 significant binary package information.
475
476
477 Backwards Compatibility
478 =======================
479
480 The format does not preserve backwards compatibility with the tbz2
481 packages. It has been established that preserving compatibility with
482 the old format was impossible without making the new format even worse
483 than the old one was.
484
485 For example, adding any visible members to the tarball would cause
486 them to be installed to the filesystem by old Portage versions. Working
487 around this would require some kind of awful hacks that would oppose
488 the goal of using simple and transparent package format.
489
490
491 Reference Implementation
492 ========================
493
494 A proof-of-concept implementation of a binary package format converter
495 is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
496 create packages in the new format for early inspection.
497
498
499 References
500 ==========
501
502 .. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
503 packages
504 (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)
505
506 .. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
507 written in C
508 (https://packages.gentoo.org/packages/app-portage/portage-utils)
509
510 .. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
511 to gpkg binpkg format
512 (https://github.com/mgorny/xpak2gpkg)
513
514
515 Copyright
516 =========
517 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
518 Unported License. To view a copy of this license, visit
519 http://creativecommons.org/licenses/by-sa/3.0/.
520
521 --
522 Best regards,
523 Michał Górny
