Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP r4] Gentoo binary package container format
Date: Fri, 30 Nov 2018 17:10:05
Message-Id: 1543597794.20082.6.camel@gentoo.org
In Reply to: [gentoo-dev] [pre-GLEP] Gentoo binary package container format by "Michał Górny"
1 Hi,
2
3 Here's hopefully the last update for some time (that is, before I get to
4 working on implementation). There are two small changes:
5
6 - clarified the text on top archive directory: mentioned it shouldn't
7 have an explicit member in the archive and that the implementations
8 should be ready to handle mismatched directory name (i.e. when archive
9 ends up being renamed),
10
11 - removed .txt suffix from 'gpkg-1' package identifier file.
12
13
14 ---
15 GLEP: 9999
16 Title: Gentoo binary package container format
17 Author: Michał Górny <mgorny@g.o>
18 Type: Standards Track
19 Status: Draft
20 Version: 1
21 Created: 2018-11-15
22 Last-Modified: 2018-11-30
23 Post-History: 2018-11-17
24 Content-Type: text/x-rst
25 ---
26
27 Abstract
28 ========
29
30 This GLEP proposes a new binary package container format for Gentoo.
31 The current tbz2/XPAK format is briefly described, and its deficiencies
32 are explained. Accordingly, the requirements for a new format are set
33 and a gpkg format satisfying them is proposed. The rationale for
34 the design decisions is provided.
35
36
37 Motivation
38 ==========
39
40 The current Portage binary package format
41 -----------------------------------------
42
43 The historical ``.tbz2`` binary package format used by Portage is
44 a concatenation of two distinct formats: header-oriented compressed .tar
45 format (used to hold package files) and trailer-oriented custom XPAK
46 format (used to hold metadata) [#MAN-XPAK]_. The format has already
47 been extended incompatibly twice.
48
49 The first time, support for storing multiple successive builds of a binary
50 package for a single ebuild version has been added. This feature relies
51 on appending an additional hyphen, followed by an integer, to the package
52 filename. It is disabled by default (preserving backwards
53 compatibility) and controlled by ``binpkg-multi-instance`` feature.
54
55 The second time, support for additional compression formats has been
56 added. When a format other than bzip2 is used, the ``.tbz2`` suffix
57 is replaced by ``.xpak`` and Portage relies on magic bytes to detect
58 compression used. For backwards compatibility, Portage still defaults
59 to using bzip2; compression program can be switched using
60 ``BINPKG_COMPRESS`` configuration variable.
61
62 Additionally, there have been minor changes to the stored metadata
63 and file storage policies. In particular, behavior regarding
64 ``INSTALL_MASK``, controllable file compression and stripping has
65 changed over time.
66
67
68 The advantages of tbz2/XPAK format
69 ----------------------------------
70
71 The tbz2/XPAK format used by Portage has three interesting features:
72
73 1. **Each binary package is fully contained within a single file.**
74 While this might seem unnecessary, it makes it easier for the user
75 to transfer binary packages without having to be concerned about
76 finding all the necessary files to transfer.
77
78 2. **The binary packages are compatible with regular compressed
79 tarballs, most of the time.** With notable exceptions of historical
80 versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
81 can be extracted using regular tar utility with a compressor
82 implementation that discards trailing garbage.
83
84 3. **The metadata is uncompressed, and can be efficiently accessed
85 without decompressing package contents.** This includes
86 the possibility of rewriting it (e.g. as a result of package moves)
87 without the necessity of repacking the files.
88
89
90 Transparency problem with the current binary package format
91 -----------------------------------------------------------
92
93 Notwithstanding its advantages, the tbz2/XPAK format has a significant
94 design fault that consists of two issues:
95
96 1. **The XPAK format is a custom binary format with explicit use
97 of binary-encoded file offsets and field lengths.** As such, it is
98 non-trivial to read or edit without specialized tools. Such tools
99 are currently implemented separately from the package manager,
100 as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.
101
102 2. **The tarball compatibility feature relies on the obscure feature of
103 ignoring trailing garbage in compressed files**. While this is
104 implemented consistently in most of the compressors, this feature
105 is not really a part of specification but rather traditional
106 behavior. Given that the original reasons for this no longer apply,
107 new compressor implementations are likely to miss support for this.
108
109 Both of the issues make the format hard to use without dedicated tools,
110 or when the tools misbehave. This impacts the following scenarios:
111
112 A. **Using binary packages for system recovery.** In case of serious
113 breakage, it is really preferable that the format depends on as few
114 tools as possible, and especially not on Gentoo-specific tools.
115
116 B. **Inspecting binary packages in detail exceeding standard package
117 manager facilities.**
118
119 C. **Modifying binary packages in ways not predicted by the package
120 manager authors.** A real-life example of this is working around
121 broken ``pkg_*`` phases which prevent the package from being
122 installed.
123
124
125 OpenPGP extensibility problem
126 -----------------------------
127
128 There are at least three obvious ways in which the current format could
129 be extended to support OpenPGP signatures, and each of them has its own
130 distinct problem:
131
132 1. **Adding a detached signature.** This option is non-intrusive but
133 causes the format to no longer be contained in a single file.
134
135 2. **Wrapping the package in OpenPGP message format.** This would use
136 a standard format and make verification and unpacking relatively
137 easy. However, it would break backwards compatibility and add
138 explicit dependency on OpenPGP implementation in order to unpack
139 the package.
140
141 3. **Adding OpenPGP signature as extra XPAK member.** This is
142 the clever solution. It implies strengthening the dependency
143 on custom tooling, now additionally necessary to extract
144 the signature and reconstruct the original file to accommodate
145 verification.
146
147
148 Goals for a new container format
149 --------------------------------
150
151 All of the above considered, the new format should combine
152 the advantages of the existing format and at the same time address its
153 deficiencies whenever possible. Furthermore, since a format replacement
154 is taking place it is worthwhile to consider additional goals that could
155 be satisfied with little change.
156
157 The following obligatory goals have been set for a replacement format:
158
159 1. **The packages must remain contained in a single file.** As a matter
160 of user convenience, it should be possible to transfer binary
161 packages without having to use multiple files, and to install them
162 from any location.
163
164 2. **The file format must be entirely based on common file formats,
165 respecting best practices, with as little customization as necessary
166 to satisfy the requirements.** The format should be transparent
167 enough to let user inspect and manipulate it without special tooling
168 or detailed knowledge.
169
170 3. **The file format must provide support for OpenPGP signatures.**
171 Preferably, it should use standard OpenPGP message formats.
172
173 4. **The file format must allow for efficient metadata updates.**
174 In particular, it should be possible to update the metadata without
175 having to recompress package files.
176
177 Additionally, the following optional goals have been noted:
178
179 A. **The file format should account for easy recognition both through
180 filename and through contents.** Preferably, it should have distinct
181 features making it possible to detect it via file(1).
182
183 B. **The file format should provide for partial fetching of binary
184 packages.** It should be possible to easily fetch and read
185 the package metadata without having to download the whole package.
186
187 C. **The file format should allow for metadata compression.**
188
189 D. **The file format should make future extensions easily possible
190 without breaking backwards compatibility.**
191
192
193 Specification
194 =============
195
196 The container format
197 --------------------
198
199 The gpkg package container is an uncompressed .tar archive whose filename
200 should use ``.gpkg.tar`` suffix.
201
202 The archive contains a number of files, stored in a single directory
203 whose name should match the basename of the package file. However,
204 the implementation must be able to process an archive where
205 the directory name is mismatched. There should be no explicit archive
206 member entry for the directory.
207
208 The package directory contains the following members, in order:
209
210 1. The package format identifier file ``gpkg-1`` (required).
211
212 2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
213 (optional).
214
215 3. The metadata archive ``metadata.tar${comp}``, optionally compressed
216 (required).
217
218 4. A signature for the filesystem image archive:
219 ``image.tar${comp}.sig`` (optional).
220
221 5. The filesystem image archive ``image.tar${comp}``, optionally
222 compressed (required).
223
224 It is recommended that relative order of the archive members is
225 preserved. However, implementations must support archives with members
226 out of order.
227
228 The container may be extended with additional members in the future.
229 The implementations should ignore unrecognized members and preserve
230 them across package updates.
231
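The layout above can be illustrated with a minimal sketch that uses Python's
standard ``tarfile`` module. It is not the reference implementation. The
helper names, the sample package name ``foo-1.0-1`` and the sample metadata
key are assumptions made for illustration, signatures are left out, and the
top directory is assumed to be the package filename without its
``.gpkg.tar`` suffix.

```python
import io
import tarfile


def add_bytes(tar, name, data):
    """Append an in-memory blob to an open tar archive as a regular file."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def inner_archive(topdir, files, comp):
    """Build a nested archive that stores `files` under a single directory."""
    buf = io.BytesIO()
    # "w:" followed by an empty string selects an uncompressed archive.
    with tarfile.open(fileobj=buf, mode="w:" + comp,
                      format=tarfile.GNU_FORMAT) as tar:
        for path, data in sorted(files.items()):
            add_bytes(tar, topdir + "/" + path, data)
    return buf.getvalue()


def build_gpkg(basename, metadata, image, comp="bz2"):
    """Return the bytes of an uncompressed gpkg container (no signatures)."""
    suffix = "." + comp if comp else ""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w",
                      format=tarfile.GNU_FORMAT) as tar:
        # Members are written in the recommended order; no entry is
        # created for the top directory itself.
        add_bytes(tar, basename + "/gpkg-1", b"")
        add_bytes(tar, basename + "/metadata.tar" + suffix,
                  inner_archive("metadata", metadata, comp))
        add_bytes(tar, basename + "/image.tar" + suffix,
                  inner_archive("image", image, comp))
    return out.getvalue()


# Hypothetical package contents, for illustration only.
container = build_gpkg("foo-1.0-1",
                       {"SLOT": b"0\n"},
                       {"usr/bin/foo": b"#!/bin/sh\n"})
with tarfile.open(fileobj=io.BytesIO(container), mode="r:") as tar:
    print(tar.getnames())
```

Only the inner archives are compressed here; the outer container is written
with plain ``"w"`` mode so that its members stay individually addressable.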
232
233 Permitted .tar format features
234 ------------------------------
235
236 The tar archives should use either the POSIX ustar format or a subset
237 of the GNU format with the following (optional) extensions:
238
239 - long pathnames and long linknames,
240
241 - base-256 encoding of large file sizes.
242
243 Other extensions should be avoided whenever possible.
244
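As a rough illustration of this guideline, the following sketch asks
Python's ``tarfile`` module whether a set of member headers fits strict
ustar and falls back to the GNU format otherwise. The function name
``pick_format`` and the sample members are hypothetical. Note that recent
Python versions default to the pax format, so a writer following this
section would need to request the format explicitly.

```python
import tarfile


def pick_format(members):
    """Return USTAR_FORMAT if every header fits strict ustar, else GNU_FORMAT."""
    try:
        for info in members:
            # tobuf() raises ValueError when a name or a numeric field
            # cannot be represented in the requested format.
            info.tobuf(format=tarfile.USTAR_FORMAT)
    except ValueError:
        return tarfile.GNU_FORMAT
    return tarfile.USTAR_FORMAT


# Hypothetical members, for illustration only.
short = tarfile.TarInfo("image/usr/bin/foo")
# A 300-character path component cannot be split into ustar prefix/name.
long_name = tarfile.TarInfo("image/" + "x" * 300)
# 1 TiB is above what the octal ustar size field can hold.
huge = tarfile.TarInfo("image/big.bin")
huge.size = 2 ** 40

print(pick_format([short]) == tarfile.USTAR_FORMAT)
print(pick_format([short, long_name]) == tarfile.GNU_FORMAT)
print(pick_format([huge]) == tarfile.GNU_FORMAT)
```

Probing with ``tobuf()`` avoids reimplementing the ustar field limits by
hand, at the cost of relying on ``tarfile``'s own interpretation of them.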
245
246 The package identifier file
247 ---------------------------
248
249 The package identifier file serves the purpose of identifying the binary
250 package format and its version.
251
252 The implementations must include a package identifier file named
253 ``gpkg-1``. The filename includes package format version;
254 implementations should reject packages which do not contain this file
255 as unsupported format.
256
257 The file can have any contents. Normally, it should be empty.
258
259 Furthermore, this file should be included in the .tar archive
260 as the first member. This makes it possible to use it as an additional
261 magic at a fixed location that can be used by tools such as file(1)
262 to easily distinguish Gentoo binary packages from regular .tar archives.
263
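A quick recognition check along these lines might look as follows. This is
a sketch under stated assumptions: the names ``looks_like_gpkg`` and
``make_tar`` are hypothetical, and the check only covers the fast path where
the identifier is the first member. Since implementations must also accept
members out of order, a negative result here would still call for a full
scan of the archive.

```python
import io
import tarfile


def looks_like_gpkg(fileobj):
    """Cheap check: is the first archive member `<dir>/gpkg-1`?"""
    try:
        # "r|" reads the archive as a forward-only, uncompressed stream.
        with tarfile.open(fileobj=fileobj, mode="r|") as tar:
            first = tar.next()
    except tarfile.TarError:
        return False
    if first is None or not first.isfile():
        return False
    # The directory name is deliberately not compared with anything,
    # because the package file may have been renamed.
    parts = first.name.split("/")
    return len(parts) == 2 and parts[1] == "gpkg-1"


def make_tar(names):
    """Build a tar of empty files; used for the demonstration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w",
                      format=tarfile.USTAR_FORMAT) as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name))
    return buf.getvalue()


good = make_tar(["foo-1.0-1/gpkg-1", "foo-1.0-1/metadata.tar"])
reordered = make_tar(["foo-1.0-1/metadata.tar", "foo-1.0-1/gpkg-1"])
print(looks_like_gpkg(io.BytesIO(good)))
print(looks_like_gpkg(io.BytesIO(reordered)))
```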
264
265 The metadata archive
266 --------------------
267
268 The metadata archive stores the package metadata needed for the package
269 manager to process it. The archive should be included at the beginning
270 of the binary package in order to make it possible to read it out of
271 partially fetched binary package, and to avoid fetching the remaining
272 part of the package if not necessary.
273
274 The archive contains a single directory called ``metadata``. In this
275 directory, the individual metadata keys are stored as files. The exact
276 keys and metadata format are outside the scope of this specification.
277
278 The package manager may need to modify the package metadata. In this
279 case, it should replace the metadata archive without having to alter
280 other package members.
281
282 The metadata archive can optionally be compressed. It can also be
283 supplemented with a detached OpenPGP signature.
284
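The metadata update can be sketched as below. This is only one possible
approach, not the method prescribed by the specification: it rewrites the
outer container while copying every other member byte for byte, so nothing
is recompressed. The names ``replace_metadata`` and ``pack`` are
hypothetical, the member payloads in the demonstration are placeholders
rather than real archives, and dropping the stale metadata signature is a
design choice of this sketch.

```python
import io
import posixpath
import tarfile


def replace_metadata(container, new_name, new_blob):
    """Return a copy of `container` with the metadata archive swapped out."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(container), mode="r:") as src, \
            tarfile.open(fileobj=out, mode="w",
                         format=tarfile.GNU_FORMAT) as dst:
        for member in src:
            base = posixpath.basename(member.name)
            if base.startswith("metadata.tar"):
                if base.endswith(".sig"):
                    # The old signature cannot match the new archive.
                    continue
                info = tarfile.TarInfo(posixpath.join(
                    posixpath.dirname(member.name), new_name))
                info.size = len(new_blob)
                dst.addfile(info, io.BytesIO(new_blob))
            elif member.isfile():
                # Copied verbatim, including unrecognized members.
                dst.addfile(member, src.extractfile(member))
            else:
                dst.addfile(member)
    return out.getvalue()


def pack(members):
    """Build a flat tar from (name, bytes) pairs; demonstration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w",
                      format=tarfile.GNU_FORMAT) as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Placeholder payloads, for illustration only.
old = pack([("foo-1/gpkg-1", b""),
            ("foo-1/metadata.tar.bz2.sig", b"old signature"),
            ("foo-1/metadata.tar.bz2", b"old metadata"),
            ("foo-1/image.tar.xz", b"opaque image bytes")])
new = replace_metadata(old, "metadata.tar", b"new metadata")
with tarfile.open(fileobj=io.BytesIO(new), mode="r:") as tar:
    print(tar.getnames())
```

A real implementation could presumably patch the container in place when the
new metadata archive occupies the same number of blocks; that optimization
is not shown here.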
285
286 The image archive
287 -----------------
288
289 The image archive stores all the files to be installed by the binary
290 package. It should be included as the last of the files in the binary
291 package container.
292
293 The archive contains a single directory called ``image``. Inside this
294 directory, all package files are stored in filesystem layout, relative
295 to the root directory.
296
297 The image archive can optionally be compressed. It can also be
298 supplemented with a detached OpenPGP signature.
299
300
301 Archive member compression
302 --------------------------
303
304 The archive members outlined above support optional compression using
305 one of the compressed file formats supported by the package manager.
306 The exact list of compression types is outside the scope of this
307 specification.
308
309 The implementations must support archive members being uncompressed,
310 and must support using different compression types for different files.
311
312 When compressing an archive member, the member filename should be
313 suffixed using the standard suffix for the particular compressed file
314 type (e.g. ``.bz2`` for bzip2 format).
315
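The suffix rule lends itself to a small lookup, sketched below. The table
``KNOWN_SUFFIXES`` is an assumption: it lists only the formats that Python's
``tarfile`` module can open, whereas the actual set of supported compression
types is left to the package manager. The function name ``classify_member``
is likewise hypothetical.

```python
# Assumed mapping from filename suffix to tarfile compression name.
KNOWN_SUFFIXES = {".gz": "gz", ".bz2": "bz2", ".xz": "xz"}


def classify_member(filename):
    """Return (role, compression) for an archive member, or None.

    `role` is "metadata" or "image"; `compression` is None when the
    member is stored uncompressed.
    """
    for role in ("metadata", "image"):
        stem = role + ".tar"
        if filename == stem:
            return role, None
        if filename.startswith(stem + "."):
            comp = KNOWN_SUFFIXES.get(filename[len(stem):])
            if comp is not None:
                return role, comp
    # Signatures, the identifier file and unknown members end up here.
    return None


print(classify_member("metadata.tar"))
print(classify_member("image.tar.xz"))
print(classify_member("image.tar.xz.sig"))
```

Because each member is classified on its own, a package that mixes, say, an
uncompressed metadata archive with an xz-compressed image is handled without
special cases.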
316
317 OpenPGP member signatures
318 -------------------------
319
320 The archive members support optional OpenPGP signatures.
321 The implementations must allow the user to specify whether OpenPGP
322 signatures are to be expected in remotely fetched packages.
323
324 If the signatures are expected and the archive member is unsigned, the
325 package manager must reject processing it. If the signature does not
326 verify, the package manager must reject processing the corresponding
327 archive member. In particular, it must not attempt decompressing
328 compressed members in those circumstances.
329
330 The signatures are created as binary detached OpenPGP signature files,
331 with filename corresponding to the member filename with ``.sig`` suffix
332 appended.
333
334 The exact details regarding creating and verifying signatures, as well
335 as maintaining and distributing keys are outside the scope of this
336 specification.
337
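The acceptance rules can be summarized in a short policy sketch. The actual
OpenPGP verification is out of scope here, so it is passed in as a
``verify(data, signature)`` callable; that callable, the ``SignatureError``
class and the function ``check_member`` are all hypothetical names. The
sketch also verifies any signature that happens to be present even when
signatures are not required, which is one possible reading of the text
above.

```python
class SignatureError(Exception):
    """Raised when a member must not be processed."""


def check_member(name, members, verify, require_signatures):
    """Return the raw bytes of `name` only if the signature policy allows it.

    `members` maps member filenames to their raw (still compressed) bytes.
    Nothing is decompressed here, so a rejected member is never unpacked.
    """
    data = members[name]
    signature = members.get(name + ".sig")
    if signature is None:
        if require_signatures:
            raise SignatureError(name + ": signature expected but missing")
        return data
    if not verify(data, signature):
        raise SignatureError(name + ": signature does not verify")
    return data


def fake_verify(data, signature):
    """Stand-in for a real OpenPGP check; demonstration only."""
    return signature == b"good"


# Placeholder payloads, for illustration only.
members = {
    "metadata.tar.bz2": b"meta",
    "metadata.tar.bz2.sig": b"good",
    "image.tar.bz2": b"img",
}
print(check_member("metadata.tar.bz2", members, fake_verify, True))
```

Running the check on the compressed bytes, before any decompression, is what
keeps unverified data away from the decompressor.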
338
339 Rationale
340 =========
341
342 Package formats used by other distributions
343 -------------------------------------------
344
345 The research on the new package format included investigating
346 the possibility of reusing solutions from other operating system
347 distributions. While reusing a foreign package format would be
348 interesting, the differences in Gentoo metadata structure would prevent
349 any real compatibility. Some degree of compatibility might be achieved
350 through adapting the Gentoo metadata, however the costs of such
351 a solution would probably outweigh its usefulness.
352
353 Debian and its derivatives are using the .deb package format. This is
354 a nested archive format, with the outer archive being of ar format,
355 and containing nested tarballs of control information (metadata)
356 and data [#DEB-FORMAT]_.
357
358 Red Hat, its derivatives and some less related distributions are using
359 the RPM format. It is a custom binary format, storing metadata directly
360 and using a trailer cpio archive to store package files.
361
362 Arch Linux is using xz-compressed tarballs (suffixed ``.pkg.tar.xz``)
363 as its binary package format. The tarballs contain package files
364 on top-level, with specially named dotfiles used for package metadata.
365 OpenPGP signatures are stored as detached ``.sig`` files alongside
366 packages.
367
368 Exherbo is using the pbins format. In this format, the binary package
369 metadata is stored in the repository like ebuilds, and the binary package
370 files are stored separately and downloaded like source tarballs.
371
372
373 Nested archive format
374 ---------------------
375
376 The basic problem in designing the new format was how to embed multiple
377 data streams (metadata, image) into a single file. Traditionally, this
378 has been done via using two non-conflicting file formats. However,
379 while such a solution is clever, it suffers in terms of transparency.
380
381 Therefore, it has been established that the new format should really
382 consist of a single archive format, with all necessary data
383 transparently accessible inside the file. Consequently, it has been
384 debated how different parts of binary package data should be stored
385 inside that archive.
386
387 The proposal to continue storing image data as top-level data
388 in the package format, and store metadata as special directory in that
389 structure has been discarded as a case of in-band signalling.
390
391 Finally, the proposal has been shaped to store different kinds of data
392 as nested archives in the outer binary package container. Besides
393 providing a clean way of accessing different kinds of information, it
394 makes it possible to add separate OpenPGP signatures to them.
395
396
397 Inner vs. outer compression
398 ---------------------------
399
400 One of the points in the new format debate was whether the binary
401 package as a whole should be compressed vs. compressing individual
402 members. The first option may seem an obvious choice, especially
403 given that with a larger data set, the compression may proceed more
404 effectively. However, it has a single strong disadvantage: compression
405 prevents random access and manipulation of the binary package members.
406
407 While for the purpose of reading binary packages, the problem could be
408 circumvented through convenient member ordering and avoiding disjoint
409 reads of the binary package, metadata updates would either require
410 recompressing the whole package (which could be really time consuming
411 with large packages) or applying complex techniques such as splitting
412 the compressed archive into multiple compressed streams.
413
414 This considered, the simplest solution is to apply compression to
415 the individual package members, while leaving the container format
416 uncompressed. It provides fast random access to the individual members,
417 as well as capability of updating them without the necessity of
418 recompressing other files in the container.
419
420 This also makes it possible to easily protect compressed files using
421 standard OpenPGP detached signature format. All this combined,
422 the package manager may perform partial fetch of binary package, verify
423 the signature of its metadata member and process it without having to
424 fetch the potentially-large image part.
425
426
427 Container and archive formats
428 -----------------------------
429
430 During the debate, the actual archive formats to use were considered.
431 The .tar format seemed an obvious choice for the image archive since
432 it is the only widely deployed archive format that stores all kinds
433 of file metadata on POSIX systems. However, multiple options for
434 the outer format have been debated.
435
436 Firstly, the ZIP format has been proposed as the only commonly supported
437 format supporting adding files from stdin (i.e. making it possible to
438 pipe the inner archives straight into the container without using
439 temporary files). However, this format has been clearly rejected
440 as both not being present in the system set, and being trailer-based
441 and therefore unusable without having to fetch the whole file.
442
443 Secondly, the ar and cpio formats were considered. The former is used
444 by Debian and its derivative binary packages; the latter is used by Red
445 Hat derivatives. Both formats have the advantage of having less
446 historical baggage than .tar, and having less overhead. However, both
447 are also rather obscure (especially given that ar is actually provided
448 by GNU binutils rather than as a stand-alone archiver), considered
449 obsolete by POSIX and both have file size limitations smaller than .tar.
450
451 Thirdly, SquashFS was another interesting option. Its main advantage is
452 transparent compression support and ability to mount as a filesystem.
453 However, it has a significant implementation complexity, including mount
454 management and necessity of fallback to unsquashfs. Since the image
455 needs to be writable for the pre-installation manipulations, using it
456 via a mount would additionally require some kind of overlay filesystem.
457 Using it as top-level format has no real gain over a pipeline with tar,
458 and is certainly less portable. Therefore, there does not seem to be
459 a benefit in using SquashFS.
460
461 All that considered, it has been decided that there is no purpose
462 in using a second archive format in the specification unless it has
463 significant advantage over .tar. Therefore, .tar has also been used
464 as outer package format, even though it has larger overhead than other
465 formats (mostly due to padding).
466
467
468 .tar portability issues
469 -----------------------
470
471 The modern .tar dialects could be considered dirty extensions
472 of the original .tar format. Three variants may be considered
473 of interest: POSIX ustar, pax (newer POSIX standard) and GNU tar.
474 All three formats are supported by GNU tar, whose presence on systems
475 used to create binary packages could be relied on. Therefore,
476 the portability concerns are related mostly to being able to read
477 and modify binary packages in scenarios of GNU tar being unavailable.
478
479 For the purpose of this specification, detailed research on portability
480 of individual tar features has been conducted [#TAR-PORTABILITY]_. The research concluded:
481
482 Judging by the test results, the most portability could be
483 achieved by:
484
485 - using strict POSIX ustar format whenever possible,
486
487 - using GNU format for long paths (that do not fit in ustar format),
488
489 - using base-256 (+ pax if already used) encoding for large files,
490
491 - using pax (+ octal or base-256) for high-range/precision
492 timestamps and user/group identifiers,
493
494 - using pax attributes for extended metadata and/or volume label.
495
496 It has been determined that for the purpose of binary package we really
497 only need to be concerned about long paths and huge files. Therefore,
498 the above was limited to the first three points and a guideline was
499 formed from them.
500
501 Debian has a similar guideline for the inner tar of their package
502 format [#DEB-FORMAT]_.
503
504
505 Member ordering
506 ---------------
507
508 The member ordering is explicitly specified in order to provide for
509 trivially reading metadata from partially fetched archives.
510 By requiring the metadata archive to be stored before the image archive,
511 the package manager may stop fetching after reading it and save
512 bandwidth and/or space.
513
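The benefit of this ordering can be demonstrated with a sketch that reads
the metadata from a stream front to back and stops as soon as the metadata
archive has been consumed. The names ``read_metadata`` and ``pack`` and the
sample package are hypothetical, signature checks are omitted, and the
truncation offset in the demonstration assumes that each member header
occupies exactly one 512-byte block, which holds for the short names used
here.

```python
import io
import tarfile


def read_metadata(stream):
    """Return {key: bytes} from the metadata archive of a gpkg stream."""
    # "r|" never seeks, so it also works on a download in progress.
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            base = member.name.rsplit("/", 1)[-1]
            if base.startswith("metadata.tar") and not base.endswith(".sig"):
                blob = outer.extractfile(member).read()
                with tarfile.open(fileobj=io.BytesIO(blob),
                                  mode="r:*") as inner:
                    return {m.name.split("/", 1)[1]:
                            inner.extractfile(m).read()
                            for m in inner
                            if m.isfile() and "/" in m.name}
    raise ValueError("no metadata archive in stream")


def pack(members, comp=""):
    """Build a tar from (name, bytes) pairs; demonstration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:" + comp,
                      format=tarfile.GNU_FORMAT) as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical package, for illustration only.
meta = pack([("metadata/SLOT", b"0\n")], comp="gz")
image = pack([("image/usr/bin/foo", b"#!/bin/sh\n")])
full = pack([("foo-1/gpkg-1", b""),
             ("foo-1/metadata.tar.gz", meta),
             ("foo-1/image.tar", image)])

# Two 512-byte headers plus the metadata payload rounded up to a block.
cut = 512 + 512 + -(-len(meta) // 512) * 512
partial = full[:cut]
print(read_metadata(io.BytesIO(partial)) == read_metadata(io.BytesIO(full)))
```

The truncated prefix contains no part of the image archive, yet the metadata
is recovered in full, which is the saving the ordering is meant to enable.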
514
515 Detached OpenPGP signatures
516 ---------------------------
517
518 The use of detached OpenPGP signatures is to provide authenticity checks
519 for binary packages. Covering the complete members with signatures
520 provides for trivial verification of all metadata and image contents
521 respectively, without having to invent custom mechanisms for combining
522 them. Covering the compressed archives helps to prevent zipbomb
523 attacks. Covering the individual members rather than the whole package
524 provides for verification of partially fetched binary packages.
525
526
527 Format versioning
528 -----------------
529
530 The format is versioned through an explicit file, with the version
531 stored in the filename. If the format changes incompatibly,
532 the filename changes and old implementations do not recognize it
533 as a valid package.
534
535 Previously, the format tried to avoid an explicit file for this purpose
536 and used volume label instead. However, the use of label has been
537 renounced due to unforeseen portability issues.
538
539
540 Backwards Compatibility
541 =======================
542
543 The format does not preserve backwards compatibility with the tbz2
544 packages. It has been established that preserving compatibility with
545 the old format was impossible without making the new format even worse
546 than the old one was.
547
548 For example, adding any visible members to the tarball would cause
549 them to be installed to the filesystem by old Portage versions. Working
550 around this would require some kind of awful hacks that would oppose
551 the goal of using simple and transparent package format.
552
553
554 Reference Implementation
555 ========================
556
557 The proof-of-concept implementation of binary package format converter
558 is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
559 create packages in the new format for early inspection.
560
561
562 References
563 ==========
564
565 .. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
566 packages
567 (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)
568
569 .. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
570 written in C
571 (https://packages.gentoo.org/packages/app-portage/portage-utils)
572
573 .. [#DEB-FORMAT] deb(5) — Debian binary package format
574 (https://manpages.debian.org/unstable/dpkg-dev/deb.5.en.html)
575
576 .. [#TAR-PORTABILITY] Michał Górny, Portability of tar features
577 (https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html)
578
579 .. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
580 to gpkg binpkg format
581 (https://github.com/mgorny/xpak2gpkg)
582
583
584 Copyright
585 =========
586 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
587 Unported License. To view a copy of this license, visit
588 http://creativecommons.org/licenses/by-sa/3.0/.
589
590 --
591 Best regards,
592 Michał Górny
