Gentoo Archives: gentoo-commits

From: "Ulrich Müller" <ulm@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] data/glep:master commit in: /
Date: Sat, 08 Dec 2018 09:41:08
Message-Id: 1544261699.2795fb710f678f36e558db708fae7b248914f159.ulm@gentoo
1 commit: 2795fb710f678f36e558db708fae7b248914f159
2 Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
3 AuthorDate: Sat Nov 17 11:17:11 2018 +0000
4 Commit: Ulrich Müller <ulm <AT> gentoo <DOT> org>
5 CommitDate: Sat Dec 8 09:34:59 2018 +0000
6 URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=2795fb71
7
8 glep-0078: GLEP draft, 'Gentoo binary package container format'
9
10 Signed-off-by: Michał Górny <mgorny <AT> gentoo.org>
11 Signed-off-by: Ulrich Müller <ulm <AT> gentoo.org>
12 Bug: https://bugs.gentoo.org/672672
13
14 glep-0078.rst | 575 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15 1 file changed, 575 insertions(+)
16
17 diff --git a/glep-0078.rst b/glep-0078.rst
18 new file mode 100644
19 index 0000000..edb4129
20 --- /dev/null
21 +++ b/glep-0078.rst
22 @@ -0,0 +1,575 @@
23 +---
24 +GLEP: 78
25 +Title: Gentoo binary package container format
26 +Author: Michał Górny <mgorny@g.o>
27 +Type: Standards Track
28 +Status: Draft
29 +Version: 1
30 +Created: 2018-11-15
31 +Last-Modified: 2018-11-30
32 +Post-History: 2018-11-17
33 +Content-Type: text/x-rst
34 +---
35 +
36 +Abstract
37 +========
38 +
39 +This GLEP proposes a new binary package container format for Gentoo.
40 +The current tbz2/XPAK format is briefly described, and its deficiencies
41 +are explained. Accordingly, the requirements for a new format are set
42 +and a gpkg format satisfying them is proposed. The rationale for
43 +the design decisions is provided.
44 +
45 +
46 +Motivation
47 +==========
48 +
49 +The current Portage binary package format
50 +-----------------------------------------
51 +
52 +The historical ``.tbz2`` binary package format used by Portage is
53 +a concatenation of two distinct formats: a header-oriented compressed
54 +.tar format (used to hold package files) and a trailer-oriented custom
55 +XPAK format (used to hold metadata) [#MAN-XPAK]_. The format has already
56 +been extended incompatibly twice.
57 +
58 +The first time, support was added for storing multiple successive builds
59 +of a binary package for a single ebuild version. This feature relies
60 +on appending a hyphen followed by an integer to the package filename.
61 +It is disabled by default (preserving backwards compatibility)
62 +and controlled by the ``binpkg-multi-instance`` feature.
63 +
64 +The second time, support for additional compression formats was
65 +added. When a format other than bzip2 is used, the ``.tbz2`` suffix
66 +is replaced by ``.xpak`` and Portage relies on magic bytes to detect
67 +the compression used. For backwards compatibility, Portage still
68 +defaults to bzip2; the compression program can be switched using
69 +the ``BINPKG_COMPRESS`` configuration variable.
70 +
71 +Additionally, there have been minor changes to the stored metadata
72 +and file storage policies. In particular, behavior regarding
73 +``INSTALL_MASK``, controllable file compression and stripping has
74 +changed over time.
75 +
76 +
77 +The advantages of tbz2/XPAK format
78 +----------------------------------
79 +
80 +The tbz2/XPAK format used by Portage has three interesting features:
81 +
82 +1. **Each binary package is fully contained within a single file.**
83 + While this might seem unnecessary, it makes it easier for the user
84 + to transfer binary packages without having to be concerned about
85 + finding all the necessary files to transfer.
86 +
87 +2. **The binary packages are compatible with regular compressed
88 + tarballs, most of the time.** With the notable exceptions of historical
89 + versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
90 + can be extracted using the regular tar utility with a compressor
91 + implementation that discards trailing garbage.
92 +
93 +3. **The metadata is uncompressed, and can be efficiently accessed
94 + without decompressing package contents.** This includes
95 + the possibility of rewriting it (e.g. as a result of package moves)
96 + without the necessity of repacking the files.
97 +
98 +
99 +Transparency problem with the current binary package format
100 +-----------------------------------------------------------
101 +
102 +Notwithstanding its advantages, the tbz2/XPAK format has a significant
103 +design fault that consists of two issues:
104 +
105 +1. **The XPAK format is a custom binary format with explicit use
106 + of binary-encoded file offsets and field lengths.** As such, it is
107 + non-trivial to read or edit without specialized tools. Such tools
108 + are currently implemented separately from the package manager,
109 + as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.
110 +
111 +2. **The tarball compatibility feature relies on the obscure feature
112 + of ignoring trailing garbage in compressed files**. While this is
113 + implemented consistently in most of the compressors, this feature
114 + is not really a part of the specifications but rather a traditional
115 + behavior. Given that the original reasons for this no longer apply,
116 + new compressor implementations are likely to lack support for it.
117 +
118 +Both of the issues make the format hard to use without dedicated tools,
119 +or when the tools misbehave. This impacts the following scenarios:
120 +
121 +A. **Using binary packages for system recovery.** In case of serious
122 + breakage, it is really preferable that the format depends on as few
123 + tools as possible, and especially not on Gentoo-specific tools.
124 +
125 +B. **Inspecting binary packages in detail exceeding standard package
126 + manager facilities.**
127 +
128 +C. **Modifying binary packages in ways not predicted by the package
129 + manager authors.** A real-life example of this is working around
130 + broken ``pkg_*`` phases which prevent the package from being
131 + installed.
132 +
133 +
134 +OpenPGP extensibility problem
135 +-----------------------------
136 +
137 +There are at least three obvious ways in which the current format could
138 +be extended to support OpenPGP signatures, and each of them has its own
139 +distinct problem:
140 +
141 +1. **Adding a detached signature.** This option is non-intrusive but
142 + causes the format to no longer be contained in a single file.
143 +
144 +2. **Wrapping the package in OpenPGP message format.** This would use
145 + a standard format and make verification and unpacking relatively
146 + easy. However, it would break backwards compatibility and add
147 + an explicit dependency on an OpenPGP implementation in order to
148 + unpack the package.
149 +
150 +3. **Adding an OpenPGP signature as an extra XPAK member.** This is
151 + the clever solution. It implies strengthening the dependency
152 + on custom tooling, now additionally necessary to extract
153 + the signature and reconstruct the original file to accommodate
154 + verification.
155 +
156 +
157 +Goals for a new container format
158 +--------------------------------
159 +
160 +All of the above considered, the new format should combine
161 +the advantages of the existing format and at the same time address its
162 +deficiencies whenever possible. Furthermore, since a format replacement
163 +is taking place, it is worthwhile to consider additional goals that
164 +could be satisfied with little change.
165 +
166 +The following obligatory goals have been set for a replacement format:
167 +
168 +1. **The packages must remain contained in a single file.** As a matter
169 + of user convenience, it should be possible to transfer binary
170 + packages without having to use multiple files, and to install them
171 + from any location.
172 +
173 +2. **The file format must be entirely based on common file formats,
174 + respecting best practices, with as little customization as necessary
175 + to satisfy the requirements.** The format should be transparent
176 + enough to let the user inspect and manipulate it without special
177 + tooling or detailed knowledge.
178 +
179 +3. **The file format must provide support for OpenPGP signatures.**
180 + Preferably, it should use standard OpenPGP message formats.
181 +
182 +4. **The file format must allow for efficient metadata updates.**
183 + In particular, it should be possible to update the metadata without
184 + having to recompress package files.
185 +
186 +Additionally, the following optional goals have been noted:
187 +
188 +A. **The file format should account for easy recognition both through
189 + filename and through contents.** Preferably, it should have distinct
190 + features making it possible to detect it via file(1).
191 +
192 +B. **The file format should provide for partial fetching of binary
193 + packages.** It should be possible to easily fetch and read
194 + the package metadata without having to download the whole package.
195 +
196 +C. **The file format should allow for metadata compression.**
197 +
198 +D. **The file format should make future extensions easily possible
199 + without breaking backwards compatibility.**
200 +
201 +
202 +Specification
203 +=============
204 +
205 +The container format
206 +--------------------
207 +
208 +The gpkg package container is an uncompressed .tar archive whose
209 +filename should use the ``.gpkg.tar`` suffix.
210 +
211 +The archive contains a number of files, stored in a single directory
212 +whose name should match the basename of the package file. However,
213 +the implementation must be able to process an archive where
214 +the directory name is mismatched. There should be no explicit archive
215 +member entry for the directory.
216 +
217 +The package directory contains the following members, in order:
218 +
219 +1. The package format identifier file ``gpkg-1`` (required).
220 +
221 +2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
222 + (optional).
223 +
224 +3. The metadata archive ``metadata.tar${comp}``, optionally compressed
225 + (required).
226 +
227 +4. A signature for the filesystem image archive:
228 + ``image.tar${comp}.sig`` (optional).
229 +
230 +5. The filesystem image archive ``image.tar${comp}``, optionally
231 + compressed (required).
232 +
233 +It is recommended that the relative order of the archive members is
234 +preserved. However, implementations must support archives with members
235 +out of order.
236 +
237 +The container may be extended with additional members in the future.
238 +The implementations should ignore unrecognized members and preserve
239 +them across package updates.
240 +
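The container layout described above can be sketched with Python's standard ``tarfile`` module. This is a minimal illustration, not part of the GLEP: the helper names ``add_member``, ``write_gpkg`` and ``list_members`` are hypothetical, signatures are omitted, and the inner archives are assumed to be prepared elsewhere and passed in as bytes.

```python
import io
import tarfile


def add_member(container, name, data):
    """Add one in-memory blob to the outer archive as a regular file."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    container.addfile(info, io.BytesIO(data))


def write_gpkg(fileobj, basename, metadata, image, comp=""):
    """Write the required members in the recommended order.

    `metadata` and `image` hold the bytes of the already built (and
    optionally compressed) inner archives; `comp` is their suffix,
    e.g. "" or ".bz2". These parameter names are assumptions.
    """
    # The outer archive is uncompressed (mode "w"). No entry is added
    # for the directory itself, only for the files inside it.
    with tarfile.open(fileobj=fileobj, mode="w",
                      format=tarfile.GNU_FORMAT) as tar:
        add_member(tar, f"{basename}/gpkg-1", b"")
        add_member(tar, f"{basename}/metadata.tar{comp}", metadata)
        add_member(tar, f"{basename}/image.tar{comp}", image)


def list_members(fileobj):
    """Return the member names of an uncompressed outer archive."""
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        return tar.getnames()
```

A caller would pass an open binary file (or an ``io.BytesIO``) as ``fileobj`` and the package file's basename as ``basename``, so that the directory name matches the recommendation above.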
241 +
242 +Permitted .tar format features
243 +------------------------------
244 +
245 +The tar archives should use either the POSIX ustar format or a subset
246 +of the GNU format with the following (optional) extensions:
247 +
248 +- long pathnames and long linknames,
249 +
250 +- base-256 encoding of large file sizes.
251 +
252 +Other extensions should be avoided whenever possible.
253 +
254 +
255 +The package identifier file
256 +---------------------------
257 +
258 +The package identifier file serves the purpose of identifying the binary
259 +package format and its version.
260 +
261 +The implementations must include a package identifier file named
262 +``gpkg-1``. The filename includes the package format version;
263 +implementations should reject packages which do not contain this file
264 +as an unsupported format.
265 +
266 +The file can have any contents. Normally, it should be empty.
267 +
268 +Furthermore, this file should be included in the .tar archive
269 +as the first member. This makes it possible to use it as an additional
270 +magic at a fixed location that can be used by tools such as file(1)
271 +to easily distinguish Gentoo binary packages from regular .tar archives.
272 +
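As an illustration of the identifier check, the sketch below accepts a package only if the outer archive has a member whose basename is ``gpkg-1``. The function name ``is_gpkg`` is hypothetical. It deliberately looks at all members rather than only the first one, on the assumption that an implementation must also cope with out-of-order members and a mismatched directory name.

```python
import io
import posixpath
import tarfile


def is_gpkg(fileobj):
    """Return True if the outer archive contains a gpkg-1 member."""
    # mode "r:" opens the container as an uncompressed tar archive.
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        # Compare basenames only, so the directory name is not relied on.
        return any(posixpath.basename(member.name) == "gpkg-1"
                   for member in tar)
```

A tool such as file(1) would instead look only at the first header, which is why the specification asks for the identifier to be the first member.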
273 +
274 +The metadata archive
275 +--------------------
276 +
277 +The metadata archive stores the package metadata needed for the package
278 +manager to process it. The archive should be included at the beginning
279 +of the binary package in order to make it possible to read it out of
280 +a partially fetched binary package, and to avoid fetching the remaining
281 +part of the package if not necessary.
282 +
283 +The archive contains a single directory called ``metadata``. In this
284 +directory, the individual metadata keys are stored as files. The exact
285 +keys and metadata format are outside the scope of this specification.
286 +
287 +The package manager may need to modify the package metadata. In this
288 +case, it should replace the metadata archive without having to alter
289 +other package members.
290 +
291 +The metadata archive can optionally be compressed. It can also be
292 +supplemented with a detached OpenPGP signature.
293 +
294 +
295 +The image archive
296 +-----------------
297 +
298 +The image archive stores all the files to be installed by the binary
299 +package. It should be included as the last of the files in the binary
300 +package container.
301 +
302 +The archive contains a single directory called ``image``. Inside this
303 +directory, all package files are stored in filesystem layout, relative
304 +to the root directory.
305 +
306 +The image archive can optionally be compressed. It can also be
307 +supplemented with a detached OpenPGP signature.
308 +
309 +
310 +Archive member compression
311 +--------------------------
312 +
313 +The archive members outlined above support optional compression using
314 +one of the compressed file formats supported by the package manager.
315 +The exact list of compression types is outside the scope of this
316 +specification.
317 +
318 +The implementations must support archive members being uncompressed,
319 +and must support using different compression types for different files.
320 +
321 +When compressing an archive member, the member filename should be
322 +suffixed using the standard suffix for the particular compressed file
323 +type (e.g. ``.bz2`` for bzip2 format).
324 +
325 +
326 +OpenPGP member signatures
327 +-------------------------
328 +
329 +The archive members support optional OpenPGP signatures.
330 +The implementations must allow the user to specify whether OpenPGP
331 +signatures are to be expected in remotely fetched packages.
332 +
333 +If the signatures are expected and the archive member is unsigned, the
334 +package manager must reject processing it. If the signature does not
335 +verify, the package manager must reject processing the corresponding
336 +archive member. In particular, it must not attempt decompressing
337 +compressed members in those circumstances.
338 +
339 +The signatures are created as binary detached OpenPGP signature files,
340 +with filename corresponding to the member filename with ``.sig`` suffix
341 +appended.
342 +
343 +The exact details regarding creating and verifying signatures, as well
344 +as maintaining and distributing keys are outside the scope of this
345 +specification.
346 +
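The acceptance rules above can be read as a small decision procedure. The sketch below is one possible reading, not the GLEP's reference behavior: ``members_to_process`` is a hypothetical name, ``verify`` stands in for whatever OpenPGP backend the package manager uses, and a present signature is assumed to be checked even when signatures are not required.

```python
def members_to_process(names, signatures_expected, verify):
    """Return the archive members that may be processed.

    `names` lists the member basenames found in the container.
    `verify(member, signature)` is a caller-supplied callable that
    returns True when the detached signature is valid.
    """
    present = set(names)
    accepted = []
    for name in names:
        # Signatures and the identifier file are not payload members.
        if name.endswith(".sig") or name == "gpkg-1":
            continue
        signature = name + ".sig"
        if signature in present:
            if not verify(name, signature):
                continue  # signature does not verify: reject the member
        elif signatures_expected:
            continue  # unsigned member while signatures are expected
        accepted.append(name)
    return accepted
```

Rejected members are simply left out of the result, so the caller never reaches the decompression step for them.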
347 +
348 +Rationale
349 +=========
350 +
351 +Package formats used by other distributions
352 +-------------------------------------------
353 +
354 +The research on the new package format included investigating
355 +the possibility of reusing solutions from other operating system
356 +distributions. While reusing a foreign package format would be
357 +interesting, the differences in Gentoo metadata structure would prevent
358 +any real compatibility. Some degree of compatibility might be achieved
359 +through adapting the Gentoo metadata, however the costs of such
360 +a solution would probably outweigh its usefulness.
361 +
362 +Debian and its derivatives are using the .deb package format. This is
363 +a nested archive format, with the outer archive being of ar format,
364 +and containing nested tarballs of control information (metadata)
365 +and data [#DEB-FORMAT]_.
366 +
367 +Red Hat, its derivatives and some less related distributions are using
368 +the RPM format. It is a custom binary format, storing metadata directly
369 +and using a trailer cpio archive to store package files.
370 +
371 +Arch Linux is using xz-compressed tarballs (suffixed ``.pkg.tar.xz``)
372 +as its binary package format. The tarballs contain package files
373 +on top-level, with specially named dotfiles used for package metadata.
374 +OpenPGP signatures are stored as detached ``.sig`` files alongside
375 +packages.
376 +
377 +Exherbo is using the pbins format. In this format, the binary package
378 +metadata is stored in the repository like ebuilds, and the binary
379 +package files are stored separately and downloaded like source tarballs.
380 +
381 +
382 +Nested archive format
383 +---------------------
384 +
385 +The basic problem in designing the new format was how to embed multiple
386 +data streams (metadata, image) into a single file. Traditionally, this
387 +has been done by using two non-conflicting file formats. However,
388 +while such a solution is clever, it suffers in terms of transparency.
389 +
390 +Therefore, it has been established that the new format should really
391 +consist of a single archive format, with all necessary data
392 +transparently accessible inside the file. Consequently, it has been
393 +debated how different parts of binary package data should be stored
394 +inside that archive.
395 +
396 +The proposal to continue storing image data as top-level data
397 +in the package format, and store metadata as a special directory in that
398 +structure has been discarded as a case of in-band signalling.
399 +
400 +Finally, the proposal has been shaped to store different kinds of data
401 +as nested archives in the outer binary package container. Besides
402 +providing a clean way of accessing different kinds of information, it
403 +makes it possible to add separate OpenPGP signatures to them.
404 +
405 +
406 +Inner vs. outer compression
407 +---------------------------
408 +
409 +One of the points in the new format debate was whether to compress
410 +the binary package as a whole or to compress its individual members.
411 +The first option may seem like the obvious choice, especially given
412 +that with a larger data set, the compression may proceed more
413 +effectively. However, it has a single strong disadvantage: compression
414 +prevents random access and manipulation of the binary package members.
415 +
416 +While for the purpose of reading binary packages, the problem could be
417 +circumvented through convenient member ordering and avoiding disjoint
418 +reads of the binary package, metadata updates would either require
419 +recompressing the whole package (which could be really time consuming
420 +with large packages) or applying complex techniques such as splitting
421 +the compressed archive into multiple compressed streams.
422 +
423 +This considered, the simplest solution is to apply compression to
424 +the individual package members, while leaving the container format
425 +uncompressed. It provides fast random access to the individual members,
426 +as well as capability of updating them without the necessity of
427 +recompressing other files in the container.
428 +
429 +This also makes it possible to easily protect compressed files using
430 +standard OpenPGP detached signature format. All this combined,
431 +the package manager may perform a partial fetch of a binary package,
432 +verify the signature of its metadata member and process it without
433 +having to fetch the potentially large image part.
434 +
435 +
436 +Container and archive formats
437 +-----------------------------
438 +
439 +During the debate, the actual archive formats to use were considered.
440 +The .tar format seemed an obvious choice for the image archive since
441 +it is the only widely deployed archive format that stores all kinds
442 +of file metadata on POSIX systems. However, multiple options for
443 +the outer format have been debated.
444 +
445 +Firstly, the ZIP format has been proposed as the only commonly supported
446 +format supporting adding files from stdin (i.e. making it possible to
447 +pipe the inner archives straight into the container without using
448 +temporary files). However, this format has been clearly rejected
449 +as both not being present in the system set, and being trailer-based
450 +and therefore unusable without having to fetch the whole file.
451 +
452 +Secondly, the ar and cpio formats were considered. The former is used
453 +by the binary packages of Debian and its derivatives; the latter is
454 +used by Red Hat derivatives. Both formats have the advantage of having
455 +less historical baggage than .tar, and having less overhead. However,
456 +both are also rather obscure (especially given that ar is actually
457 +provided by GNU binutils rather than as a stand-alone archiver), are
458 +considered obsolete by POSIX, and have lower file size limits than .tar.
459 +
460 +Thirdly, SquashFS was another interesting option. Its main advantage is
461 +transparent compression support and ability to mount as a filesystem.
462 +However, it has significant implementation complexity, including
463 +mount management and the necessity of a fallback to unsquashfs. Since
464 +the image needs to be writable for the pre-installation manipulations,
465 +using it via a mount would additionally require some kind of overlay
466 +filesystem. Using it as the top-level format has no real gain over
467 +a pipeline with tar, and is certainly less portable. Therefore, there
468 +does not seem to be a benefit in using SquashFS.
469 +
470 +All that considered, it has been decided that there is no purpose
471 +in using a second archive format in the specification unless it has
472 +a significant advantage over .tar. Therefore, .tar is also used
473 +as the outer package format, even though it has larger overhead than
474 +other formats (mostly due to padding).
475 +
476 +
477 +.tar portability issues
478 +-----------------------
479 +
480 +The modern .tar dialects could be considered dirty extensions
481 +of the original .tar format. Three variants may be considered
482 +of interest: POSIX ustar, pax (newer POSIX standard) and GNU tar.
483 +All three formats are supported by GNU tar, whose presence on systems
484 +used to create binary packages could be relied on. Therefore,
485 +the portability concerns are related mostly to being able to read
486 +and modify binary packages in scenarios where GNU tar is unavailable.
487 +
488 +For the purpose of this specification, the portability of individual tar
489 +features was researched in detail [#TAR-PORTABILITY]_. It concluded:
490 +
491 + Judging by the test results, the most portability could be
492 + achieved by:
493 +
494 + - using strict POSIX ustar format whenever possible,
495 +
496 + - using GNU format for long paths (that do not fit in ustar format),
497 +
498 + - using base-256 (+ pax if already used) encoding for large files,
499 +
500 + - using pax (+ octal or base-256) for high-range/precision
501 + timestamps and user/group identifiers,
502 +
503 + - using pax attributes for extended metadata and/or volume label.
504 +
505 +It has been determined that for the purpose of binary packages we really
506 +only need to be concerned about long paths and huge files. Therefore,
507 +the above was limited to the first three points and a guideline was
508 +formed from them.
509 +
510 +Debian has a similar guideline for the inner tar of their package
511 +format [#DEB-FORMAT]_.
512 +
513 +
514 +Member ordering
515 +---------------
516 +
517 +The member ordering is explicitly specified in order to provide for
518 +trivially reading metadata from partially fetched archives.
519 +By requiring the metadata archive to be stored before the image archive,
520 +the package manager may stop fetching after reading it and save
521 +bandwidth and/or space.
522 +
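To illustrate why the ordering matters, the following sketch reads the metadata archive from a sequential stream and stops there. It is an assumption-laden example rather than Portage code: ``read_metadata_member`` is a hypothetical name, and ``stream`` is any file-like object that yields the bytes fetched so far.

```python
import io
import posixpath
import tarfile


def read_metadata_member(stream):
    """Return (name, data) of the metadata archive, or None."""
    # "r|" treats the input as a non-seekable, uncompressed tar stream.
    # Members are visited in order, and reading stops once the metadata
    # archive has been returned. tarfile reads in buffered chunks, so
    # a little data past the member may still be consumed.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            base = posixpath.basename(member.name)
            if base.startswith("metadata.tar") and not base.endswith(".sig"):
                return member.name, tar.extractfile(member).read()
    return None
```

Because the metadata archive precedes the image archive, this works even when the input ends right after the metadata member, which is the partial-fetch case the ordering is meant to support.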
523 +
524 +Detached OpenPGP signatures
525 +---------------------------
526 +
527 +Detached OpenPGP signatures are used to provide authenticity checks
528 +for binary packages. Covering the complete members with signatures
529 +provides for trivial verification of all metadata and image contents
530 +respectively, without having to invent custom mechanisms for combining
531 +them. Covering the compressed archives helps to prevent zipbomb
532 +attacks. Covering the individual members rather than the whole package
533 +provides for verification of partially fetched binary packages.
534 +
535 +
536 +Format versioning
537 +-----------------
538 +
539 +The format is versioned through an explicit file, with the version
540 +stored in the filename. If the format changes incompatibly,
541 +the filename changes and old implementations do not recognize it
542 +as a valid package.
543 +
544 +Previously, the format tried to avoid an explicit file for this purpose
545 +and used a volume label instead. However, the label was abandoned
546 +due to unforeseen portability issues.
547 +
548 +
549 +Backwards Compatibility
550 +=======================
551 +
552 +The format does not preserve backwards compatibility with the tbz2
553 +packages. It has been established that preserving compatibility with
554 +the old format was impossible without making the new format even worse
555 +than the old one was.
556 +
557 +For example, adding any visible members to the tarball would cause
558 +them to be installed to the filesystem by old Portage versions. Working
559 +around this would require awful hacks that would defeat the goal
560 +of using a simple and transparent package format.
561 +
562 +
563 +Reference Implementation
564 +========================
565 +
566 +A proof-of-concept implementation of a binary package format converter
567 +is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
568 +create packages in the new format for early inspection.
569 +
570 +
571 +References
572 +==========
573 +
574 +.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
575 + packages
576 + (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)
577 +
578 +.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
579 + written in C
580 + (https://packages.gentoo.org/packages/app-portage/portage-utils)
581 +
582 +.. [#DEB-FORMAT] deb(5) — Debian binary package format
583 + (https://manpages.debian.org/unstable/dpkg-dev/deb.5.en.html)
584 +
585 +.. [#TAR-PORTABILITY] Michał Górny, Portability of tar features
586 + (https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html)
587 +
588 +.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
589 + to gpkg binpkg format
590 + (https://github.com/mgorny/xpak2gpkg)
591 +
592 +
593 +Copyright
594 +=========
595 +This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
596 +Unported License. To view a copy of this license, visit
597 +http://creativecommons.org/licenses/by-sa/3.0/.