Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP r3] Gentoo binary package container format
Date: Mon, 26 Nov 2018 18:58:26
Message-Id: 1543258696.24857.4.camel@gentoo.org
In Reply to: [gentoo-dev] [pre-GLEP] Gentoo binary package container format by "Michał Górny"
1 Here's the newest version.
2
3 Changes:
4
5 - added explicit notion of parent directory (missing in previous GLEP
6 but present in implementation),
7
8 - explicitly named GNU tar format with list of permitted extensions,
9
10 - changed volume label to 'gpkg-1.txt' file to improve portability; made
11 it an explicit version identifier as well,
12
13 - added info on other package formats to rationale.
14
15
16 ---
17 GLEP: 9999
18 Title: Gentoo binary package container format
19 Author: Michał Górny <mgorny@g.o>
20 Type: Standards Track
21 Status: Draft
22 Version: 1
23 Created: 2018-11-15
24 Last-Modified: 2018-11-26
25 Post-History: 2018-11-17
26 Content-Type: text/x-rst
27 ---
28
29 Abstract
30 ========
31
32 This GLEP proposes a new binary package container format for Gentoo.
33 The current tbz2/XPAK format is briefly described, and its deficiencies
34 are explained. Accordingly, the requirements for a new format are set
35 and a gpkg format satisfying them is proposed. The rationale for
36 the design decisions is provided.
37
38
39 Motivation
40 ==========
41
42 The current Portage binary package format
43 -----------------------------------------
44
45 The historical ``.tbz2`` binary package format used by Portage is
46 a concatenation of two distinct formats: header-oriented compressed .tar
47 format (used to hold package files) and trailer-oriented custom XPAK
48 format (used to hold metadata) [#MAN-XPAK]_. The format has already
49 been extended incompatibly twice.
50
51 The first time, support for storing multiple successive builds of a binary
52 package for a single ebuild version was added. This feature relies
53 on appending an additional hyphen, followed by an integer, to the package
54 filename. It is disabled by default (preserving backwards
55 compatibility) and controlled by the ``binpkg-multi-instance`` feature.
56
57 The second time, support for additional compression formats was
58 added. When a format other than bzip2 is used, the ``.tbz2`` suffix
59 is replaced by ``.xpak`` and Portage relies on magic bytes to detect
60 the compression used. For backwards compatibility, Portage still defaults
61 to using bzip2; the compression program can be switched using
62 the ``BINPKG_COMPRESS`` configuration variable.
63
64 Additionally, there have been minor changes to the stored metadata
65 and file storage policies. In particular, behavior regarding
66 ``INSTALL_MASK``, controllable file compression and stripping has
67 changed over time.
68
69
70 The advantages of tbz2/XPAK format
71 ----------------------------------
72
73 The tbz2/XPAK format used by Portage has three interesting features:
74
75 1. **Each binary package is fully contained within a single file.**
76 While this might seem unnecessary, it makes it easier for the user
77 to transfer binary packages without having to be concerned about
78 finding all the necessary files to transfer.
79
80 2. **The binary packages are compatible with regular compressed
81 tarballs, most of the time.** With the notable exceptions of historical
82 versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
83 can be extracted using regular tar utility with a compressor
84 implementation that discards trailing garbage.
85
86 3. **The metadata is uncompressed, and can be efficiently accessed
87 without decompressing package contents.** This includes
88 the possibility of rewriting it (e.g. as a result of package moves)
89 without the necessity of repacking the files.
90
91
92 Transparency problem with the current binary package format
93 -----------------------------------------------------------
94
95 Notwithstanding its advantages, the tbz2/XPAK format has a significant
96 design fault that consists of two issues:
97
98 1. **The XPAK format is a custom binary format with explicit use
99 of binary-encoded file offsets and field lengths.** As such, it is
100 non-trivial to read or edit without specialized tools. Such tools
101 are currently implemented separately from the package manager,
102 as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.
103
104 2. **The tarball compatibility feature relies on the obscure feature of
105 ignoring trailing garbage in compressed files.** While this is
106 implemented consistently in most of the compressors, this feature
107 is not really a part of specification but rather traditional
108 behavior. Given that the original reasons for this no longer apply,
109 new compressor implementations are likely to miss support for this.
110
111 Both of the issues make the format hard to use without dedicated tools,
112 or when the tools misbehave. This impacts the following scenarios:
113
114 A. **Using binary packages for system recovery.** In case of serious
115 breakage, it is really preferable that the format depends on as few
116 tools as possible, and especially not on Gentoo-specific tools.
117
118 B. **Inspecting binary packages in detail exceeding standard package
119 manager facilities.**
120
121 C. **Modifying binary packages in ways not predicted by the package
122 manager authors.** A real-life example of this is working around
123 broken ``pkg_*`` phases which prevent the package from being
124 installed.
125
126
127 OpenPGP extensibility problem
128 -----------------------------
129
130 There are at least three obvious ways in which the current format could
131 be extended to support OpenPGP signatures, and each of them has its own
132 distinct problem:
133
134 1. **Adding a detached signature.** This option is non-intrusive but
135 causes the format to no longer be contained in a single file.
136
137 2. **Wrapping the package in OpenPGP message format.** This would use
138 a standard format and make verification and unpacking relatively
139 easy. However, it would break backwards compatibility and add
140 explicit dependency on OpenPGP implementation in order to unpack
141 the package.
142
143 3. **Adding OpenPGP signature as extra XPAK member.** This is
144 the clever solution. It implies strengthening the dependency
145 on custom tooling, now additionally necessary to extract
146 the signature and reconstruct the original file to accommodate
147 verification.
148
149
150 Goals for a new container format
151 --------------------------------
152
153 All of the above considered, the new format should combine
154 the advantages of the existing format and at the same time address its
155 deficiencies whenever possible. Furthermore, since a format replacement
156 is taking place it is worthwhile to consider additional goals that could
157 be satisfied with little change.
158
159 The following obligatory goals have been set for a replacement format:
160
161 1. **The packages must remain contained in a single file.** As a matter
162 of user convenience, it should be possible to transfer binary
163 packages without having to use multiple files, and to install them
164 from any location.
165
166 2. **The file format must be entirely based on common file formats,
167 respecting best practices, with as little customization as necessary
168 to satisfy the requirements.** The format should be transparent
169 enough to let user inspect and manipulate it without special tooling
170 or detailed knowledge.
171
172 3. **The file format must provide support for OpenPGP signatures.**
173 Preferably, it should use standard OpenPGP message formats.
174
175 4. **The file format must allow for efficient metadata updates.**
176 In particular, it should be possible to update the metadata without
177 having to recompress package files.
178
179 Additionally, the following optional goals have been noted:
180
181 A. **The file format should account for easy recognition both through
182 filename and through contents.** Preferably, it should have distinct
183 features making it possible to detect it via file(1).
184
185 B. **The file format should provide for partial fetching of binary
186 packages.** It should be possible to easily fetch and read
187 the package metadata without having to download the whole package.
188
189 C. **The file format should allow for metadata compression.**
190
191 D. **The file format should make future extensions easily possible
192 without breaking backwards compatibility.**
193
194
195 Specification
196 =============
197
198 The container format
199 --------------------
200
201 The gpkg package container is an uncompressed .tar archive whose filename
202 should use the ``.gpkg.tar`` suffix. This archive contains the following
203 members, all placed in a single directory whose name matches
204 the basename of the package file, in order:
205
206 1. The package identifier file ``gpkg-1.txt`` (required).
207
208 2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
209 (optional).
210
211 3. The metadata archive ``metadata.tar${comp}``, optionally compressed
212 (required).
213
214 4. A signature for the filesystem image archive:
215 ``image.tar${comp}.sig`` (optional).
216
217 5. The filesystem image archive ``image.tar${comp}``, optionally
218 compressed (required).
219
220 It is recommended that relative order of the archive members is
221 preserved. However, implementations must support archives with members
222 out of order.
223
224 The container may be extended with additional members in the future.
225 The implementations should ignore unrecognized members and preserve
226 them across package updates.
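
The layout above can be sketched with Python's standard ``tarfile``
module. This is a minimal illustration and not the reference
implementation: the helper names, the ``foo-1.0-1`` basename and the
sample metadata key are assumptions made for the example, and both
inner archives are left uncompressed and unsigned.

```python
import io
import tarfile


def add_bytes(tar, name, data):
    """Add an in-memory byte string to an open tar archive as a regular file."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def make_inner_tar(topdir, files):
    """Build an uncompressed inner archive with a single top-level directory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for path, data in sorted(files.items()):
            add_bytes(tar, f"{topdir}/{path}", data)
    return buf.getvalue()


def make_gpkg(basename, metadata, image):
    """Assemble the outer container in the recommended member order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        add_bytes(tar, f"{basename}/gpkg-1.txt", b"")
        add_bytes(tar, f"{basename}/metadata.tar",
                  make_inner_tar("metadata", metadata))
        add_bytes(tar, f"{basename}/image.tar",
                  make_inner_tar("image", image))
    return buf.getvalue()


# Hypothetical package: one metadata key and one installed file.
pkg = make_gpkg("foo-1.0-1", {"SLOT": b"0\n"}, {"usr/bin/foo": b"#!/bin/sh\n"})
with tarfile.open(fileobj=io.BytesIO(pkg)) as tar:
    member_names = tar.getnames()
```

The resulting bytes would be written to a file such as
``foo-1.0-1.gpkg.tar``; any tar implementation can then list the three
members without Gentoo-specific tooling.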
227
228
229 Permitted .tar format features
230 ------------------------------
231
232 The tar archives should use either the POSIX ustar format or a subset
233 of the GNU format with the following (optional) extensions:
234
235 - long pathnames and long linknames,
236
237 - base-256 encoding of large file sizes.
238
239 Other extensions should be avoided whenever possible.
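
One way to follow this rule is to try strict ustar first and fall back
to the GNU format only when a member cannot be represented. The sketch
below relies on Python's ``tarfile`` raising ``ValueError`` when a name
or size does not fit into a ustar header; ``pick_format`` is
a hypothetical helper, not part of any existing tool.

```python
import tarfile


def pick_format(name, size):
    """Return the least exotic tar format able to store one member.

    Strict ustar is preferred. The GNU format is chosen only when the
    name does not fit the ustar name/prefix fields or the size exceeds
    the 11-digit octal size field (8 GiB).
    """
    info = tarfile.TarInfo(name)
    info.size = size
    try:
        # tobuf() builds the header block(s) without writing anything.
        info.tobuf(format=tarfile.USTAR_FORMAT)
        return tarfile.USTAR_FORMAT
    except ValueError:
        return tarfile.GNU_FORMAT
```

For example, ``pick_format("image/usr/bin/foo", 10)`` stays with ustar,
while a 400-character path or a member of 8 GiB or more selects the GNU
format (long names and base-256 sizes respectively).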
240
241
242 The package identifier file
243 ---------------------------
244
245 The package identifier file serves the purpose of identifying the binary
246 package format and its version.
247
248 The implementations must include a package identifier file named
249 ``gpkg-1.txt``. The filename includes package format version;
250 implementations should reject packages which do not contain this file
251 as unsupported format.
252
253 The file can have any contents. Normally, it should be empty.
254
255 Furthermore, this file should be included in the .tar archive
256 as the first member. This makes it possible to use it as an additional
257 magic at a fixed location that can be used by tools such as file(1)
258 to easily distinguish Gentoo binary packages from regular .tar archives.
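
A reader could locate the identifier as sketched below. The function
name is hypothetical. It scans the outer archive as a stream and reports
the position of ``gpkg-1.txt``, so a caller can both reject archives
without the file and notice when it is not the first member.

```python
import io
import tarfile

IDENTIFIER = "gpkg-1.txt"


def identifier_index(fileobj):
    """Return the member index of the identifier file, or None if absent.

    The archive is read sequentially ("r|"), so no seeking is needed.
    Anything that is not a readable tar archive yields None.
    """
    try:
        with tarfile.open(fileobj=fileobj, mode="r|") as tar:
            for index, member in enumerate(tar):
                parent, _, base = member.name.rpartition("/")
                # The file must sit directly inside the single top directory.
                if (member.isfile() and base == IDENTIFIER
                        and parent and "/" not in parent):
                    return index
    except tarfile.TarError:
        return None
    return None


def demo_archive(names):
    """Build a throwaway archive of empty files (for illustration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name))
    return io.BytesIO(buf.getvalue())
```

An index of ``0`` corresponds to the recommended layout; any other
integer means the package is valid but the fixed-offset magic described
above will not be found by tools like file(1).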
259
260
261 The metadata archive
262 --------------------
263
264 The metadata archive stores the package metadata needed for the package
265 manager to process it. The archive should be included at the beginning
266 of the binary package in order to make it possible to read it out of
267 partially fetched binary package, and to avoid fetching the remaining
268 part of the package if not necessary.
269
270 The archive contains a single directory called ``metadata``. In this
271 directory, the individual metadata keys are stored as files. The exact
272 keys and metadata format are outside the scope of this specification.
273
274 The package manager may need to modify the package metadata. In this
275 case, it should replace the metadata archive without having to alter
276 other package members.
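
A metadata update can be sketched as a member-by-member copy of the
outer archive in which only the metadata member is substituted. The
function below is an illustration under two assumptions that the
specification does not state: the new archive uses the same compression
as the old one (the member name is kept), and a stale metadata signature
is dropped because it can no longer verify.

```python
import copy
import io
import tarfile


def replace_metadata(pkg_bytes, new_metadata):
    """Return a new container with the metadata member replaced.

    All other members, including unrecognized ones, are copied
    byte for byte; the image archive is never decompressed.
    """
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(pkg_bytes)) as src, \
            tarfile.open(fileobj=out, mode="w",
                         format=tarfile.GNU_FORMAT) as dst:
        for member in src:
            base = member.name.rpartition("/")[2]
            info = copy.copy(member)
            if base.startswith("metadata.tar"):
                if base.endswith(".sig"):
                    continue  # assumption: drop the now-invalid signature
                info.size = len(new_metadata)
                dst.addfile(info, io.BytesIO(new_metadata))
            elif member.isfile():
                dst.addfile(info, src.extractfile(member))
            else:
                dst.addfile(info)
    return out.getvalue()
```

A signing setup would re-sign the new metadata archive and add the fresh
``.sig`` member before it; that step is left out here.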
277
278 The metadata archive can optionally be compressed. It can also be
279 supplemented with a detached OpenPGP signature.
280
281
282 The image archive
283 -----------------
284
285 The image archive stores all the files to be installed by the binary
286 package. It should be included as the last of the files in the binary
287 package container.
288
289 The archive contains a single directory called ``image``. Inside this
290 directory, all package files are stored in filesystem layout, relative
291 to the root directory.
292
293 The image archive can optionally be compressed. It can also be
294 supplemented with a detached OpenPGP signature.
295
296
297 Archive member compression
298 --------------------------
299
300 The archive members outlined above support optional compression using
301 one of the compressed file formats supported by the package manager.
302 The exact list of compression types is outside the scope of this
303 specification.
304
305 The implementations must support archive members being uncompressed,
306 and must support using different compression types for different files.
307
308 When compressing an archive member, the member filename should be
309 suffixed using the standard suffix for the particular compressed file
310 type (e.g. ``.bz2`` for bzip2 format).
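
Since the compression type is carried in the member filename,
an implementation can dispatch on the suffix. The table below is only
an assumption for the sake of the example (the specification leaves the
list of supported types to the package manager); it covers the
compressors available in the Python standard library.

```python
import bz2
import gzip
import lzma

# Hypothetical suffix table; "" stands for an uncompressed member.
DECOMPRESSORS = {
    "": lambda data: data,
    ".gz": gzip.decompress,
    ".bz2": bz2.decompress,
    ".xz": lzma.decompress,
}


def split_member_name(name):
    """Split e.g. 'pkg/metadata.tar.bz2' into ('metadata', '.bz2').

    Raises ValueError for names that are not one of the two inner
    archives or that use a suffix missing from DECOMPRESSORS.
    """
    base = name.rpartition("/")[2]
    stem, sep, suffix = base.partition(".tar")
    if not sep or stem not in ("metadata", "image"):
        raise ValueError(f"not an inner archive: {name!r}")
    if suffix not in DECOMPRESSORS:
        raise ValueError(f"unsupported compression suffix: {suffix!r}")
    return stem, suffix
```

Because each member is looked up independently, a package with
``metadata.tar`` and ``image.tar.xz`` is handled without special cases.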
311
312
313 OpenPGP member signatures
314 -------------------------
315
316 The archive members support optional OpenPGP signatures.
317 The implementations must allow the user to specify whether OpenPGP
318 signatures are to be expected in remotely fetched packages.
319
320 If the signatures are expected and the archive member is unsigned, the
321 package manager must reject processing it. If the signature does not
322 verify, the package manager must reject processing the corresponding
323 archive member. In particular, it must not attempt decompressing
324 compressed members in those circumstances.
325
326 The signatures are created as binary detached OpenPGP signature files,
327 with filename corresponding to the member filename with ``.sig`` suffix
328 appended.
329
330 The exact details regarding creating and verifying signatures, as well
331 as maintaining and distributing keys are outside the scope of this
332 specification.
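
The acceptance rules above reduce to a small decision function. In this
sketch ``verify`` is a hypothetical callback standing in for whatever
OpenPGP implementation the package manager uses; it receives the member
name and the signature name and returns ``True`` on a good signature.

```python
def may_process(member, present, signatures_expected, verify):
    """Decide whether an archive member may be processed.

    member              -- member filename, e.g. "image.tar.xz"
    present             -- set of member filenames found in the container
    signatures_expected -- user setting for remotely fetched packages
    verify              -- hypothetical callback (member, signature) -> bool
    """
    signature = member + ".sig"
    if signature not in present:
        # Unsigned members are acceptable only if no signature is expected.
        return not signatures_expected
    # A signature that is present must verify, whether expected or not.
    return verify(member, signature)
```

The caller is expected to run this check before handing the member to
a decompressor, so that unverified data never reaches it.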
333
334
335 Rationale
336 =========
337
338 Package formats used by other distributions
339 -------------------------------------------
340
341 The research on the new package format included investigating
342 the possibility of reusing solutions from other operating system
343 distributions. While reusing a foreign package format would be
344 interesting, the differences in Gentoo metadata structure would prevent
345 any real compatibility. Some degree of compatibility might be achieved
346 through adapting the Gentoo metadata, however the costs of such
347 a solution would probably outweigh its usefulness.
348
349 Debian and its derivatives use the .deb package format. This is
350 a nested archive format, with the outer archive being of ar format,
351 and containing nested tarballs of control information (metadata)
352 and data [#DEB-FORMAT]_.
353
354 Red Hat, its derivatives and some less related distributions use
355 the RPM format. It is a custom binary format, storing metadata directly
356 and using a trailer cpio archive to store package files.
357
358 Arch Linux is using xz-compressed tarballs (suffixed ``.pkg.tar.xz``)
359 as its binary package format. The tarballs contain package files
360 on top-level, with specially named dotfiles used for package metadata.
361 OpenPGP signatures are stored as detached ``.sig`` files alongside
362 packages.
363
364 Exherbo is using the pbins format. In this format, the binary package
365 metadata is stored in the repository like ebuilds, and the binary package
366 files are stored separately and downloaded like source tarballs.
367
368
369 Nested archive format
370 ---------------------
371
372 The basic problem in designing the new format was how to embed multiple
373 data streams (metadata, image) into a single file. Traditionally, this
374 has been done via using two non-conflicting file formats. However,
375 while such a solution is clever, it suffers in terms of transparency.
376
377 Therefore, it has been established that the new format should really
378 consist of a single archive format, with all necessary data
379 transparently accessible inside the file. Consequently, it has been
380 debated how different parts of binary package data should be stored
381 inside that archive.
382
383 The proposal to continue storing image data as top-level data
384 in the package format, and store metadata as special directory in that
385 structure has been discarded as a case of in-band signalling.
386
387 Finally, the proposal has been shaped to store different kinds of data
388 as nested archives in the outer binary package container. Besides
389 providing a clean way of accessing different kinds of information, it
390 makes it possible to add separate OpenPGP signatures to them.
391
392
393 Inner vs. outer compression
394 ---------------------------
395
396 One of the points in the new format debate was whether the binary
397 package as a whole should be compressed vs. compressing individual
398 members. The first option may seem an obvious choice, especially
399 given that with a larger data set, the compression may proceed more
400 effectively. However, it has a single strong disadvantage: compression
401 prevents random access and manipulation of the binary package members.
402
403 While for the purpose of reading binary packages, the problem could be
404 circumvented through convenient member ordering and avoiding disjoint
405 reads of the binary package, metadata updates would either require
406 recompressing the whole package (which could be really time consuming
407 with large packages) or applying complex techniques such as splitting
408 the compressed archive into multiple compressed streams.
409
410 This considered, the simplest solution is to apply compression to
411 the individual package members, while leaving the container format
412 uncompressed. It provides fast random access to the individual members,
413 as well as capability of updating them without the necessity of
414 recompressing other files in the container.
415
416 This also makes it possible to easily protect compressed files using
417 standard OpenPGP detached signature format. All this combined,
418 the package manager may perform partial fetch of binary package, verify
419 the signature of its metadata member and process it without having to
420 fetch the potentially-large image part.
421
422
423 Container and archive formats
424 -----------------------------
425
426 During the debate, the actual archive formats to use were considered.
427 The .tar format seemed an obvious choice for the image archive since
428 it is the only widely deployed archive format that stores all kinds
429 of file metadata on POSIX systems. However, multiple options for
430 the outer format have been debated.
431
432 Firstly, the ZIP format has been proposed as the only commonly supported
433 format supporting adding files from stdin (i.e. making it possible to
434 pipe the inner archives straight into the container without using
435 temporary files). However, this format has been clearly rejected
436 as both not being present in the system set, and being trailer-based
437 and therefore unusable without having to fetch the whole file.
438
439 Secondly, the ar and cpio formats were considered. The former is used
440 by Debian and its derivative binary packages; the latter is used by Red
441 Hat derivatives. Both formats have the advantage of having less
442 historical baggage than .tar, and having less overhead. However, both
443 are also rather obscure (especially given that ar is actually provided
444 by GNU binutils rather than as a stand-alone archiver), considered
445 obsolete by POSIX and both have file size limitations smaller than .tar.
446
447 Thirdly, SquashFS was another interesting option. Its main advantage is
448 transparent compression support and ability to mount as a filesystem.
449 However, it has a significant implementation complexity, including mount
450 management and necessity of fallback to unsquashfs. Since the image
451 needs to be writable for the pre-installation manipulations, using it
452 via a mount would additionally require some kind of overlay filesystem.
453 Using it as top-level format has no real gain over a pipeline with tar,
454 and is certainly less portable. Therefore, there does not seem to be
455 a benefit in using SquashFS.
456
457 All that considered, it has been decided that there is no purpose
458 in using a second archive format in the specification unless it has
459 a significant advantage over .tar. Therefore, .tar has also been used
460 as outer package format, even though it has larger overhead than other
461 formats (mostly due to padding).
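
The padding overhead can be estimated with simple arithmetic. The
sketch below assumes short member names (no extension headers),
512-byte tar blocks, the two zero blocks that end a tar archive and the
10240-byte record size that GNU tar pads to by default. For comparison,
the ar format uses an 8-byte magic, 60-byte member headers and 2-byte
alignment.

```python
def tar_size(member_sizes, record=10240):
    """Size of a tar archive holding members of the given sizes."""
    # One header block per member plus data rounded up to whole blocks.
    blocks = sum(512 + -(-size // 512) * 512 for size in member_sizes)
    blocks += 1024  # end-of-archive marker: two zero blocks
    return -(-blocks // record) * record  # pad to a full record


def ar_size(member_sizes):
    """Size of an ar archive holding members of the given sizes."""
    return 8 + sum(60 + size + size % 2 for size in member_sizes)
```

For three members of 0, 100 and 1000 bytes this gives 10240 bytes for
tar against 1288 bytes for ar. The difference is a fixed cost of a few
KiB, which is negligible next to a typical image archive.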
462
463
464 .tar portability issues
465 -----------------------
466
467 The modern .tar dialects could be considered dirty extensions
468 of the original .tar format. Three variants may be considered
469 of interest: POSIX ustar, pax (newer POSIX standard) and GNU tar.
470 All three formats are supported by GNU tar, whose presence on systems
471 used to create binary packages could be relied on. Therefore,
472 the portability concerns are related mostly to being able to read
473 and modify binary packages in scenarios of GNU tar being unavailable.
474
475 For the purpose of this specification, detailed research
476 on portability of individual tar features has been conducted
477 [#TAR-PORTABILITY]_. The research concluded:
478
479 Judging by the test results, the most portability could be
480 achieved by:
481
482 - using strict POSIX ustar format whenever possible,
483
484 - using GNU format for long paths (that do not fit in ustar format),
485
486 - using base-256 (+ pax if already used) encoding for large files,
487
488 - using pax (+ octal or base-256) for high-range/precision
489 timestamps and user/group identifiers,
490
491 - using pax attributes for extended metadata and/or volume label.
492
493 It has been determined that for the purpose of binary packages we really
494 only need to be concerned about long paths and huge files. Therefore,
495 the above was limited to the first three points and a guideline was
496 formed from them.
497
498 Debian has created a similar guideline for the inner tar of its package
499 format [#DEB-FORMAT]_.
500
501
502 Member ordering
503 ---------------
504
505 The member ordering is explicitly specified in order to provide for
506 trivially reading metadata from partially fetched archives.
507 By requiring the metadata archive to be stored before the image archive,
508 the package manager may stop fetching after reading it and save
509 bandwidth and/or space.
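
The saving can be demonstrated by reading a container as a forward-only
stream and stopping once the metadata member has been seen. In this
sketch ``CountingReader`` is a stand-in for a network connection and
the member contents are placeholders; all names are made up for the
example.

```python
import io
import tarfile


class CountingReader:
    """File-like wrapper that records how many bytes were consumed."""

    def __init__(self, data):
        self._raw = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._raw.read(size)
        self.bytes_read += len(chunk)
        return chunk


def read_metadata_member(stream):
    """Return (member filename, raw bytes) of the metadata archive."""
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            base = member.name.rpartition("/")[2]
            if base.startswith("metadata.tar") and not base.endswith(".sig"):
                return base, tar.extractfile(member).read()
    return None


def add_bytes(tar, name, data):
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


# Placeholder package with a 1 MiB image member after the metadata.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
    add_bytes(tar, "p/gpkg-1.txt", b"")
    add_bytes(tar, "p/metadata.tar", b"META")
    add_bytes(tar, "p/image.tar", bytes(1024 * 1024))
pkg = buf.getvalue()

reader = CountingReader(pkg)
name, data = read_metadata_member(reader)
```

After the call, ``reader.bytes_read`` is a small fraction of
``len(pkg)``. Had the image been stored first, the whole file would
have to be consumed before the metadata became available.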
510
511
512 Detached OpenPGP signatures
513 ---------------------------
514
515 The use of detached OpenPGP signatures is to provide authenticity checks
516 for binary packages. Covering the complete members with signatures
517 provides for trivial verification of all metadata and image contents
518 respectively, without having to invent custom mechanisms for combining
519 them. Covering the compressed archives helps to prevent zipbomb
520 attacks. Covering the individual members rather than the whole package
521 provides for verification of partially fetched binary packages.
522
523
524 Format versioning
525 -----------------
526
527 The format is versioned through an explicit file, with the version
528 stored in the filename. If the format changes incompatibly,
529 the filename changes and old implementations do not recognize it
530 as a valid package.
531
532 Previously, the format tried to avoid an explicit file for this purpose
533 and used volume label instead. However, the use of label has been
534 renounced due to unforeseen portability issues.
535
536
537 Backwards Compatibility
538 =======================
539
540 The format does not preserve backwards compatibility with the tbz2
541 packages. It has been established that preserving compatibility with
542 the old format was impossible without making the new format even worse
543 than the old one was.
544
545 For example, adding any visible members to the tarball would cause
546 them to be installed to the filesystem by old Portage versions. Working
547 around this would require some kind of awful hacks that would oppose
548 the goal of using simple and transparent package format.
549
550
551 Reference Implementation
552 ========================
553
554 The proof-of-concept implementation of binary package format converter
555 is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
556 create packages in the new format for early inspection.
557
558
559 References
560 ==========
561
562 .. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
563 packages
564 (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)
565
566 .. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
567 written in C
568 (https://packages.gentoo.org/packages/app-portage/portage-utils)
569
570 .. [#DEB-FORMAT] deb(5) — Debian binary package format
571 (https://manpages.debian.org/unstable/dpkg-dev/deb.5.en.html)
572
573 .. [#TAR-PORTABILITY] Michał Górny, Portability of tar features
574 (https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html)
575
576 .. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
577 to gpkg binpkg format
578 (https://github.com/mgorny/xpak2gpkg)
579
580
581 Copyright
582 =========
583 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
584 Unported License. To view a copy of this license, visit
585 http://creativecommons.org/licenses/by-sa/3.0/.
586
587
588 --
589 Best regards,
590 Michał Górny
