Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev <gentoo-dev@l.g.o>
Subject: [gentoo-dev] [pre-GLEP] Gentoo binary package container format
Date: Sat, 17 Nov 2018 11:21:52
Message-Id: 1542453700.31427.2.camel@gentoo.org
1 Hi,
2
3 Here's a pre-GLEP draft based on the earlier discussion on gentoo-
4 portage-dev mailing list. The specification uses GLEP form as it
5 provides for cleanly specifying the motivation and rationale.
6
7 (Note: the number assignment is not official, just took the next number
8 to satisfy the glep converter script)
9
10 Also available via HTTPS:
11
12 rst: https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
13 html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html
14
15 ---
16 GLEP: 78
17 Title: Gentoo binary package container format
18 Author: Michał Górny <mgorny@g.o>
19 Type: Standards Track
20 Status: Draft
21 Version: 1
22 Created: 2018-11-15
23 Last-Modified: 2018-11-16
24 Post-History: 2018-11-17
25 Content-Type: text/x-rst
26 ---
27
28 Abstract
29 ========
30
31 This GLEP proposes a new binary package container format for Gentoo.
32 The current tbz2/XPAK format is briefly described, and its deficiencies
33 are listed. Accordingly, the requirements for a new format are set
34 and a gpkg format satisfying them is proposed. The rationale for
35 various design decisions is provided.
36
37
38 Motivation
39 ==========
40
41 The current Portage binary package format
42 -----------------------------------------
43
44 The historical ``.tbz2`` binary package format used by Portage is
45 a concatenation of two distinct formats: header-oriented compressed .tar
46 format (used to hold package files) and trailer-oriented custom XPAK
47 format (used to hold metadata) [#MAN-XPAK]_. The format has already
48 been extended incompatibly twice.
49
50 The first time, support for storing multiple successive builds of a binary
51 package for a single ebuild version was added. This feature relies
52 on appending a hyphen followed by an integer to the package
53 filename. It is disabled by default (preserving backwards
54 compatibility) and controlled by the ``binpkg-multi-instance`` feature.
55
56 The second time, support for additional compression formats was
57 added. When a format other than bzip2 is used, the ``.tbz2`` suffix
58 is replaced by ``.xpak`` and Portage relies on magic bytes to detect
59 the compression used. For backwards compatibility, Portage still defaults
60 to bzip2; the compression program can be switched using the
61 ``BINPKG_COMPRESS`` configuration variable.
62
63 Additionally, there have been minor changes to the stored metadata
64 and file storage policies. In particular, behavior regarding
65 ``INSTALL_MASK``, controllable file compression and stripping has
66 changed over time.
67
68
69 Problems with the current binary package format
70 -----------------------------------------------
71
72 The following problems were identified with the package format currently
73 in use:
74
75 1. **The packages rely on a custom binary archive format to store
76 metadata.** It is entirely a Gentoo invention, and requires dedicated
77 tooling to work with it. In fact, the reference implementation
78 in Portage does not even include a CLI tool to work with tbz2
79 packages; an unofficial implementation is provided as part
80 of the portage-utils toolkit [#PORTAGE-UTILS]_.
81
82 2. **The format relies on an obscure compressor feature of ignoring
83 trailing garbage.** While this behavior is traditionally implemented
84 by many compressors, the original reasons for it have long become
85 irrelevant and it is not surprising that new compressors do not
86 support it. In particular, Portage has already hit this problem twice:
87 once when users replaced bzip2 with the parallel-capable pbzip2
88 implementation [#PBZIP2]_, and a second time when support for the zstd
89 compressor was added [#ZSTD]_.
90
91 3. **Placing metadata at the end of the file makes partial fetches
92 complex.** While it is technically possible to obtain package
93 metadata remotely without fetching the whole package, it usually
94 requires 2-3 HTTP requests and a rather complex driver. For
95 comparison, if metadata were placed at the beginning of the file,
96 an early-terminated pipeline with a single fetch request would suffice.
97
98 4. **Extending the format with OpenPGP signatures is non-trivial.**
99 Depending on the implementation details, it requires either fetching
100 an additional detached signature, breaking backwards compatibility, or
101 introducing more custom logic to reassemble OpenPGP packets.
102
103 5. **Metadata is not compressed.** This is not a significant problem;
104 it is listed only for completeness.
105
106
107 Goals for a new container format
108 --------------------------------
109
110 The following goals have been set for a replacement format:
111
112 1. **The packages must remain contained in a single file.** As a matter
113 of user convenience, it should be possible to transfer binary
114 packages without having to use multiple files, and to install them
115 from any location.
116
117 2. **The file format must be entirely based on common file formats,
118 respecting best practices, with as little customization as necessary
119 to satisfy the requirements.** In particular, it is unacceptable
120 to create new binary formats.
121
122 3. **The file format should provide for partial fetching of binary
123 packages.** It should be possible to easily fetch and read
124 the package metadata without having to download the whole package.
125
126 4. **The file format must provide support for OpenPGP signatures.**
127 Preferably, it should use standard OpenPGP message formats.
128
129 5. **The file format must allow for efficient metadata updates.**
130 In particular, it should be possible to update the metadata without
131 having to recompress package files.
132
133 6. **The file format should account for easy recognition both through
134 filename and through contents.** Preferably, it should have distinct
135 features making it possible to detect it via file(1).
136
137 7. **The file format should allow for metadata compression.**
138
139 8. **The file format should make future extensions easily possible
140 without breaking backwards compatibility.**
141
142
143 Specification
144 =============
145
146 The container format
147 --------------------
148
149 The gpkg package container is an uncompressed .tar archive whose filename
150 uses the ``.gpkg.tar`` suffix. This archive contains the following members,
151 in order:
152
153 1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).
154
155 2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
156 (optional).
157
158 3. The metadata archive ``metadata.tar${comp}``, optionally compressed
159 (required).
160
161 4. A signature for the filesystem image archive:
162 ``image.tar${comp}.sig`` (optional).
163
164 5. The filesystem image archive ``image.tar${comp}``, optionally
165 compressed (required).
166
167 It is recommended that the relative order of the archive members is
168 preserved. However, implementations must support archives with members
169 out of order.
170
171 The container may be extended with additional members in the future.
172 The implementations should ignore unrecognized members and preserve
173 them across package updates.
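
A minimal sketch of assembling such a container with Python's ``tarfile`` module follows. The helper names ``make_inner_tar`` and ``make_gpkg``, the use of gzip for both inner archives, and setting the GNU volume-header type flag (``b"V"``) by hand are assumptions made for illustration only; the optional signature members are left out.

```python
import io
import tarfile


def make_inner_tar(top_dir: str, files: dict[str, bytes], comp: str = "gz") -> bytes:
    """Build a compressed inner archive holding `files` under `top_dir`/."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=f"w:{comp}") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(f"{top_dir}/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def make_gpkg(package_id: str, metadata: dict[str, bytes], image: dict[str, bytes]) -> bytes:
    """Assemble the uncompressed outer container in the recommended order."""
    outer = io.BytesIO()
    with tarfile.open(fileobj=outer, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Optional volume label. b"V" is the GNU tar volume-header type flag;
        # tarfile has no named constant for it, so it is set by hand
        # (an assumption of this sketch).
        label = tarfile.TarInfo(f"gpkg: {package_id}")
        label.type = b"V"
        tar.addfile(label)
        # Metadata first, image last; signatures are omitted here.
        for name, payload in (
            ("metadata.tar.gz", make_inner_tar("metadata", metadata)),
            ("image.tar.gz", make_inner_tar("image", image)),
        ):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return outer.getvalue()
```

The outer archive is written in mode ``"w"`` (no compression), while each inner archive carries its own compression, matching the layout listed above.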
174
175
176 The volume label
177 ----------------
178
179 The volume label provides an easy way for users to identify the binary
180 package without dedicated tooling or specific format knowledge.
181
182 The implementations should include a volume label consisting of the fixed
183 string ``gpkg:``, followed by a single space, followed by the full package
184 identifier. However, the implementations must not rely on the volume
185 label being present or attempt to parse its value when it is.
186
187 Furthermore, since the volume label is included in the .tar archive
188 as the first member, it provides a magic string at a fixed location
189 that can be used by tools such as file(1) to easily distinguish Gentoo
190 binary packages from regular .tar archives.
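
As a rough illustration of the magic-string idea, the check below inspects only the first 512-byte tar header block. The function name ``looks_like_gpkg`` is hypothetical, and the offsets used (name at bytes 0-99, type flag at byte 156) are those of the standard tar header layout. It is an identification heuristic in the spirit of file(1), not something a package manager should depend on, since the label is optional.

```python
def looks_like_gpkg(first_block: bytes) -> bool:
    """Heuristically detect a gpkg volume label in the first tar header block."""
    if len(first_block) < 512:
        return False
    # The member name occupies the first 100 bytes, NUL-padded.
    name = first_block[0:100].rstrip(b"\0")
    # The type flag is the single byte at offset 156; GNU tar uses "V"
    # for a volume header.
    typeflag = first_block[156:157]
    return typeflag == b"V" and name.startswith(b"gpkg: ")
```

A caller would typically pass the result of reading the first 512 bytes of the file. A negative answer only means the label is absent; the file may still be a valid package.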
191
192
193 The metadata archive
194 --------------------
195
196 The metadata archive stores the package metadata needed for the package
197 manager to process it. The archive should be included at the beginning
198 of the binary package in order to make it possible to read it out of
199 a partially fetched binary package, and to avoid fetching the remaining
200 part of the package if not necessary.
201
202 The archive contains a single directory called ``metadata``. In this
203 directory, the individual metadata keys are stored as files. The exact
204 keys and metadata format are outside the scope of this specification.
205
206 The package manager may need to modify the package metadata. In this
207 case, it should replace the metadata archive without having to alter
208 other package members.
209
210 The metadata archive can optionally be compressed. It can also be
211 supplemented with a detached OpenPGP signature.
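
The partial-read property can be sketched as follows, assuming the container is consumed as a forward-only byte stream (for instance an HTTP response body). ``CountingReader`` and ``read_metadata`` are hypothetical names introduced here; the counter exists only to show that reading stops before the image archive is consumed. Signature verification is deliberately left out of this sketch.

```python
import io
import tarfile


class CountingReader:
    """Wrap a byte stream and count how many bytes were actually consumed."""

    def __init__(self, raw):
        self.raw = raw
        self.consumed = 0

    def read(self, size=-1):
        data = self.raw.read(size)
        self.consumed += len(data)
        return data


def read_metadata(stream) -> dict[str, bytes]:
    """Return the metadata keys from a forward-only stream, stopping early."""
    # "r|" is tarfile's non-seekable stream mode for an uncompressed archive.
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            if member.name.endswith(".sig"):
                continue  # signatures are not handled in this sketch
            if not member.name.startswith("metadata.tar"):
                continue  # volume label or unrecognized member
            blob = outer.extractfile(member).read()
            keys = {}
            # "r:*" lets tarfile detect the inner compression by itself.
            with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as inner:
                for entry in inner:
                    if entry.isfile() and entry.name.startswith("metadata/"):
                        key = entry.name[len("metadata/"):]
                        keys[key] = inner.extractfile(entry).read()
            # Returning here means the rest of the stream is never read.
            return keys
    raise ValueError("no metadata archive found")
```

Because the loop does not assume the metadata archive comes first, it also copes with members stored out of order, at the cost of reading further into the stream.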
212
213
214 The image archive
215 -----------------
216
217 The image archive stores all the files to be installed by the binary
218 package. It should be included as the last of the files in the binary
219 package container.
220
221 The archive contains a single directory called ``image``. Inside this
222 directory, all package files are stored in filesystem layout, relative
223 to the root directory.
224
225 The image archive can optionally be compressed. It can also be
226 supplemented with a detached OpenPGP signature.
227
228
229 Archive member compression
230 --------------------------
231
232 The archive members outlined above support optional compression using
233 one of the compressed file formats supported by the package manager.
234 The exact list of compression types is outside the scope of this
235 specification.
236
237 The implementations must support archive members being uncompressed,
238 and must support using different compression types for different files.
239
240 When compressing an archive member, the member filename should be
241 suffixed using the standard suffix for the particular compressed file
242 type (e.g. ``.bz2`` for bzip2 format).
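
One possible way to interpret member names under these rules is sketched below. ``MemberKind`` and ``classify_member`` are hypothetical names; the sketch only splits the name and does not check that the suffix belongs to a compression type the package manager supports.

```python
from typing import NamedTuple, Optional


class MemberKind(NamedTuple):
    role: str           # "metadata" or "image"
    comp_suffix: str    # "" when uncompressed, otherwise e.g. ".bz2"
    is_signature: bool  # True for a detached ".sig" member


def classify_member(name: str) -> Optional[MemberKind]:
    """Split a container member name into role, compression suffix and signature flag."""
    is_sig = name.endswith(".sig")
    base = name[: -len(".sig")] if is_sig else name
    for role in ("metadata", "image"):
        prefix = role + ".tar"
        if base == prefix:
            return MemberKind(role, "", is_sig)
        if base.startswith(prefix + "."):
            return MemberKind(role, base[len(prefix):], is_sig)
    # Unrecognized member (or the volume label): callers should ignore it
    # but keep it when rewriting the package.
    return None
```

Since each member is classified independently, a package may freely mix, say, an uncompressed metadata archive with a compressed image archive.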
243
244
245 OpenPGP member signatures
246 -------------------------
247
248 The archive members support optional OpenPGP signatures.
249 The implementations must allow the user to specify whether OpenPGP
250 signatures are to be expected in remotely fetched packages.
251
252 If the signatures are expected and the archive member is unsigned, the
253 package manager must reject processing it. If the signature does not
254 verify, the package manager must reject processing the corresponding
255 archive member. In particular, it must not attempt decompressing
256 compressed members in those circumstances.
257
258 If the implementation needs to manipulate archive members, it must
259 either create a new signature or discard the existing signature.
260
261 The signatures are created as binary detached OpenPGP signature files,
262 with filename corresponding to the member filename with ``.sig`` suffix
263 appended.
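
The acceptance rules above can be condensed into a small decision function. This is a sketch under stated assumptions: ``may_process`` is a hypothetical name, ``verify`` stands for whatever OpenPGP backend the implementation uses (it is injected as a callback so that no real API is implied), and the sketch reads the "does not verify" rule as applying whenever a signature is present, even if signatures are not expected.

```python
from typing import Callable, Mapping


def may_process(
    member: str,
    members: Mapping[str, bytes],
    signatures_expected: bool,
    verify: Callable[[bytes, bytes], bool],
) -> bool:
    """Decide whether `member` may be processed (e.g. decompressed)."""
    signature = members.get(member + ".sig")
    if signature is None:
        # Unsigned member: acceptable only when signatures are not expected.
        return not signatures_expected
    # A signature is present: the member is usable only if it verifies.
    # `verify(data, signature)` is a hypothetical callback, not a real API.
    return verify(members[member], signature)
```

The check runs on the raw (still compressed) member bytes, so a rejected member is never handed to a decompressor.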
264
265
266 Rationale
267 =========
268
269 Nested archive format
270 ---------------------
271
272 The basic problem in designing the new format was how to embed multiple
273 data streams (metadata, image) into a single file. Traditionally, this
274 has been done by using two non-conflicting file formats. However,
275 while such a solution is clever, it suffers in terms of transparency.
276
277 Therefore, it has been established that the new format should really
278 consist of a single archive format, with all necessary data
279 transparently accessible inside the file. Consequently, it has been
280 debated how different parts of binary package data should be stored
281 inside that archive.
282
283 The proposal to continue storing image data as top-level data
284 in the package format, and store metadata as a special directory in that
285 structure has been discarded as a case of in-band signalling.
286
287 Finally, the proposal has been shaped to store different kinds of data
288 as nested archives in the outer binary package container. Besides
289 providing a clean way of accessing different kinds of information, it
290 makes it possible to add separate OpenPGP signatures to them.
291
292
293 Inner vs. outer compression
294 ---------------------------
295
296 One of the points in the new format debate was whether the binary
297 package as a whole should be compressed vs. compressing individual
298 members. The first option may seem an obvious choice, especially
299 given that with a larger data set, the compression may proceed more
300 effectively. However, it has a single strong disadvantage: compression
301 prevents random access and manipulation of the binary package members.
302
303 While for the purpose of reading binary packages, the problem could be
304 circumvented through convenient member ordering and avoiding disjoint
305 reads of the binary package, metadata updates would either require
306 recompressing the whole package (which could be very time-consuming
307 with large packages) or applying complex techniques such as splitting
308 the compressed archive into multiple compressed streams.
309
310 This considered, the simplest solution is to apply compression to
311 the individual package members, while leaving the container format
312 uncompressed. It provides fast random access to the individual members,
313 as well as the capability to update them without the necessity of
314 recompressing other files in the container.
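
To illustrate the update path, the sketch below rewrites a container with a new metadata archive while copying every other member byte for byte. ``replace_metadata`` is a hypothetical helper; it operates on in-memory bytes for brevity and discards the old metadata signature rather than creating a new one.

```python
import io
import tarfile


def replace_metadata(pkg: bytes, new_name: str, new_blob: bytes) -> bytes:
    """Return a copy of `pkg` with the metadata archive replaced."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(pkg), mode="r:") as src, \
            tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as dst:
        for member in src:
            if member.name.startswith("metadata.tar"):
                if member.name.endswith(".sig"):
                    # The old signature no longer matches: discard it.
                    continue
                info = tarfile.TarInfo(new_name)
                info.size = len(new_blob)
                dst.addfile(info, io.BytesIO(new_blob))
            else:
                # Image, its signature and unknown members are copied
                # verbatim; nothing is decompressed or recompressed.
                dst.addfile(member, src.extractfile(member))
    return out.getvalue()
```

The cost of the update is a plain copy of the remaining members, independent of how expensive their compression was.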
315
316 This also makes it possible to easily protect compressed files using
317 standard OpenPGP detached signature format. All this combined,
318 the package manager may perform a partial fetch of a binary package, verify
319 the signature of its metadata member and process it without having to
320 fetch the potentially-large image part.
321
322
323 Container and archive formats
324 -----------------------------
325
326 During the debate, the actual archive formats to use were considered.
327 The .tar format seemed an obvious choice for the image archive since
328 it is the only widely deployed archive format that stores all kinds
329 of file metadata on POSIX systems. However, multiple options for
330 the outer format have been debated.
331
332 Firstly, the ZIP format has been proposed as the only commonly supported
333 format supporting adding files from stdin (i.e. making it possible to
334 pipe the inner archives straight into the container without using
335 temporary files). However, this format has been clearly rejected
336 as both not being present in the system set, and being trailer-based
337 and therefore unusable without having to fetch the whole file.
338
339 Secondly, the ar and cpio formats were considered. The former is used
340 by Debian and its derivative binary packages; the latter is used by Red
341 Hat derivatives. Both formats have the advantage of having less
342 historical baggage than .tar, and having less overhead. However, both
343 are also rather obscure (especially given that ar is actually provided
344 by GNU binutils rather than as a stand-alone archiver), considered
345 obsolete by POSIX and both have file size limitations smaller than .tar.
346
347 All that considered, it has been decided that there is no purpose
348 in using a second archive format in the specification unless it has
349 a significant advantage over .tar. Therefore, .tar has also been used
350 as the outer package format, even though it has larger overhead than other
351 formats (mostly due to padding).
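
The padding overhead can be estimated with a few lines of arithmetic. The sketch assumes 512-byte tar blocks, the 10240-byte default record size used by GNU tar and Python's ``tarfile``, and member names short enough to fit in a single header block; ``tar_overhead`` and ``measured_overhead`` are hypothetical helper names, the latter being a cross-check against a real archive.

```python
import io
import tarfile

BLOCK = 512          # tar block size
RECORD = 20 * BLOCK  # default record size (assumed, see lead-in)


def tar_overhead(member_sizes: list[int], record: int = RECORD) -> int:
    """Bytes a plain .tar container adds around members of the given sizes."""
    total = 0
    for size in member_sizes:
        padded = -(-size // BLOCK) * BLOCK  # round data up to a whole block
        total += BLOCK + padded             # one header block per member
    total += 2 * BLOCK                      # end-of-archive marker
    total = -(-total // record) * record    # pad the archive to a full record
    return total - sum(member_sizes)


def measured_overhead(member_sizes: list[int]) -> int:
    """Cross-check: build a real archive with tarfile and measure it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for index, size in enumerate(member_sizes):
            info = tarfile.TarInfo(f"member{index}")
            info.size = size
            tar.addfile(info, io.BytesIO(b"\0" * size))
    return len(buf.getvalue()) - sum(member_sizes)
```

With only a handful of members per package, the overhead is bounded by a few header blocks plus less than one record of padding, which is small next to a typical image archive.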
352
353
354 Member ordering
355 ---------------
356
357 The member ordering is explicitly specified in order to provide for
358 trivially reading metadata from partially fetched archives.
359 By requiring the metadata archive to be stored before the image archive,
360 the package manager may stop fetching after reading it and save
361 bandwidth and/or space.
362
363
364 Detached OpenPGP signatures
365 ---------------------------
366
367 The use of detached OpenPGP signatures is to provide authenticity checks
368 for binary packages. Covering the complete members with signatures
369 provides for trivial verification of all metadata and image contents
370 respectively, without having to invent custom mechanisms for combining
371 them. Covering the compressed archives helps to prevent zipbomb
372 attacks. Covering the individual members rather than the whole package
373 provides for verification of partially fetched binary packages.
374
375
376 Backwards Compatibility
377 =======================
378
379 The format does not preserve backwards compatibility with the tbz2
380 packages. It has been established that preserving compatibility with
381 the old format was impossible without making the new format even worse
382 than the old one was.
383
384 For example, adding any visible members to the tarball would cause
385 them to be installed to the filesystem by old Portage versions. Working
386 around this would require some kind of awful hacks that would oppose
387 the goal of using a simple and transparent package format.
388
389
390 Reference Implementation
391 ========================
392
393 A proof-of-concept implementation of a binary package format converter
394 is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
395 create packages in the new format for early inspection.
396
397
398 References
399 ==========
400
401 .. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
402 packages
403 (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)
404
405 .. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
406 written in C
407 (https://packages.gentoo.org/packages/app-portage/portage-utils)
408
409 .. [#PBZIP2] PBZIP2 - a parallel implementation of the bzip2
410 block-sorting file compressor
411 (https://launchpad.net/pbzip2)
412
413 .. [#ZSTD] Zstandard - Real-time data compression algorithm
414 (https://facebook.github.io/zstd/)
415
416 .. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
417 to gpkg binpkg format
418 (https://github.com/mgorny/xpak2gpkg)
419
420
421 Copyright
422 =========
423 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
424 Unported License. To view a copy of this license, visit
425 http://creativecommons.org/licenses/by-sa/3.0/.
426
427 --
428 Best regards,
429 Michał Górny
