Hi, everyone.

Gentoo's tbz2/xpak package format is quite old. We've made a few
incompatible changes in the past (most notably, allowing non-bzip2
compression and multi-instance naming) but the core design has stayed
the same. I think we should consider changing it, for the reasons
outlined below.

A rough description of the format can be found in xpak(5). Basically,
it's a regular compressed tarball with a binary metadata blob appended
to the end. As such, it looks like a regular compressed tarball
to the compression tools (with some ignored junk at the end).
The metadata is in an entirely custom format and needs dedicated tools
to manipulate.

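To make the "custom binary blob" point concrete, here is a minimal
Python sketch of the trailer layout as I read it from xpak(5): the
tarball is followed by an "XPAKPACK" ... "XPAKSTOP" segment, then the
segment length and the literal "STOP", with all integers stored as
32-bit big-endian. Please treat the field layout as an assumption to be
checked against the man page. build_xpak() and read_xpak() are
hypothetical helper names, and the "tarball" is just placeholder bytes.

```python
import struct


def build_xpak(entries):
    """Serialize {name: bytes} into an xpak segment plus the tbz2 trailer."""
    index = b""
    data = b""
    for name, value in entries.items():
        encoded = name.encode()
        # Index entry: name length, name, offset into data, data length.
        index += struct.pack(">I", len(encoded)) + encoded
        index += struct.pack(">II", len(data), len(value))
        data += value
    xpak = (b"XPAKPACK" + struct.pack(">II", len(index), len(data))
            + index + data + b"XPAKSTOP")
    # Trailer: length of the xpak segment, then the literal "STOP".
    return xpak + struct.pack(">I", len(xpak)) + b"STOP"


def read_xpak(package):
    """Return {name: bytes} parsed from the end of a tbz2-style byte string."""
    if package[-4:] != b"STOP":
        raise ValueError("no xpak trailer found")
    (xpak_len,) = struct.unpack(">I", package[-8:-4])
    xpak = package[-8 - xpak_len:-8]
    if xpak[:8] != b"XPAKPACK" or xpak[-8:] != b"XPAKSTOP":
        raise ValueError("malformed xpak segment")
    index_len, data_len = struct.unpack(">II", xpak[8:16])
    index = xpak[16:16 + index_len]
    data = xpak[16 + index_len:16 + index_len + data_len]

    entries = {}
    pos = 0
    while pos < len(index):
        (name_len,) = struct.unpack(">I", index[pos:pos + 4])
        name_end = pos + 4 + name_len
        name = index[pos + 4:name_end].decode()
        offset, length = struct.unpack(">II", index[name_end:name_end + 8])
        entries[name] = data[offset:offset + length]
        pos = name_end + 8
    return entries


# Placeholder for the compressed tarball; its content is never parsed here.
fake_tarball = b"compressed tarball bytes would go here"
package = fake_tarball + build_xpak({"CATEGORY": b"app-foo\n", "PF": b"bar-1\n"})
metadata = read_xpak(package)
```

Note that the reader works backwards from the last 8 bytes of the file,
which is exactly why the metadata is cheap to find locally but awkward
to fetch remotely.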

The current format has a few advantages that would probably be worth
preserving:

+ The binary package is a single flat file.

+ It is reasonably compatible with a regular compressed tarball,
so users can unpack it using standard tools (except for the metadata).

+ The metadata is uncompressed and can be quickly found without touching
the compressed data.

+ The metadata can be updated (e.g. as a result of a pkgmove) without
touching the compressed data.


However, it has a few disadvantages as well:

- The metadata is in an entirely custom binary format, requiring
dedicated tools to read or edit.

- The metadata format relies on the customary behavior of compression
tools, which ignore junk following the compressed data.

- By placing the metadata at the end of the file, we make it rather hard
to read the metadata from a remote location (via FTP, HTTP) without
fetching the whole file. [NB: it's technically possible but probably not
worth the effort]

- By requiring the custom format to be at the end of the file, we make
it impossible to trivially cover it with an OpenPGP signature without
introducing another custom format.

- While the format might allow for some extensibility, it's rather
an evolutionary dead end.


I think the key points of the new format should be:

1. It should reuse common file formats as much as possible, inventing
as little custom code as possible.

2. It should allow for easy introspection and editing by users without
dedicated tools.

3. The metadata should allow for lookup without fetching the whole
binary package.

4. The format should allow for some extensions without having to
reinvent the wheel every time.

5. It would be nice to preserve the existing advantages.


My proposal
===========

Basic format
------------
The base of the format is a regular compressed tarball. There's no junk
appended to it; instead, the metadata is stored inside it as
/var/db/pkg/${PF}. The contents are as compatible with the actual vdb
format as possible.

This has the following advantages:

+ The binary package is still stored as a single file.

+ It uses the standard compressed .tar format, with minimal
customization.

+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).

+ If we can maintain a reasonable level of vdb compatibility, the user
can even emergency-install a package without causing too much hassle (as
it will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.

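As a rough illustration of the basic layout, the following Python sketch
builds such a package in memory with the stock tarfile module. It is
only a sketch under a few assumptions of mine: member names are stored
without the leading slash, gzip stands in for whatever compressor we
pick, and build_package() plus the SLOT/KEYWORDS contents are made up
for the example.

```python
import io
import tarfile


def add_bytes(tar, name, payload):
    """Add an in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def build_package(pf, metadata, files):
    """Return a compressed tarball holding vdb-style metadata plus files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # Metadata lives inside the archive, under the vdb path.
        for key, value in metadata.items():
            add_bytes(tar, f"var/db/pkg/{pf}/{key}", value)
        # The installed files follow, relative to the root directory.
        for path, payload in files.items():
            add_bytes(tar, path.lstrip("/"), payload)
    return buf.getvalue()


package = build_package(
    "bar-1",
    {"SLOT": b"0\n", "KEYWORDS": b"~amd64\n"},
    {"/usr/bin/bar": b"#!/bin/sh\necho bar\n"},
)

# Any ordinary tar reader can list or extract the result.
with tarfile.open(fileobj=io.BytesIO(package), mode="r:gz") as tar:
    names = tar.getnames()
```

Since the result is a plain tarball, unpacking it over / would create
both the files and the vdb entry, which is the emergency-install case
mentioned above.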

Optimizing for easy recognition
-------------------------------
In order to make it possible for magic-based tools such as file(1) to
easily distinguish Gentoo binary packages from regular tarballs, we
could (ab)use the volume label field, e.g.:

$ tar -V 'gpkg: app-foo/bar-1' -c ...

This will add a volume label as the first entry inside the tarball,
which does not affect extraction but can be trivially matched via magic
rules.

Note: this is meant to be used as a method for fast binary package
recognition; I don't think we should reject (hand-modified) binary
packages that lack this label.

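For illustration, here is roughly what such a magic check boils down to,
sketched in Python on an uncompressed archive. It relies on the GNU tar
header layout (name at offset 0, typeflag at offset 156, 'V' for a
volume header). The "gpkg: " prefix is simply the one from the example
above, and build_tar() / read_label() are hypothetical helpers.

```python
import io
import tarfile

LABEL_PREFIX = b"gpkg: "  # prefix borrowed from the example, not a spec


def build_tar(files, label=None):
    """Build an uncompressed GNU tar, optionally led by a volume label."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        if label is not None:
            vol = tarfile.TarInfo(label)
            vol.type = b"V"  # GNU volume header typeflag
            tar.addfile(vol)
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_label(raw_tar):
    """Return the package named by the label, or None if there is none."""
    header = raw_tar[:512]
    if len(header) < 512 or header[156:157] != b"V":
        return None
    name = header[:100].split(b"\0", 1)[0]
    if not name.startswith(LABEL_PREFIX):
        return None
    return name[len(LABEL_PREFIX):].decode()


files = {"usr/bin/bar": b"#!/bin/sh\necho bar\n"}
labelled = read_label(build_tar(files, label="gpkg: app-foo/bar-1"))
unlabelled = read_label(build_tar(files))
```

A missing label yields None rather than an error, in line with not
rejecting hand-modified packages.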

Optimizing for metadata reading/manipulation performance
--------------------------------------------------------
The main problem with using a single tarball for both metadata and data
is that normally you'd have to decompress everything to reliably unpack
the metadata, and recompress everything to update it. This problem can
be addressed by a few optimization tricks.

Firstly, all metadata files are packed into the archive before the data
files. With a slightly customized unpacker, we can stop decompressing as
soon as we're past the metadata and avoid decompressing the whole
archive. This will also make it possible to read metadata from remote
files without fetching far past the compressed metadata block.

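The first trick can be sketched with Python's tarfile in streaming mode
("r|*"), which only ever reads forward. This is a minimal sketch, not a
proposed implementation: the var/db/pkg/ prefix test and read_metadata()
are my assumptions, and the random payload merely stands in for a large
package body.

```python
import io
import os
import tarfile

METADATA_PREFIX = "var/db/pkg/"  # assumed location of metadata members


def add_bytes(tar, name, payload):
    """Add an in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def read_metadata(fileobj):
    """Read the leading metadata members, stopping at the first data member."""
    metadata = {}
    # Stream mode decompresses on demand, so breaking out of the loop
    # leaves the rest of the compressed input unread (apart from whatever
    # was already pulled into the read buffer).
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.name.startswith(METADATA_PREFIX):
                break
            if member.isfile():
                metadata[member.name] = tar.extractfile(member).read()
    return metadata


# Demo package: metadata first, then 1 MiB of incompressible "data".
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    add_bytes(tar, "var/db/pkg/bar-1/SLOT", b"0\n")
    add_bytes(tar, "usr/share/bar/blob", os.urandom(1 << 20))
package = buf.getvalue()

source = io.BytesIO(package)
metadata = read_metadata(source)
consumed = source.tell()  # how far into the compressed file we had to read
```

Comparing consumed with len(package) shows how little of the file had
to be read; the same loop would work on any file-like object, including
one backed by a remote fetch.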
Secondly, if we're up for some more tricks, we could technically split
the tarball into metadata and data blocks that are compressed
separately. This will need a bit of archiver customization but it will
make it possible to decompress the metadata part without even touching
the compressed data, and to replace it without recompressing the data.

What's important is that both proposed tricks maintain backwards
compatibility with regular compressed tarballs. That is, the user will
still be able to extract the package with regular archiving tools.

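Here is a minimal sketch of the second trick, assuming gzip as the
compressor: concatenated gzip members decompress as one continuous
stream, so two separately compressed tar fragments still read as a
single tarball. Whether other compressors behave the same way is not
something this sketch covers. All helper names are hypothetical, and a
real reader would feed the decompressor in chunks instead of handing it
the whole file.

```python
import gzip
import io
import tarfile
import zlib

END_OF_ARCHIVE = b"\0" * 1024  # two zero blocks terminate a tar archive


def tar_member(name, payload):
    """Return the raw tar bytes (header + padded data) for one file."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    header = info.tobuf(format=tarfile.GNU_FORMAT)
    return header + payload + b"\0" * (-len(payload) % 512)


def tar_fragment(files):
    """Concatenate members without the end-of-archive marker."""
    return b"".join(tar_member(n, p) for n, p in files.items())


def build_split_package(metadata, files):
    """Compress metadata and data as two independent gzip members."""
    return (gzip.compress(tar_fragment(metadata))
            + gzip.compress(tar_fragment(files) + END_OF_ARCHIVE))


def split_first_stream(package):
    """Decompress only the first gzip member; return it and the raw rest."""
    decomp = zlib.decompressobj(wbits=31)  # 31 selects gzip framing
    meta_tar = decomp.decompress(package)
    return meta_tar, decomp.unused_data


def replace_metadata(package, new_metadata):
    """Swap the metadata member, reusing the compressed data untouched."""
    _, data_stream = split_first_stream(package)
    return gzip.compress(tar_fragment(new_metadata)) + data_stream


package = build_split_package(
    {"var/db/pkg/bar-1/SLOT": b"0\n"},
    {"usr/bin/bar": b"#!/bin/sh\necho bar\n"},
)
updated = replace_metadata(package, {"var/db/pkg/bar-1/SLOT": b"1\n"})

# A regular reader still sees one ordinary compressed tarball.
with tarfile.open(fileobj=io.BytesIO(updated), mode="r:gz") as tar:
    names = tar.getnames()
    slot = tar.extractfile("var/db/pkg/bar-1/SLOT").read()
```

The compressed data member is carried over byte for byte, so updating
the metadata costs only one small recompression.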

Adding OpenPGP signatures
-------------------------
This is the main XXX here.

Technically, the most obvious solution is to cover the entire tarball
with an OpenPGP signature. However, this has the disadvantage that
verification requires fetching the whole file.

I will look into the possibility of having partial signatures.


--
Best regards,
Michał Górny