On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <mgorny@g.o> wrote:

> Hi, everyone.
>
> The Gentoo's tbz2/xpak package format is quite old. We've made a few
> incompatible changes in the past (most notably, allowing non-bzip2
> compression and multi-instance naming) but the core design stayed
> the same. I think we should consider changing it, for the reasons
> outlined below.
>
> The rough format description can be found in xpak(5). Basically, it's
> a regular compressed tarball with binary metadata blob appended
> to the end. As such, it looks like a regular compressed tarball
> to the compression tools (with some ignored junk at the end).
> The metadata is entirely custom format and needs dedicated tools
> to manipulate.
>
>
> The current format has a few advantages whose preserving would probably
> be worthwhile:
>
> + The binary package is a single flat file.
>
> + It is reasonably compatible with regular compressed tarball,
> so the users can unpack it using standard tools (except for metadata).
>
> + The metadata is uncompressed and can be quickly found without touching
> the compressed data.
>
> + The metadata can be updated (e.g. as result of pkgmove) without
> touching the compressed data.
>
>
> However, it has a few disadvantages as well:
>
> - The metadata is entirely custom binary format, requiring dedicated
> tools to read or edit.
>
> - The metadata format is relying on customary behavior of compression
> tools that ignore junk following the compressed data.
>

I agree this is a problem in theory, but I haven't seen it as a problem in
practice. Have you observed any problems around this setup?

>
> - By placing the metadata at the end of file, we make it rather hard to
> read the metadata from remote location (via FTP, HTTP) without fetching
> the whole file. [NB: it's technically possible but probably not worth
> the effort]
>
> - By requiring the custom format to be at the end of file, we make it
> impossible to trivially cover it with a OpenPGP signature without
> introducing another custom format.
>

It's trivial to cover with a detached sig, no?

>
> - While the format might allow for some extensibility, it's rather
> evolutionary dead end.
>

I'm not even sure how to quantify this; it just sounds like your subjective
opinion (which is fine, but it's not factual).

>
>
> I think the key points of the new format should be:
>
> 1. It should reuse common file formats as much as possible, with
> inventing as little custom code as possible.
>
> 2. It should allow for easy introspection and editing by users without
> dedicated tools.
>

So I'm less confident in the editing use cases; do users edit their binpkgs
on a regular basis?

>
> 3. The metadata should allow for lookup without fetching the whole
> binary package.
>
> 4. The format should allow for some extensions without having to
> reinvent the wheel every time.
>
> 5. It would be nice to preserve the existing advantages.
>
>
> My proposal
> ===========
>
> Basic format
> ------------
> The base of the format is a regular compressed tarball. There's no junk
> appended to it but the metadata is stored inside it as
> /var/db/pkg/${PF}. The contents are as compatible with the actual vdb
> format as possible.
>

Just to clarify: are you suggesting we store the metadata inside the
contents of the binary package itself (i.e. alongside the other files that
get merged to the live filesystem)? What about collisions?

E.g. if I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
overwrite files in the VDB at qmerge time?

>
> This has the following advantages:
>
> + Binary package is still stored as a single file.
>
> + It uses a standard compressed .tar format, with minimal customization.
>
> + The user can easily inspect and modify the packages with standard
> tools (tar and the compressor).
>
> + If we can maintain reasonable level of vdb compatibility, the user can
> even emergency-install a package without causing too much hassle (as it
> will be recorded in vdb); ideally Portage would detect this vdb entry
> and support fixing the install afterwards.
>

I'm not certain this is really desired.

>
>
> Optimizing for easy recognition
> -------------------------------
> In order to make it possible for magic-based tools such as file(1) to
> easily distinguish Gentoo binary packages from regular tarballs, we
> could (ab)use the volume label field, e.g. use:
>
> $ tar -V 'gpkg: app-foo/bar-1' -c ...
>
> This will add a volume label as the first file entry inside the tarball,
> which does not affect extracting but can be trivially matched via magic
> rules.
>
> Note: this is meant to be used as a method for fast binary package
> recognition; I don't think we should reject (hand-modified) binary
> packages that lack this label.
>
>
> Optimizing for metadata reading/manipulation performance
> --------------------------------------------------------
> The main problem with using a single tarball for both metadata and data
> is that normally you'd have to decompress everything to reliably unpack
> metadata, and recompress everything to update it. This problem can be
> addressed by a few optimization tricks.
>

These performance goals seem a little ill-defined.

1) Where are users reporting slowness in binpkg operations?
2) What is the cause of the slowness?

I could easily see a user with many large binpkgs for whom the current
implementation causes issues, because they have to decompress and seek a
bunch to read the metadata out of their 1.2GB binpkg. But I'm pretty sure
this isn't most users.

>
> Firstly, all metadata files are packed to the archive before data files.
> With a slightly customized unpacker, we can stop decompressing as soon
> as we're past metadata and avoid decompressing the whole archive. This
> will also make it possible to read metadata from remote files without
> fetching far past the compressed metadata block.
>

So this seems to basically go against your goals of simple common tooling?
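
For concreteness, here is a rough Python sketch of what the "slightly
customized unpacker" quoted above might amount to. It assumes the metadata
members live under a var/db/pkg/ prefix and are packed before any data
member; the function names and the sample package name are hypothetical and
not part of Portage.

```python
import io
import tarfile

METADATA_PREFIX = "var/db/pkg/"  # assumed location of metadata inside the archive


def build_package(metadata: dict, data: dict) -> bytes:
    """Build a gzip-compressed tarball with all metadata members packed first."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in list(metadata.items()) + list(data.items()):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_metadata(fileobj) -> dict:
    """Stream the archive and stop at the first member outside the metadata prefix."""
    found = {}
    # "r|gz" is tarfile's forward-only stream mode: input is pulled in small
    # chunks, so breaking out early leaves the rest of the file unread.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if not member.name.startswith(METADATA_PREFIX):
                break
            if member.isfile():
                found[member.name] = tar.extractfile(member).read()
    return found
```

Calling read_metadata() on an open package file returns only the members
under the prefix; the same loop would work on any forward-only stream, which
is what the remote-fetch case needs. It is still custom code, just not much
of it.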

>
> Secondly, if we're up for some more tricks, we could technically split
> the tarball into metadata and data blocks compressed separately. This
> will need a bit of archiver customization but it will make it possible
> to decompress the metadata part without even touching compressed data,
> and to replace it without recompressing data.
>
> What's important is that both tricks proposed maintain backwards
> compatibility with regular compressed tarballs. That is, the user will
> still be able to extract it with regular archiving tools.
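
To check my understanding of the second trick, here is a minimal Python
sketch. It assumes gzip as the compressor and relies on the fact that
concatenated gzip members decompress as one stream; the helper names are
hypothetical, and this is only one way the split could be done, not
necessarily what is being proposed.

```python
import gzip
import io
import tarfile
import zlib


def _add(tar, name, payload):
    """Append one regular file to an open tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def build_split_package(metadata: dict, data: dict) -> bytes:
    """Write one tar stream, then compress its metadata and data parts separately."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, payload in metadata.items():
            _add(tar, name, payload)
        split_at = raw.tell()  # tar members are 512-byte aligned, so this is a clean cut
        for name, payload in data.items():
            _add(tar, name, payload)
    blob = raw.getvalue()
    # Two gzip members back to back still read as a single gzip stream.
    return gzip.compress(blob[:split_at]) + gzip.compress(blob[split_at:])


def split_package(package: bytes):
    """Return (decompressed metadata tar bytes, still-compressed data member)."""
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    meta_tar = decomp.decompress(package)  # stops at the end of the first member
    return meta_tar, decomp.unused_data


def member_names(tar_bytes: bytes, mode: str) -> list:
    """List member names of an in-memory tar ("r:") or tar.gz ("r:gz")."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode=mode) as tar:
        return tar.getnames()
```

With this layout, swapping the metadata is a matter of compressing a new
metadata tar and prepending it to the untouched second member, e.g.
gzip.compress(new_meta_tar) + rest, and a plain "tar xzf" still sees one
ordinary archive.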

So my recollection is that Debian uses common-format ar archives for the
main deb, containing two compressed tarballs: one for metadata and one for
data.

This format seems to jibe with many of your requirements:

- 'ar' can retrieve individual files from the archive.
- The deb file itself is not compressed, but the tarballs inside *are*
  compressed.
- The metadata and data are compressed separately.
- Anyone can edit this with normal tooling (ar, tar).
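
To show how little there is to the container, here is a small Python sketch
of the common ar layout: an 8-byte magic followed by members, each with a
60-byte text header and data padded to an even length. The member names
mirror the ones a deb uses (debian-binary, control.tar.*, data.tar.*); the
payloads are placeholders, and the helper names are made up for
illustration.

```python
import io

AR_MAGIC = b"!<arch>\n"
HEADER_LEN = 60


def write_ar(members) -> bytes:
    """Serialize (name, payload) pairs as a common-format ar archive."""
    out = io.BytesIO()
    out.write(AR_MAGIC)
    for name, payload in members:
        # Header fields: name(16) mtime(12) uid(6) gid(6) mode(8) size(10) end(2)
        header = "{:<16}{:<12}{:<6}{:<6}{:<8}{:<10}`\n".format(
            name, 0, 0, 0, "100644", len(payload)
        ).encode("ascii")
        out.write(header)
        out.write(payload)
        if len(payload) % 2:
            out.write(b"\n")  # member data is padded to an even length
    return out.getvalue()


def iter_ar(blob: bytes):
    """Yield (name, payload) for each member, reading only headers and sizes."""
    if blob[:len(AR_MAGIC)] != AR_MAGIC:
        raise ValueError("not an ar archive")
    pos = len(AR_MAGIC)
    while pos + HEADER_LEN <= len(blob):
        header = blob[pos:pos + HEADER_LEN]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header")
        # Names are space padded; GNU ar additionally terminates them with "/".
        name = header[:16].decode("ascii").rstrip(" ").rstrip("/")
        size = int(header[48:58].decode("ascii").strip())
        start = pos + HEADER_LEN
        yield name, blob[start:start + size]
        pos = start + size + (size % 2)
```

Because every header carries the member size, a reader can jump straight to
the metadata tarball without decompressing or even reading the data one.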

In short: why should we invent a new format?

>
>
> Adding OpenPGP signatures
> -------------------------
> This is the main XXX here.
>
> Technically, the most obvious solution is to cover the entire tarball
> with OpenPGP signature. However, this has the disadvantage that
> the verification requires fetching the whole file.
>
> I will look into possibility of having partial signatures.
>
>
> --
> Best regards,
> Michał Górny
>