On Sat, 2018-11-10 at 09:37 -0500, Alec Warner wrote:
> On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <mgorny@g.o> wrote:
> 
> > Hi, everyone.
> > 
> > Gentoo's tbz2/xpak package format is quite old. We've made a few
> > incompatible changes in the past (most notably, allowing non-bzip2
> > compression and multi-instance naming) but the core design has stayed
> > the same. I think we should consider changing it, for the reasons
> > outlined below.
> > 
> > A rough format description can be found in xpak(5). Basically, it's
> > a regular compressed tarball with a binary metadata blob appended
> > to the end. As such, it looks like a regular compressed tarball
> > to compression tools (with some ignored junk at the end).
> > The metadata is an entirely custom format and needs dedicated tools
> > to manipulate.
> > 
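For reference, the trailer layout described in xpak(5) can be sketched in a few lines of Python. This is a sketch, not Portage code; it assumes the 4-byte big-endian length field counts the blob itself (from "XPAKPACK" through "XPAKSTOP"):

```python
import struct

def read_xpak(path):
    """Locate the xpak metadata blob appended to a tbz2 file.

    Trailer layout per xpak(5):
      <compressed tar>...<xpak blob><xpak_len: 4-byte BE>"STOP"
    where the blob itself is framed by "XPAKPACK" ... "XPAKSTOP".
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)                       # last 8 bytes: length + "STOP"
        xpak_len, stop = struct.unpack(">I4s", f.read(8))
        if stop != b"STOP":
            raise ValueError("no xpak trailer found")
        f.seek(-(8 + xpak_len), 2)          # back up over the blob
        blob = f.read(xpak_len)
    if not (blob.startswith(b"XPAKPACK") and blob.endswith(b"XPAKSTOP")):
        raise ValueError("corrupt xpak blob")
    return blob
```

Note how nothing before the trailer is touched: the compressed tarball is opaque junk from the metadata reader's point of view, which is exactly the property discussed below.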
> > 
> > The current format has a few advantages that are probably worth
> > preserving:
> > 
> > + The binary package is a single flat file.
> > 
> > + It is reasonably compatible with a regular compressed tarball,
> > so users can unpack it using standard tools (except for the metadata).
> > 
> > + The metadata is uncompressed and can be quickly found without
> > touching the compressed data.
> > 
> > + The metadata can be updated (e.g. as a result of a pkgmove) without
> > touching the compressed data.
> > 
> > However, it has a few disadvantages as well:
> > 
> > - The metadata is an entirely custom binary format, requiring
> > dedicated tools to read or edit.
> > 
> > - The metadata format relies on the customary behavior of compression
> > tools that ignore junk following the compressed data.
> 
> I agree this is a problem in theory, but I haven't seen it as a problem
> in practice. Have you observed any problems around this setup?

Historically, one of the parallel compressor variants did not support
this.

> > - By placing the metadata at the end of the file, we make it rather
> > hard to read the metadata from a remote location (via FTP, HTTP)
> > without fetching the whole file. [NB: it's technically possible but
> > probably not worth the effort]
> 
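The bracketed note is right that remote reading is technically possible. A sketch of such a tail fetch via an HTTP Range request (`fetch_tail` is a hypothetical helper, not part of Portage):

```python
import urllib.request

def fetch_tail(url, n):
    """Fetch the last n bytes of a remote file via an HTTP Range request
    (hypothetical helper; sketch only)."""
    req = urllib.request.Request(url, headers={"Range": "bytes=-%d" % n})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        # Servers that ignore Range reply 200 with the whole file;
        # fall back to slicing the tail out locally.
        return body if resp.status == 206 else body[-n:]
```

The catch is that the xpak trailer's length field must be read first, and then a second ranged request made for the blob itself; FTP has no comparable mechanism, which is why the note calls it "probably not worth the effort".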
> 
> > - By requiring the custom format to be at the end of the file, we
> > make it impossible to trivially cover it with an OpenPGP signature
> > without introducing another custom format.
> 
> It's trivial to cover with a detached sig, no?
> 
> > - While the format might allow for some extensibility, it's rather an
> > evolutionary dead end.
> 
> I'm not even sure how to quantify this; it just sounds like your
> subjective opinion (which is fine, but it's not factual.)
> 
> > I think the key points of the new format should be:
> > 
> > 1. It should reuse common file formats as much as possible, while
> > inventing as little custom code as possible.
> > 
> > 2. It should allow for easy introspection and editing by users
> > without dedicated tools.
> 
> So I'm less confident in the editing use cases; do users edit their
> binpkgs on a regular basis?

It's useful for debugging stuff. I had to use hexedit on xpak
in the past. Believe me, it's nowhere close to pleasant.

> > 3. The metadata should allow for lookup without fetching the whole
> > binary package.
> > 
> > 4. The format should allow for some extensions without having to
> > reinvent the wheel every time.
> > 
> > 5. It would be nice to preserve the existing advantages.
> > 
> > 
> > My proposal
> > ===========
> > 
> > Basic format
> > ------------
> > The base of the format is a regular compressed tarball. There's no
> > junk appended to it; instead, the metadata is stored inside it as
> > /var/db/pkg/${PF}. The contents are as compatible with the actual vdb
> > format as possible.
> > 
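The proposed layout can be sketched with the standard tarfile module; `build_gpkg`, its dict-based interface, and the member names are illustrative only, not Portage API:

```python
import io
import tarfile

def build_gpkg(path, pf, metadata, files):
    """Sketch of the proposed layout: metadata entries stored inside
    the tarball under var/db/pkg/${PF}/, packed before the image files.
    `metadata` and `files` map archive paths to file contents (bytes)."""
    with tarfile.open(path, "w:xz") as tar:
        for name, data in metadata.items():
            _add(tar, "var/db/pkg/%s/%s" % (pf, name), data)
        for name, data in files.items():
            _add(tar, name, data)

def _add(tar, name, data):
    # Helper: append one regular-file member with in-memory contents.
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
```

The result is a plain .tar.xz: `tar -xJf` extracts it with no special tooling, and the metadata lands where a vdb entry would live.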
> 
> Just to clarify: you are suggesting we store the metadata inside the
> contents of the binary package itself (i.e. where the other files that
> get merged to the liveFS are)? What about collisions?
> 
> E.g. if I install 'machine-images/gentoo-disk-image-1.2.3' on a machine
> that already has 'machine-images/gentoo-disk-image-1.2.3' installed,
> won't it overwrite files in the VDB at qmerge time?

Portage will obviously move the files out and process them as metadata.
The idea is precisely to use a directory that can't normally be part
of a binary package, so it can't cause collisions with real files (even
if such collisions are very unlikely to ever happen).

> > This has the following advantages:
> > 
> > + The binary package is still stored as a single file.
> > 
> > + It uses a standard compressed .tar format, with minimal
> > customization.
> > 
> > + The user can easily inspect and modify the packages with standard
> > tools (tar and the compressor).
> > 
> > + If we can maintain a reasonable level of vdb compatibility, the
> > user can even emergency-install a package without causing too much
> > hassle (as it will be recorded in vdb); ideally, Portage would detect
> > this vdb entry and support fixing the install afterwards.
> 
> I'm not certain this is really desired.

Are you saying it's better that a user emergency-installs a package
without recording it in vdb, and ends up with a mess of collisions
and untracked files?

Just because you don't like some use case doesn't mean it's not gonna
happen. Either you prepare for it and make the best of it, or you
pretend it's not gonna happen and cause extra pain to users.

> > Optimizing for easy recognition
> > -------------------------------
> > In order to make it possible for magic-based tools such as file(1) to
> > easily distinguish Gentoo binary packages from regular tarballs, we
> > could (ab)use the volume label field, e.g.:
> > 
> > $ tar -V 'gpkg: app-foo/bar-1' -c ...
> > 
> > This adds the volume label as the first file entry inside the
> > tarball, which does not affect extraction but can be trivially
> > matched via magic rules.
> > 
> > Note: this is meant as a method for fast binary package recognition;
> > I don't think we should reject (hand-modified) binary packages that
> > lack this label.
> > 
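A detection sketch, assuming GNU tar's representation of the label (a header block with typeflag 'V' at offset 156 and the label text in the name field at offset 0); a file(1) magic rule could match the same bytes:

```python
def is_gpkg(path):
    """Detect the proposed 'gpkg: ...' volume label by inspecting the
    first 512-byte tar header block (sketch; assumes GNU tar's
    volume-header layout: typeflag 'V', label in the name field)."""
    with open(path, "rb") as f:
        hdr = f.read(512)
    if len(hdr) < 512 or hdr[156:157] != b"V":
        return False
    label = hdr[0:100].split(b"\0", 1)[0]
    return label.startswith(b"gpkg: ")
```

Since the check only touches the first uncompressed-header bytes of the outer archive, it stays cheap even for very large packages.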
> > 
> > Optimizing for metadata reading/manipulation performance
> > --------------------------------------------------------
> > The main problem with using a single tarball for both metadata and
> > data is that normally you'd have to decompress everything to reliably
> > unpack the metadata, and recompress everything to update it. This
> > problem can be addressed by a few optimization tricks.
> 
> These performance goals seem a little bit ill-defined.
> 
> 1) Where are users reporting slowness in binpkg operations?
> 2) What is the cause of the slowness?

Those are optimizations meant to avoid slowness compared to the current
format. The main use case is recreating the package index, which
requires rereading the metadata of all binary packages.

> I could easily see a potential user with many large binpkgs, and the
> current implementation causing them issues because they have to
> decompress and seek a bunch to read the metadata out of their 1.2GB
> binpkg. But I'm pretty sure this isn't most users.
> 
> > Firstly, all metadata files are packed into the archive before data
> > files. With a slightly customized unpacker, we can stop decompressing
> > as soon as we're past the metadata and avoid decompressing the whole
> > archive. This will also make it possible to read metadata from remote
> > files without fetching far past the compressed metadata block.
> > 
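The "slightly customized unpacker" amounts to a streaming read that bails out at the first non-metadata member. A sketch using the stdlib (the `var/db/pkg/` prefix matches the proposal above; the function name is illustrative):

```python
import tarfile

def read_metadata(path):
    """Stream-read a package, stopping at the first non-metadata member.
    Relies on metadata being packed first, per the proposal."""
    meta = {}
    with tarfile.open(path, "r|*") as tar:   # pure stream, no back-seeking
        for member in tar:
            if not member.name.lstrip("/").startswith("var/db/pkg/"):
                break                        # past the metadata block: stop
            f = tar.extractfile(member)
            if f is not None:
                meta[member.name] = f.read()
    return meta
```

Decompression of everything after the break point is skipped (modulo whatever the decompressor has already buffered), which is what makes bulk index regeneration cheap.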
> 
> So this seems to basically go against your goal of simple common
> tooling?

No. My goal is to keep it compatible with simple common tooling. You
can still use the simple tooling to read and write packages. The
optimized tools are only needed to handle special use cases
efficiently.

> > Secondly, if we're up for some more tricks, we could technically
> > split the tarball into metadata and data blocks compressed
> > separately. This will need a bit of archiver customization, but it
> > will make it possible to decompress the metadata part without even
> > touching the compressed data, and to replace it without recompressing
> > the data.
> > 
> > What's important is that both proposed tricks maintain backwards
> > compatibility with regular compressed tarballs. That is, the user
> > will still be able to extract the package with regular archiving
> > tools.
> 
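The second trick works because xz (like gzip) decompresses concatenated streams as one output. A sketch with stand-in bytes instead of a real tar stream; in practice the split point would fall on the 512-byte record boundary after the last metadata member:

```python
import lzma

def split_compress(tar_bytes, split_at):
    """Compress one tar stream as two concatenated xz streams, split at
    a 512-byte record boundary. Standard decompressors (xz -dc, GNU tar)
    accept concatenated streams, so the result still extracts like a
    normal .tar.xz, while the first stream alone holds the metadata."""
    assert split_at % 512 == 0
    return (lzma.compress(tar_bytes[:split_at])
            + lzma.compress(tar_bytes[split_at:]))

def first_stream(blob):
    """Decompress only the first xz stream, never touching the data
    block that follows it."""
    d = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    out = bytearray()
    for i in range(0, len(blob), 8192):      # feed in chunks
        out += d.decompress(blob[i:i + 8192])
        if d.eof:                            # end of first stream reached
            break
    return bytes(out)
```

Replacing the metadata is then just swapping out the first stream and concatenating again, with no recompression of the data block.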
> 
> My recollection is that Debian uses common-format ar files for the
> main .deb. Then they have two compressed tarballs inside, one for
> metadata and one for data.
> 
> This format seems to jibe with many of your requirements:
> 
> - 'ar' can retrieve individual files from the archive.
> - The deb file itself is not compressed, but the tarballs inside *are*
> compressed.
> - The metadata and data are compressed separately.
> - Anyone can edit this with normal tooling (ar, tar).
> 
> In short: why should we invent a new format?

Because nobody knows how to use 'ar', compared to how almost every
Gentoo user can use 'tar' immediately? Of course, we could
alternatively just use a nested tarball, but I wanted to keep the
possibility of actually being able to 'tar -xf' it without having to
extract nested archives.

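For comparison, the ar container Debian uses really is tiny: a global magic string followed by fixed-width 60-byte headers. A minimal reader sketch, assuming the traditional common-format field widths (name 16, mtime 12, uid 6, gid 6, mode 8, size 10, magic 2):

```python
def ar_members(blob):
    """List member names of a common-format ar archive (the container
    used for .deb files). Sketch; no extraction, header walk only."""
    assert blob[:8] == b"!<arch>\n"
    names, off = [], 8
    while off + 60 <= len(blob):
        hdr = blob[off:off + 60]
        assert hdr[58:60] == b"`\n"          # per-member header magic
        names.append(hdr[0:16].rstrip().decode())
        size = int(hdr[48:58].strip())
        off += 60 + size + (size % 2)        # members are 2-byte aligned
    return names
```

The counter-argument above still stands, though: 'tar -xf' is muscle memory for Gentoo users in a way 'ar x' is not, and the proposed format keeps that property.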
-- 
Best regards,
Michał Górny