Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP] Gentoo binary package container format
Date: Wed, 21 Nov 2018 09:33:31
Message-Id: 1542792798.16894.17.camel@gentoo.org
In Reply to: Re: [gentoo-dev] [pre-GLEP] Gentoo binary package container format by Fabian Groffen
1 On Sun, 2018-11-18 at 12:00 +0100, Fabian Groffen wrote:
2 > On 18-11-2018 10:38:51 +0100, Michał Górny wrote:
3 > > On Sun, 2018-11-18 at 10:16 +0100, Fabian Groffen wrote:
4 > > > On 17-11-2018 12:21:40 +0100, Michał Górny wrote:
5 > > > > Problems with the current binary package format
6 > > > > -----------------------------------------------
7 > > > >
8 > > > > The following problems were identified with the package format currently
9 > > > > in use:
10 > > > >
11 > > > > 1. **The packages rely on custom binary archive format to store
12 > > > > metadata.** It is entirely Gentoo invented, and requires dedicated
13 > > > > tooling to work with it. In fact, the reference implementation
14 > > > > in Portage does not even include a CLI tool to work with tbz2
15 > > > > packages; an unofficial implementation is provided as part
16 > > > > of portage-utils toolkit [#PORTAGE-UTILS]_.
17 > > >
18 > > > I think you should rewrite this section to the argument that the
19 > > > metadata is hard to edit, and that there is only one tool to do so
20 > > > (except a python interface from Portage?).
21 > > > On a separate note, I don't think portage-utils can be considered
22 > > > "unofficial", it is a Gentoo official project as far as I am aware.
23 > >
24 > > In this context, Portage is 'official'. Portage-utils is a project
25 > > that's developed entirely separately from Portage and doesn't use
26 > > Portage APIs but instead reinvents everything. As such, it is easy for
27 > > the two to go out of sync. Or for one of them to have bugs that
28 > > the other one doesn't have (say, with endianness).
29 >
30 > I'm not sure if it's actually true, I was under the impression the same
31 > author(s) worked on the Portage as well as portage-utils code. Anyway,
32 > aren't quickpkg and emerge enough from a user's perspective?
33
34 Gentoo users have a wide perspective. Assuming that you can think of
35 all things the users need and you don't need to care beyond that
36 is plain wrong and results in Windows.
37
38 > > > > 2. **The format relies on obscure compressor feature of ignoring
39 > > > > trailing garbage**. While this behavior is traditionally implemented
40 > > > > by many compressors, the original reasons for it have become long
41 > > > > irrelevant and it is not surprising that new compressors do not
42 > > > > support it. In particular, Portage already hit this problem twice:
43 > > > > once when users replaced bzip2 with parallel-capable pbzip2
44 > > > > implementation [#PBZIP2]_, and the second time when support for zstd
45 > > > > compressor was added [#ZSTD]_.
46 > > >
47 > > > I think this is actually the result of a rather opportunistic
48 > > > implementation. The fault is that we chose to use an extension that
49 > > > suggests the file is a regular compressed tarball.
50 > > > When one detects that a file is xpak padded, it is trivial to feed the
51 > > > decompressor just the relevant part of the datastream. The format
52 > > > itself isn't bad, and doesn't rely on obscure behaviour.
53 > >
54 > > Except if you don't have the proper tools installed. In which case
55 > > the 'opportunistic' behavior made it possible to extract the contents
56 > > without special tools... except when it actually happens not to work
57 > > anymore. Roy's reply indicates that there is actually interest in this
58 > > design feature.
59 >
60 > Your point is that the format is broken (== relies on obscure compressor
61 > feature). My point is that the format simply requires a special tool.
62 > The fact that we prefer to use existing tools doesn't imply in any way
63 > that the format is broken to me.
64 > I think you should rewrite your point to mention that you don't want to
65 > use a tool that doesn't exist in @system (?) to unpack a binpkg. My
66 > guess is that you could use some head/tail magic in a script if the
67 > trailing block is upsetting the decompressor.
68 >
69 > I'm not saying this may look ugly, I'm just saying that your point seems
70 > biased.
71
72 I've spent significant effort rewriting those points to make it clear
73 what the problem is, and separating it from other changes 'worth doing
74 while we're changing stuff'. Hope that satisfies your nitpicking.
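
For reference, the "feed the decompressor just the relevant part" approach mentioned above could look roughly like this. It is only a minimal sketch, assuming the trailer layout as I understand it from xpak(5): the xpak segment, followed by a 4-byte big-endian length of that segment, followed by "STOP". The names split_tbz2 and demo are hypothetical and not part of Portage or portage-utils.

```python
import bz2
import struct
from typing import Tuple


def split_tbz2(blob: bytes) -> Tuple[bytes, bytes]:
    """Return (compressed_tarball, xpak_segment) from a tbz2 package image."""
    # Assumed trailer: <xpak segment><4-byte big-endian xpak length>b"STOP"
    if len(blob) < 8 or blob[-4:] != b"STOP":
        raise ValueError("no xpak trailer found")
    (xpak_len,) = struct.unpack(">I", blob[-8:-4])
    tar_end = len(blob) - 8 - xpak_len
    if tar_end < 0:
        raise ValueError("corrupt xpak length")
    xpak = blob[tar_end:-8]
    if not (xpak.startswith(b"XPAKPACK") and xpak.endswith(b"XPAKSTOP")):
        raise ValueError("xpak markers missing")
    return blob[:tar_end], xpak


def demo() -> bytes:
    """Build a synthetic package in memory and decompress only the tarball part."""
    payload = bz2.compress(b"package contents")
    # Empty xpak: XPAKPACK + index_len (0) + data_len (0) + XPAKSTOP
    xpak = b"XPAKPACK" + struct.pack(">II", 0, 0) + b"XPAKSTOP"
    blob = payload + xpak + struct.pack(">I", len(xpak)) + b"STOP"
    compressed, _ = split_tbz2(blob)
    # The decompressor never sees the trailing xpak data.
    return bz2.decompress(compressed)
```

Note that this still needs a dedicated script on the machine doing the unpacking, which is the dependency the proposal tries to avoid.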
75
76 > > > > 3. **Placing metadata at the end of file makes partial fetches
77 > > > > complex.** While it is technically possible to obtain package
78 > > > > metadata remotely without fetching the whole package, it usually
79 > > > > requires e.g. 2-3 HTTP requests with rather complex driver. For
80 > > > > comparison, if metadata was placed at the beginning of the file,
81 > > > > early-terminated pipeline with a single fetch request would suffice.
82 > > >
83 > > > I think this point needs to be quantified somewhat why it is so
84 > > > important.
85 > > > I may be wrong, but the average binpkg is small, <1MiB, bigger packages
86 > > > are <50MiB.
87 > > > So what is the gain to be saved here? A "few" MiBs for what operation
88 > > > exactly? I say "few" because I know for some users this is actually not
89 > > > just a blip before it's downloaded. So if this is possible to achieve,
90 > > > in what scenarios is this going to be used (and is this often?).
91 > >
92 > > Last I checked, Gentoo aimed to support more users than the 'majority'
93 > > of people with high-throughput Internet access. If there's no cost
94 > > in doing things better, why not do them better?
95 >
96 > You didn't address the critical question, but instead just repeated what
97 > I said.
98 > So again, why do you need to read just the metadata?
99
100 The original idea was to make it possible to index remote packages
101 without having a server-side cache available (or up-to-date). In order
102 to do that, the package manager would need to fetch the metadata of all
103 packages (but there is no need to fetch the whole packages).
104 However, that's merely a possible future idea. It's not worth debating
105 today.
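
To illustrate what an "early-terminated pipeline" means when the metadata comes first, here is a minimal sketch using Python's tarfile module in streaming mode. The member names "metadata" and "image", as well as CountingReader, build_package and read_first_member, are made up for the example and are not taken from the spec. The reader stands in for a network stream and counts how many bytes were actually requested.

```python
import io
import tarfile


def build_package() -> bytes:
    """Create an uncompressed tar with a small metadata member placed first."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in (("metadata", b"SLOT=0\n"), ("image", b"\0" * 1_000_000)):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


class CountingReader:
    """Stand-in for a network stream that records how much was consumed."""

    def __init__(self, data: bytes) -> None:
        self._src = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, size: int = -1) -> bytes:
        chunk = self._src.read(size)
        self.bytes_read += len(chunk)
        return chunk


def read_first_member(stream, wanted: str) -> bytes:
    """Read one member from a forward-only stream and stop right after it."""
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if member.name == wanted:
                return tar.extractfile(member).read()
    raise KeyError(wanted)
```

With the metadata at the front, only the first buffer or two is consumed before the function returns; with the metadata at the end, the same forward-only reader would have to pass over the whole image first.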
106
107 Today I really understood the point of avoiding premature optimization.
108 Even if the change is practically zero-cost and harmless (as it's simply
109 reordering files), it's going to cost you a lot of time because someone
110 will keep nitpicking on it, even though any other order will not change
111 anything.
112
113 > > > > 4. **Extending the format with OpenPGP signatures is non-trivial.**
114 > > > > Depending on the implementation details, it either requires fetching
115 > > > > additional detached signature, breaking backwards compatibility or
116 > > > > introducing more custom logic to reassemble OpenPGP packets.
117 > > >
118 > > > I think one could add an extra key to the xpak that holds a gpg sig or
119 > > > something. Perhaps this point is better phrased as that current binpkgs
120 > > > don't have any validation options defined.
121 > >
122 > > ...which extra key would mean that the two disjoint implementations
123 > > in use would need more custom code that extracts the signature,
124 > > reconstructs signed data for verification and verifies it. Or, in other
125 > > words, that user needs even more custom tooling to manually verify
126 > > the package he just fetched.
127 >
128 > I don't see your point. If you define what the package format looks
129 > like, you just need to implement that. There is no point in having a
130 > binpkg format that Portage doesn't implement properly. Portage is
131 > well-equipped to implement any of the approaches. A user should use
132 > Portage to install a package. A poweruser could use a separate tool for
133 > a scenario where he/she's in charge of keeping things sane. Relevancy?
134 >
135 > I just don't agree that extending the format is non-trivial. You seem
136 > to have no arguments other than adding "custom logic", which is what you
137 > eventually also do in the reference implementation of your new approach.
138
139 The difference is that my format is transparent. You file(1) it, you
140 see a .tar archive. You extract the archive, you see subarchives
141 and .sig which are widely recognized. You don't have to read the spec,
142 you don't have to get special tools. If you have ever verified a detached
143 signature, you know how to proceed. If you haven't, you'll learn
144 something you can reuse.
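
As a rough illustration of that transparency, the sketch below builds a plain tar containing sub-archives with detached signatures next to them, then pairs each payload with its signature using nothing but the standard tarfile module. The member names (metadata.tar.gz, image.tar.gz and their .sig companions) and the helper names are assumptions for the example, not the final layout, and the signature bytes are placeholders.

```python
import io
import tarfile
from typing import Dict, Optional


def build_example() -> bytes:
    """Create an outer tar holding hypothetical sub-archives and .sig files."""
    buf = io.BytesIO()
    members = {
        "metadata.tar.gz": b"placeholder metadata archive",
        "metadata.tar.gz.sig": b"placeholder signature",
        "image.tar.gz": b"placeholder image archive",
        "image.tar.gz.sig": b"placeholder signature",
    }
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def signed_members(package: bytes) -> Dict[str, Optional[str]]:
    """Map each payload member to its detached signature member, if any."""
    with tarfile.open(fileobj=io.BytesIO(package), mode="r:") as tar:
        names = tar.getnames()
    return {
        name: (name + ".sig" if name + ".sig" in names else None)
        for name in names
        if not name.endswith(".sig")
    }
```

After extracting the outer archive, each pair can be checked by hand in the usual way for detached signatures, e.g. gpg --verify image.tar.gz.sig image.tar.gz.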
145
146 Now, implementing signatures on top of XPAK is more effort, and yields
147 something that is more fragile and in the end doesn't benefit anyone.
148
149 >
150 > > > > 5. **Metadata is not compressed.** This is not a significant problem,
151 > > > > it is just listed for completeness.
152 > > > >
153 > > > >
154 > > > > Goals for a new container format
155 > > > > --------------------------------
156 > > > >
157 > > > > The following goals have been set for a replacement format:
158 > > > >
159 > > > > 1. **The packages must remain contained in a single file.** As a matter
160 > > > > of user convenience, it should be possible to transfer binary
161 > > > > packages without having to use multiple files, and to install them
162 > > > > from any location.
163 > > > >
164 > > > > 2. **The file format must be entirely based on common file formats,
165 > > > > respecting best practices, with as little customization as necessary
166 > > > > to satisfy the requirements.** In particular, it is unacceptable
167 > > > > to create new binary formats.
168 > > >
169 > > > I take this as your personal opinion. I don't quite get why it is
170 > > > unacceptable to create a new binary format though. In particular when
171 > > > you're looking for efficiency, such format could serve your purposes.
172 > > > As long as it's clearly defined, I don't see the problem with a binary
173 > > > format either.
174 > > > Could you add why it is you think binary formats are unacceptable here?
175 > >
176 > > Because custom binary formats require specialized tooling, and are
177 > > a royal PITA when the user wants to do something that the author of
178 > > specialized tooling just happened not to think worthwhile, or when
179 > > the tooling is not available for some reason. And before you ask really
180 > > silly questions, yes, I did fight binary packages over hex editor
181 > > at some point.
182 >
183 > Which I still don't understand, to be frank. I think even Portage
184 > exposes python APIs to get to the data.
185
186 Compare the time needed to make a trivial (but unforeseen) change
187 on a format that's transparent vs a format that requires you to learn
188 its spec and/or API, write a program and debug it.
189
190 > > The most trivial case is an attempted recovery of a broken system.
191 > > If you don't have Portage working and don't have portage-utils
192 > > installed, do you really prefer a custom format which will require you
193 > > to fetch and compile special tools? Or one that can be processed
194 > > with tools you're quite likely to have on every system, like tar?
195 >
196 > Well, I think the idea behind the original binpkg format was to use tar
197 > directly on the files in emergency scenarios like these...
198 > The assumption was bzip2 decompressor and tar being available.
199 > I think it is an example of how you add something, while still allowing
200 > to fallback on existing tools.
201
202 Except progress in compressors has made it work less and less reliably.
203 It's mostly an example of how to be *clever*. However, being clever
204 usually doesn't pay off in the long term, compared to doing things *in a
205 simple way*.
206
207 > > > > 3. **The file format should provide for partial fetching of binary
208 > > > > packages.** It should be possible to easily fetch and read
209 > > > > the package metadata without having to download the whole package.
210 > > >
211 > > > Like above, what is the use-case here? Why would you want this? I
212 > > > think I'm missing something here.
213 > >
214 > > Does this harm anything? Even if there's little real use for this, is
215 > > there any harm in supporting it? Are we supposed to do things the other
216 > > way around with no benefit just because you don't see any real use for
217 > > it?
218 >
219 > Well, you make a huge point out of it. And if it isn't used, then why
220 > bother so much about it. Then it just looks like you want to use it as
221 > an argument to get rid of something you just don't like.
222 >
223 > In my opinion you better just say "hey I would like to implement this
224 > binpkg format, because I think it would be easier to support with
225 > minimal tools since it doesn't have custom features". I would have
226 > nothing against that. Simple and elegant is nice, you don't need to
227 > invent arguments for that, in my opinion.
228
229 The spec is now more focused on that.
230
231 >
232 > Fabian
233 >
234 > > > > 4. **The file format must provide support for OpenPGP signatures.**
235 > > > > Preferably, it should use standard OpenPGP message formats.
236 > > > >
237 > > > > 5. **The file format must allow for efficient metadata updates.**
238 > > > > In particular, it should be possible to update the metadata without
239 > > > > having to recompress package files.
240 > > > >
241 > > > > 6. **The file format should account for easy recognition both through
242 > > > > filename and through contents.** Preferably, it should have distinct
243 > > > > features making it possible to detect it via file(1).
244 > > > >
245 > > > > 7. **The file format should allow for metadata compression.**
246 > > > >
247 > > > > 8. **The file format should make future extensions easily possible
248 > > > > without breaking backwards compatibility.**
249 > > >
250 > > >
251 > >
252 > > --
253 > > Best regards,
254 > > Michał Górny
255 >
256 >
257 >
258
259 --
260 Best regards,
261 Michał Górny
