Gentoo Archives: gentoo-dev

From: Fabian Groffen <grobian@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP] Gentoo binary package container format
Date: Wed, 21 Nov 2018 10:46:06
Message-Id: 20181121104554.GB28829@gentoo.org
In Reply to: Re: [gentoo-dev] [pre-GLEP] Gentoo binary package container format by "Michał Górny"
1 On 21-11-2018 10:33:18 +0100, Michał Górny wrote:
2 > > > > > 2. **The format relies on obscure compressor feature of ignoring
3 > > > > > trailing garbage**. While this behavior is traditionally implemented
4 > > > > > by many compressors, the original reasons for it have become long
5 > > > > > irrelevant and it is not surprising that new compressors do not
6 > > > > > support it. In particular, Portage already hit this problem twice:
7 > > > > > once when users replaced bzip2 with parallel-capable pbzip2
8 > > > > > implementation [#PBZIP2]_, and the second time when support for zstd
9 > > > > > compressor was added [#ZSTD]_.
10 > > > >
11 > > > > I think this is actually the result of a rather opportunistic
12 > > > > implementation. The fault is that we chose to use an extension that
13 > > > > suggests the file is a regular compressed tarball.
14 > > > > When one detects that a file is xpak padded, it is trivial to feed the
15 > > > > decompressor just the relevant part of the datastream. The format
16 > > > > itself isn't bad, and doesn't rely on obscure behaviour.
17 > > >
18 > > > Except if you don't have the proper tools installed. In which case
19 > > > the 'opportunistic' behavior made it possible to extract the contents
20 > > > without special tools... except when it actually happens not to work
21 > > > anymore. Roy's reply indicates that there is actually interest in this
22 > > > design feature.
23 > >
24 > > Your point is that the format is broken (== relies on obscure compressor
25 > > feature). My point is that the format simply requires a special tool.
26 > > The fact that we prefer to use existing tools doesn't imply in any way
27 > > that the format is broken to me.
28 > > I think you should rewrite your point to mention that you don't want to
29 > > use a tool that doesn't exist in @system (?) to unpack a binpkg. My
30 > > guess is that you could use some head/tail magic in a script if the
31 > > trailing block is upsetting the decompressor.
32 > >
33 > > I'm not saying this may not look ugly, I'm just saying that your point
34 > > seems biased.
35 >
36 > I've spent a significant effort rewriting those points to make it clear
37 > what the problem is, and separating it from other changes 'worth doing
38 > while we're changing stuff'. Hope that satisfies your nitpicking.
39
40 Yes it does, thank you.
41
42 > > > > > 3. **Placing metadata at the end of file makes partial fetches
43 > > > > > complex.** While it is technically possible to obtain package
44 > > > > > metadata remotely without fetching the whole package, it usually
45 > > > > > requires e.g. 2-3 HTTP requests with rather complex driver. For
46 > > > > > comparison, if metadata was placed at the beginning of the file,
47 > > > > > early-terminated pipeline with a single fetch request would suffice.
48 > > > >
49 > > > > I think this point needs to be quantified somewhat why it is so
50 > > > > important.
51 > > > > I may be wrong, but the average binpkg is small, <1MiB, bigger packages
52 > > > > are <50MiB.
53 > > > > So what is the gain to be saved here? A "few" MiBs for what operation
54 > > > > exactly? I say "few" because I know for some users this is actually not
55 > > > > just a blip before it's downloaded. So if this is possible to achieve,
56 > > > > in what scenarios is this going to be used (and is this often?).
57 > > >
58 > > > Last I checked, Gentoo aimed to support more users than the 'majority'
59 > > > of people with high-throughput Internet access. If there's no cost
60 > > > in doing things better, why not do them better?
61 > >
62 > > You didn't address the critical question, but instead just repeated what
63 > > I said.
64 > > So again, why do you need to read just the metadata?
65 >
66 > The original idea was to provide the ability of indexing remote packages
67 > without having a server-side cache available (or up-to-date). In order
68 > to do that, the package manager would need to fetch the metadata of all
69 > packages (but there's no necessity in fetching the whole packages).
70 > However, that's merely a possible future idea. It's not worth debating
71 > today.
72 >
73 > Today I really understood the point of avoiding premature optimization.
74 > Even if the change is practically zero-cost and harmless (as it's simply
75 > reordering files), it's going to cost you a lot of time because someone
76 > will keep nitpicking on it, even though any other order will not change
77 > anything.
78
79 Perhaps next time don't put as much emphasis on it. I can see now what
80 you aim for, but it simply raises more questions and concerns to me than
81 it resolves. There is nothing wrong with putting in such future
82 possibility though, if easily possible and not colliding with anything
83 else.
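
To make the "early-terminated pipeline" idea concrete, here is a minimal Python sketch. It assumes a plain tar container whose first member carries the metadata; the member names "metadata" and "image", and the helper names, are hypothetical and not taken from the spec. The reader counts how many bytes were actually pulled from the stream before the first member was available.

```python
import io
import tarfile


class CountingReader:
    """Read-only stream that records how many bytes were consumed."""

    def __init__(self, raw: bytes):
        self._buf = io.BytesIO(raw)
        self.consumed = 0

    def read(self, size=-1):
        chunk = self._buf.read(size)
        self.consumed += len(chunk)
        return chunk


def build_package(metadata: bytes, image: bytes) -> bytes:
    """Build an uncompressed tar with the metadata member first.

    The member names are placeholders chosen for this illustration.
    """
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as tar:
        for name, data in (("metadata", metadata), ("image", image)):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


def read_first_member(stream):
    """Return (name, contents) of the first member, then stop reading."""
    # "r|" is tarfile's forward-only stream mode: it never seeks backwards.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        member = tar.next()
        return member.name, tar.extractfile(member).read()


# A tiny metadata blob followed by a 1 MiB payload of zero bytes.
package = build_package(b"SLOT=0\n", bytes(1024 * 1024))
reader = CountingReader(package)
name, metadata = read_first_member(reader)
```

After this runs, reader.consumed is a small fraction of len(package). With a network stream in place of CountingReader, closing the connection at that point would presumably end the transfer early, which is the single-request case the quoted text describes.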
84
85 > > > > > 4. **Extending the format with OpenPGP signatures is non-trivial.**
86 > > > > > Depending on the implementation details, it either requires fetching
87 > > > > > additional detached signature, breaking backwards compatibility or
88 > > > > > introducing more custom logic to reassemble OpenPGP packets.
89 > > > >
90 > > > > I think one could add an extra key to the xpak that holds a gpg sig or
91 > > > > something. Perhaps this point is better phrased as that current binpkgs
92 > > > > don't have any validation options defined.
93 > > >
94 > > > ...which extra key would mean that the two disjoint implementations
95 > > > in use would need more custom code that extracts the signature,
96 > > > reconstructs signed data for verification and verifies it. Or, in other
97 > > > words, that user needs even more custom tooling to manually verify
98 > > > the package he just fetched.
99 > >
100 > > I don't see your point. If you define what the package format looks
101 > > like, you just need to implement that. There is no point in having a
102 > > binpkg format that Portage doesn't implement properly. Portage is
103 > > well-equipped to implement any of the approaches. A user should use
104 > > Portage to install a package. A poweruser could use a separate tool for
105 > > a scenario where he/she's in charge of keeping things sane. Relevancy?
106 > >
107 > > I just don't agree that extending the format is non-trivial. You seem
108 > > to have no arguments other than adding "custom logic", which is what you
109 > > eventually also do in the reference implementation of your new approach.
110 >
111 > The difference is that my format is transparent. You file(1) it, you
112 > see a .tar archive. You extract the archive, you see subarchives
113 > and .sig which are widely recognized. You don't have to read the spec,
114 > you don't have to get special tools. If you ever verified detached
115 > signature, you know how to proceed. If you didn't, you'll learn
116 > something you can reuse.
117
118 Totally agree.
119
120 > Now, implementing signatures on top of XPAK is more effort, and yields
121 > something that is more fragile and in the end doesn't benefit anyone.
122
123 I agree this would be more effort, and it'd get complicated in some aspects.
124 Whether no one benefits from it depends a bit on whether XPAK could
125 potentially give you performance boosts or memory/storage savings.
126
127 > > > > > 5. **Metadata is not compressed.** This is not a significant problem,
128 > > > > > it is just listed for completeness.
129 > > > > >
130 > > > > >
131 > > > > > Goals for a new container format
132 > > > > > --------------------------------
133 > > > > >
134 > > > > > The following goals have been set for a replacement format:
135 > > > > >
136 > > > > > 1. **The packages must remain contained in a single file.** As a matter
137 > > > > > of user convenience, it should be possible to transfer binary
138 > > > > > packages without having to use multiple files, and to install them
139 > > > > > from any location.
140 > > > > >
141 > > > > > 2. **The file format must be entirely based on common file formats,
142 > > > > > respecting best practices, with as little customization as necessary
143 > > > > > to satisfy the requirements.** In particular, it is unacceptable
144 > > > > > to create new binary formats.
145 > > > >
146 > > > > I take this as your personal opinion. I don't quite get why it is
147 > > > > unacceptable to create a new binary format though. In particular when
148 > > > > you're looking for efficiency, such format could serve your purposes.
149 > > > > As long as it's clearly defined, I don't see the problem with a binary
150 > > > > format either.
151 > > > > Could you add why it is you think binary formats are unacceptable here?
152 > > >
153 > > > Because custom binary formats require specialized tooling, and are
154 > > > a royal PITA when the user wants to do something that the author of
155 > > > specialized tooling just happened not to think worthwhile, or when
156 > > > the tooling is not available for some reason. And before you ask really
157 > > > silly questions, yes, I did fight binary packages over hex editor
158 > > > at some point.
159 > >
160 > > Which I still don't understand, to be frank. I think even Portage
161 > > exposes python APIs to get to the data.
162 >
163 > Compare the time needed to make a trivial (but unforeseen) change
164 > on a format that's transparent vs a format that requires you to learn
165 > its spec and/or API, write a program and debug it.
166
167 I was under the impression you could unpack a tbz2 into data and xpak,
168 then unpack both, modify the contents with an editor or whatever, and
169 then pack the whole stuff back into a tbz2 again. This can be done
170 worst case scenario by emerge -k <pkg>, modifying the vdb and quickpkg
171 <pkg> afterwards.
172 I know that with portage-utils you can do this easily with the qtbz2 and
173 qxpak commands. No need to do anything with a hex editor, or know
174 anything about how it's done.
175 Obvious advantage of your approach is that you don't need q* tools, but
176 can use tar instead. The editing is just as trivial though. In your case
177 you need a special procedure to reconstruct the binpkg should you want
178 to keep your special properties (label, order) which equates to q* tools
179 somewhat.
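
As an illustration of the "special procedure" needed to keep member order when editing a tar-based package, here is a minimal Python sketch. It assumes an uncompressed outer tar; the member names and the helpers make_tar, replace_member and member_names are hypothetical. It only preserves order and does not deal with the label.

```python
import copy
import io
import tarfile


def make_tar(members):
    """Build an uncompressed tar from (name, data) pairs, in the given order."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


def replace_member(archive: bytes, name: str, new_data: bytes) -> bytes:
    """Copy a tar archive, swapping one member's contents, keeping the order."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src:
            info = copy.copy(member)  # leave the source header untouched
            if not member.isfile():
                dst.addfile(info)
                continue
            data = src.extractfile(member).read()
            if member.name == name:
                data = new_data
                info.size = len(data)
            dst.addfile(info, io.BytesIO(data))
    return out.getvalue()


def member_names(archive: bytes):
    """List member names in archive order."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return tar.getnames()


original = make_tar([("metadata", b"old"), ("image", b"payload")])
edited = replace_member(original, "metadata", b"new metadata")
```

The point of the sketch is that plain "tar -x, edit, tar -c" leaves the order up to whatever the user passes on the command line, whereas a small helper like this keeps it fixed.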
180
181 > > > The most trivial case is an attempted recovery of a broken system.
182 > > > If you don't have Portage working and don't have portage-utils
183 > > > installed, do you really prefer a custom format which will require you
184 > > > to fetch and compile special tools? Or is one that can be processed
185 > > > with tools you're quite likely to have on every system, like tar?
186 > >
187 > > Well, I think the idea behind the original binpkg format was to use tar
188 > > directly on the files in emergency scenarios like these...
189 > > The assumption was bzip2 decompressor and tar being available.
190 > > I think it is an example of how you add something, while still allowing
191 > > to fallback on existing tools.
192 >
193 > Except progress in compressors has made it work less and less reliably.
194 > It's mostly an example how to be *clever*. However, being clever
195 > usually doesn't pay off in the long term, compared to doing things *in a
196 > simple way*.
197
198 We agree it is hackish, and we agree we can do without. You simply
199 exaggerate the problem, IMO, which mostly isn't there, because it works
200 fine today. It can also be solved today using shell tools.
201
202 % pkg="$EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2"; head -c "$(grep -abo 'XPAKPACK' "$pkg" | head -n 1 | sed 's/:.*$//')" "$pkg" | tar -jxf -
203
204 results in no warnings/errors from bzip2 about trailing garbage, possible
205 thanks to the spec being smart enough about this.
206
207 Not having to do this, when under stress and pressure to restore a
208 system to get it back into production, is a plus. Though, in that
209 scenario the trailing garbage warning wouldn't have been that bad
210 either.
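
The same split can be sketched in Python for anyone who prefers that over the shell pipeline above. This is only an illustration: like the one-liner, it looks for the first 'XPAKPACK' marker, so it assumes the marker does not occur by chance inside the compressed data. The helper name split_binpkg is made up, and the trailer used in the demo is a placeholder, not a valid xpak block.

```python
import bz2

XPAK_MAGIC = b"XPAKPACK"


def split_binpkg(blob: bytes):
    """Split a binpkg into (compressed tarball, trailing xpak block).

    Mirrors the head/grep one-liner: everything before the first
    XPAKPACK marker is treated as the compressed stream.
    """
    offset = blob.find(XPAK_MAGIC)
    if offset < 0:
        raise ValueError("no XPAKPACK marker found")
    return blob[:offset], blob[offset:]


# Demo on synthetic data: a bzip2 stream followed by a stand-in trailer.
payload = bz2.compress(b"file contents")
fake_xpak = XPAK_MAGIC + b"\x00" * 8 + b"XPAKSTOP"  # placeholder bytes only
compressed, trailer = split_binpkg(payload + fake_xpak)
restored = bz2.decompress(compressed)
```

The first part can then be handed to any bzip2 decompressor without it ever seeing the trailing block.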
211
212 > > > > > 3. **The file format should provide for partial fetching of binary
213 > > > > > packages.** It should be possible to easily fetch and read
214 > > > > > the package metadata without having to download the whole package.
215 > > > >
216 > > > > Like above, what is the use-case here? Why would you want this? I
217 > > > > think I'm missing something here.
218 > > >
219 > > > Does this harm anything? Even if there's little real use for this, is
220 > > > there any harm in supporting it? Are we supposed to do things the other
221 > > > way around with no benefit just because you don't see any real use for
222 > > > it?
223 > >
224 > > Well, you make a huge point out of it. And if it isn't used, then why
225 > > bother so much about it. Then it just looks like you want to use it as
226 > > an argument to get rid of something you just don't like.
227 > >
228 > > In my opinion you better just say "hey I would like to implement this
229 > > binpkg format, because I think it would be easier to support with
230 > > minimal tools since it doesn't have custom features". I would have
231 > > nothing against that. Simple and elegant is nice, you don't need to
232 > > invent arguments for that, in my opinion.
233 >
234 > The spec is now more focused on that.
235
236 Thank you, much appreciated.
237
238 Fabian
239
240
241 --
242 Fabian Groffen
243 Gentoo on a different level
