On 21-11-2018 10:33:18 +0100, Michał Górny wrote:
> > > > > 2. **The format relies on an obscure compressor feature of ignoring
> > > > > trailing garbage**. While this behavior is traditionally implemented
> > > > > by many compressors, the original reasons for it have long become
> > > > > irrelevant and it is not surprising that new compressors do not
> > > > > support it. In particular, Portage already hit this problem twice:
> > > > > once when users replaced bzip2 with the parallel-capable pbzip2
> > > > > implementation [#PBZIP2]_, and a second time when support for the zstd
> > > > > compressor was added [#ZSTD]_.
> > > >
> > > > I think this is actually the result of a rather opportunistic
> > > > implementation. The fault is that we chose to use an extension that
> > > > suggests the file is a regular compressed tarball.
> > > > When one detects that a file is xpak padded, it is trivial to feed the
> > > > decompressor just the relevant part of the datastream. The format
> > > > itself isn't bad, and doesn't rely on obscure behaviour.
> > >
> > > Except if you don't have the proper tools installed. In which case
> > > the 'opportunistic' behavior made it possible to extract the contents
> > > without special tools... except when it actually happens not to work
> > > anymore. Roy's reply indicates that there is actually interest in this
> > > design feature.
> >
> > Your point is that the format is broken (== relies on an obscure compressor
> > feature). My point is that the format simply requires a special tool.
> > The fact that we prefer to use existing tools doesn't imply in any way
> > that the format is broken to me.
> > I think you should rewrite your point to mention that you don't want to
> > use a tool that doesn't exist in @system (?) to unpack a binpkg. My
> > guess is that you could use some head/tail magic in a script if the
> > trailing block is upsetting the decompressor.
> >
> > I'm not saying this may look ugly, I'm just saying that your point seems
> > biased.
>
> I've spent significant effort rewriting those points to make it clear
> what the problem is, and separating it from other changes 'worth doing
> while we're changing stuff'. Hope that satisfies your nitpicking.

Yes it does, thank you.

> > > > > 3. **Placing metadata at the end of the file makes partial fetches
> > > > > complex.** While it is technically possible to obtain package
> > > > > metadata remotely without fetching the whole package, it usually
> > > > > requires e.g. 2-3 HTTP requests with a rather complex driver. For
> > > > > comparison, if metadata were placed at the beginning of the file,
> > > > > an early-terminated pipeline with a single fetch request would suffice.
> > > >
> > > > I think this point needs to be quantified somewhat as to why it is so
> > > > important.
> > > > I may be wrong, but the average binpkg is small, <1MiB, bigger packages
> > > > are <50MiB.
> > > > So what is the gain to be saved here? A "few" MiBs for what operation
> > > > exactly? I say "few" because I know for some users this is actually not
> > > > just a blip before it's downloaded. So if this is possible to achieve,
> > > > in what scenarios is this going to be used (and is this often?).
> > >
> > > Last I checked, Gentoo aimed to support more users than the 'majority'
> > > of people with high-throughput Internet access. If there's no cost
> > > in doing things better, why not do them better?
> >
> > You didn't address the critical question, but instead just repeated what
> > I said.
> > So again, why do you need to read just the metadata?
>
> The original idea was to provide the ability to index remote packages
> without having a server-side cache available (or up-to-date). In order
> to do that, the package manager would need to fetch the metadata of all
> packages (but there's no necessity in fetching the whole packages).
> However, that's merely a possible future idea. It's not worth debating
> today.
>
> Today I really understood the point of avoiding premature optimization.
> Even if the change is practically zero-cost and harmless (as it's simply
> reordering files), it's going to cost you a lot of time because someone
> will keep nitpicking on it, even though any other order will not change
> anything.

Perhaps next time don't put as much emphasis on it. I can see now what
you aim for, but it simply raises more questions and concerns for me than
it resolves. There is nothing wrong with putting in such a future
possibility though, if it is easily possible and doesn't collide with
anything else.
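For what it's worth, the "2-3 requests" figure is easy to see from the
current trailer layout. Below is a minimal sketch, assuming (as I
understand the tbz2 format) that the trailer is the xpak blob followed by
its 4-byte big-endian length and the literal magic "STOP". All file names
and contents are made up so the example is self-contained; each tail read
on the local file stands in for one HTTP Range request:

```shell
# Build a fake tbz2-style file: payload, xpak blob, 4-byte length, "STOP".
# (Bytes are synthetic; only the trailer layout matters here.)
tmp=$(mktemp -d) && cd "$tmp"
printf 'COMPRESSED-PAYLOAD' > pkg.tbz2
xpak='XPAKPACKmetaXPAKSTOP'
printf '%s' "$xpak" >> pkg.tbz2
len=${#xpak}
# append the xpak length as 4 big-endian bytes, then the STOP magic
oct=$(printf '\\%03o\\%03o\\%03o\\%03o' \
      $((len>>24&255)) $((len>>16&255)) $((len>>8&255)) $((len&255)))
printf "$oct" >> pkg.tbz2
printf 'STOP' >> pkg.tbz2

# "Request" 1: the last 8 bytes yield the xpak length + STOP magic.
xlen=$(tail -c 8 pkg.tbz2 | head -c 4 | od -An -tu1 |
       awk 'NR==1 {print $1*16777216 + $2*65536 + $3*256 + $4}')
# "Request" 2: the last xlen+8 bytes contain the xpak blob itself.
tail -c "$((xlen + 8))" pkg.tbz2 | head -c "$xlen" > meta.xpak
cat meta.xpak   # XPAKPACKmetaXPAKSTOP
```

Over HTTP the two tail reads become two Range requests (bytes=-8, then
bytes for the blob), with a third request only if you also want the
payload; that bookkeeping is the "driver" complexity being discussed.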

> > > > > 4. **Extending the format with OpenPGP signatures is non-trivial.**
> > > > > Depending on the implementation details, it either requires fetching
> > > > > an additional detached signature, breaking backwards compatibility or
> > > > > introducing more custom logic to reassemble OpenPGP packets.
> > > >
> > > > I think one could add an extra key to the xpak that holds a gpg sig or
> > > > something. Perhaps this point is better phrased as that current binpkgs
> > > > don't have any validation options defined.
> > >
> > > ...which extra key would mean that the two disjoint implementations
> > > in use would need more custom code that extracts the signature,
> > > reconstructs the signed data for verification and verifies it. Or, in
> > > other words, that the user needs even more custom tooling to manually
> > > verify the package he just fetched.
> >
> > I don't see your point. If you define what the package format looks
> > like, you just need to implement that. There is no point in having a
> > binpkg format that Portage doesn't implement properly. Portage is
> > well-equipped to implement any of the approaches. A user should use
> > Portage to install a package. A poweruser could use a separate tool for
> > a scenario where he/she's in charge of keeping things sane. Relevancy?
> >
> > I just don't agree that extending the format is non-trivial. You seem
> > to have no arguments other than adding "custom logic", which is what you
> > eventually also do in the reference implementation of your new approach.
>
> The difference is that my format is transparent. You file(1) it, you
> see a .tar archive. You extract the archive, you see subarchives
> and a .sig which are widely recognized. You don't have to read the spec,
> you don't have to get special tools. If you ever verified a detached
> signature, you know how to proceed. If you didn't, you'll learn
> something you can reuse.

Totally agree.

> Now, implementing signatures on top of XPAK is more effort, and yields
> something that is more fragile and in the end doesn't benefit anyone.

I agree this would be more effort, and it'd get complicated in some aspects.
Whether no one benefits from it depends a bit on whether XPAK could
potentially give you performance boosts or memory/storage savings.

> > > > > 5. **Metadata is not compressed.** This is not a significant problem;
> > > > > it is just listed for completeness.
> > > > >
> > > > >
> > > > > Goals for a new container format
> > > > > --------------------------------
> > > > >
> > > > > The following goals have been set for a replacement format:
> > > > >
> > > > > 1. **The packages must remain contained in a single file.** As a matter
> > > > > of user convenience, it should be possible to transfer binary
> > > > > packages without having to use multiple files, and to install them
> > > > > from any location.
> > > > >
> > > > > 2. **The file format must be entirely based on common file formats,
> > > > > respecting best practices, with as little customization as necessary
> > > > > to satisfy the requirements.** In particular, it is unacceptable
> > > > > to create new binary formats.
> > > >
> > > > I take this as your personal opinion. I don't quite get why it is
> > > > unacceptable to create a new binary format though. In particular when
> > > > you're looking for efficiency, such a format could serve your purposes.
> > > > As long as it's clearly defined, I don't see the problem with a binary
> > > > format either.
> > > > Could you add why you think binary formats are unacceptable here?
> > >
> > > Because custom binary formats require specialized tooling, and are
> > > a royal PITA when the user wants to do something that the author of
> > > the specialized tooling just happened not to think worthwhile, or when
> > > the tooling is not available for some reason. And before you ask really
> > > silly questions, yes, I did fight binary packages over a hex editor
> > > at some point.
> >
> > Which I still don't understand, to be frank. I think even Portage
> > exposes Python APIs to get to the data.
>
> Compare the time needed to make a trivial (but unforeseen) change
> on a format that's transparent vs a format that requires you to learn
> its spec and/or API, write a program and debug it.

I was under the impression you could unpack a tbz2 into data and xpak,
then unpack both, modify the contents with an editor or whatever, and
then pack the whole thing back into a tbz2 again. In the worst case this
can be done with emerge -k <pkg>, modifying the vdb, and running quickpkg
<pkg> afterwards.
I know that with portage-utils you can do this easily with the qtbz2 and
qxpak commands. No need to do anything with a hex editor, or know
anything about how it's done.
The obvious advantage of your approach is that you don't need the q*
tools, but can use tar instead. The editing is just as trivial though.
In your case you need a special procedure to reconstruct the binpkg
should you want to keep your special properties (label, order), which
equates to the q* tools somewhat.
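To make the comparison concrete, here is roughly what that edit cycle
would look like on a nested-tar container. Everything below is
hypothetical (the member names, the gzip compression and the pkg.tar file
name are made up for illustration; the spec defines the real layout):

```shell
tmp=$(mktemp -d) && cd "$tmp"

# Build a hypothetical nested-tar binpkg: an image archive plus a
# metadata archive, wrapped in one uncompressed outer tar.
mkdir image metadata
echo 'payload' > image/file.txt
echo 'app-misc' > metadata/CATEGORY
tar -czf image.tar.gz image
tar -czf metadata.tar.gz metadata
tar -cf pkg.tar image.tar.gz metadata.tar.gz

# Edit cycle with nothing but tar: unpack the outer archive, tweak the
# metadata, repack the member, and rebuild the outer tar with members
# in the same order to preserve the original layout.
tar -xf pkg.tar
tar -xzf metadata.tar.gz
echo 'sys-apps' > metadata/CATEGORY
tar -czf metadata.tar.gz metadata
tar -cf pkg.tar image.tar.gz metadata.tar.gz
```

The only "special procedure" left is listing the members in the right
order on the final tar -cf; that is the bookkeeping the q* tools hide
for the xpak case.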

> > > The most trivial case is an attempted recovery of a broken system.
> > > If you don't have Portage working and don't have portage-utils
> > > installed, do you really prefer a custom format which will require you
> > > to fetch and compile special tools? Or one that can be processed
> > > with tools you're quite likely to have on every system, like tar?
> >
> > Well, I think the idea behind the original binpkg format was to use tar
> > directly on the files in emergency scenarios like these...
> > The assumption was bzip2 decompressor and tar being available.
> > I think it is an example of how you add something, while still allowing
> > a fallback to existing tools.
>
> Except progress in compressors has made it work less and less reliably.
> It's mostly an example of how to be *clever*. However, being clever
> usually doesn't pay off in the long term, compared to doing things *in a
> simple way*.

We agree it is hackish, and we agree we can do without it. You simply
exaggerate the problem, IMO, which mostly isn't there, because it works
fine today. It can also be solved today using shell tools.

% head -c `grep -abo 'XPAKPACK' $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | sed 's/:.*$//'` $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | tar -jxf -

results in no warnings/errors from bzip2 about trailing garbage, possible
thanks to the spec being smart enough about this.

Not having to do this, when under stress and pressure to restore a
system to get it back into production, is a plus. Though, in that
scenario the trailing garbage warning wouldn't have been that bad
either.
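The same grep offset also recovers the metadata half, should the xpak
itself be what you're after. A sketch on a synthetic file (the
XPAKPACK/XPAKSTOP strings are the real xpak magic; the payload and
metadata bytes are fabricated so the example is self-contained):

```shell
# Synthesize a tbz2-style file: a real tar.bz2 with a fake xpak blob
# appended (the metadata bytes are made up for this example).
tmp=$(mktemp -d) && cd "$tmp"
echo 'hello' > file.txt
tar -cjf data.tar.bz2 file.txt
cat data.tar.bz2 > pkg.tbz2
printf 'XPAKPACKfake-metadataXPAKSTOP' >> pkg.tbz2

# Split on the XPAKPACK magic: bytes before it are the compressed tar,
# bytes from it onward are the metadata blob.
off=$(grep -abo 'XPAKPACK' pkg.tbz2 | head -n1 | cut -d: -f1)
head -c "$off" pkg.tbz2 > data-part.tar.bz2
tail -c +"$((off + 1))" pkg.tbz2 > meta-part.xpak

tar -tjf data-part.tar.bz2   # lists file.txt, no trailing-garbage warning
cat meta-part.xpak
```

One caveat worth keeping in mind: the magic could in principle also
occur inside the compressed stream, so a robust tool would read the
offset from the trailer rather than grep for it; for a quick recovery
under pressure the grep is good enough.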

> > > > > 3. **The file format should provide for partial fetching of binary
> > > > > packages.** It should be possible to easily fetch and read
> > > > > the package metadata without having to download the whole package.
> > > >
> > > > Like above, what is the use-case here? Why would you want this? I
> > > > think I'm missing something here.
> > >
> > > Does this harm anything? Even if there's little real use for this, is
> > > there any harm in supporting it? Are we supposed to do things the other
> > > way around with no benefit just because you don't see any real use for
> > > it?
> >
> > Well, you make a huge point out of it. And if it isn't used, then why
> > bother so much about it. Then it just looks like you want to use it as
> > an argument to get rid of something you just don't like.
> >
> > In my opinion you'd better just say "hey, I would like to implement this
> > binpkg format, because I think it would be easier to support with
> > minimal tools since it doesn't have custom features". I would have
> > nothing against that. Simple and elegant is nice; you don't need to
> > invent arguments for that, in my opinion.
>
> The spec is now more focused on that.

Thank you, much appreciated.

Fabian


--
Fabian Groffen
Gentoo on a different level