Gentoo Archives: gentoo-dev

From: Martin Pool <mbp@×××××.org>
To: gentoo-dev@g.o
Subject: [gentoo-dev] Re: proposed md5sum change
Date: Mon, 23 Jun 2003 02:41:58
On Wed, 11 Jun 2003 11:02:02 -0500, Brian Harring wrote:

> Hola all,
> Straight to the point, I propose instead of md5summing the compressed
> distfile, we md5sum the actual data, the tarball.
Speaking as somebody who has worked on rsync and librsync: I agree, I think that would be a big improvement.

The uncompressed form is the natural and efficient place to do delta compression. This implies that the client, after applying a patch, ends up with an uncompressed (e.g. .tar) file. Making the client recompress it is wasteful, because compression is expensive and in any case the file is just going to be uncompressed and extracted again. Not only is it wasteful, it is also hard to do correctly: as other people have noted, compression is not very reproducible.

This implies that the script which unpacks and builds the source needs to be able to accept the unpacked form, rather than only the packed form as at present. That doesn't sound terribly hard.

Some people might want to store packages in compressed form because they're low on disk, and so might want to bzip them up again after applying the patch. On the other hand, some people might want to keep them uncompressed because their CPU is slow. On the third hand, some people might want to *recompress* everything into bz2 even if it was originally .gz. Any of these can be supported through some future mechanism; they don't need to determine the download system.

Seemant Kuleen wrote:
> Now, the promised concern bit. Unfortunately, while the majority of the
> packages do come in a compressed tarball format, there are many (enough to
> make it a corner case of some concern) packages which do not. Off the top
> of my head, I can think of .Z (forget which package), .rpm
> (redhat-artwork), .bin (realplayer). And in some cases, we just get an
> uncompressed README file in the SRC_URI (or the wacom.c file in xfree,
> though I'm not certain of it right this moment).
.Z files can be uncompressed and handled as for gzip (I think gzip handles them in fact.) .zip, .rpm, or self-extracting .exe files can also be uncompressed and diffed, at least in principle. Uncompressed READMEs, patches or .c files are just too easy. :-)

If you don't recognize the format, you can try to do a delta on the binary form. If the delta is too big, drop it. Experience on Debian has shown that compiled binaries in general do not delta-compress very well, so I think not being able to uncompress them is not a terrible thing.

The point: Gentoo should distribute the md5sums for both the compressed and uncompressed forms of packages. They are checked in that order; either is sufficient.

Regular non-delta downloads will proceed as usual, and the md5sum can be checked immediately after download. There is no added cost.

Patch downloads can be done as follows:

- download the xdelta
- uncompress the old file, pipe it into 'xdelta patch', store the result
- check the result against the uncompressed MD5sum

As far as I can see this removes any need for a special deltup file format. Simply send xdeltas. A great advantage is that xdeltas are useful to people other than Gentoo, so people upstream or mirrors may be more willing to distribute them alongside the original source.

Much as I love the idea of deltup, I think the current code is a bit messy and making up a new format is unjustified.
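The two-checksum check described above might look roughly like the following Python sketch. It is only an illustration, not Portage code: the function names, the `OPENERS` suffix table, and the return values are all assumptions made for the example. The compressed sum is tried first, then the uncompressed one, and either match is accepted.

```python
import bz2
import gzip
import hashlib

# Hypothetical suffix table: how to open a distfile so that reads yield
# the uncompressed data. Anything not listed is read as-is.
OPENERS = {".gz": gzip.open, ".tgz": gzip.open, ".bz2": bz2.open}


def md5_of_stream(stream, chunk_size=1 << 16):
    """Return the hex MD5 of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_distfile(path, compressed_md5=None, uncompressed_md5=None):
    """Check the compressed sum first, then the uncompressed one.

    Returns "compressed" or "uncompressed" to name the sum that matched,
    or None if neither did.
    """
    if compressed_md5 is not None:
        with open(path, "rb") as raw:
            if md5_of_stream(raw) == compressed_md5:
                return "compressed"
    if uncompressed_md5 is not None:
        # Decompress on the fly; nothing is written back to disk.
        opener = next((o for s, o in OPENERS.items() if path.endswith(s)), open)
        with opener(path, "rb") as data:
            if md5_of_stream(data) == uncompressed_md5:
                return "uncompressed"
    return None
```

A freshly downloaded `.tar.gz` would normally match on the first check. A file rebuilt from a delta and stored as plain `.tar`, or recompressed locally with different settings, would fail the first check and pass the second.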
> In terms of performance of the md5summing, it would still likely be i/o
> limited- decompression would be done in memory after all.
The approach above is much *more* efficient than deltup, which makes an extra roundtrip to bz2 format.

What have I missed?

--
Martin

If you don't know how to code, then you don't know how to design the
software either. Period. You can only cause trouble.
        -- Havoc Pennington,

--
gentoo-dev@g.o mailing list

