Gentoo Archives: gentoo-dev

From: Marius Mauch <genone@g.o>
To: gentoo-dev@l.g.o
Subject: [gentoo-dev] digest reorganization and enhancements
Date: Fri, 08 Oct 2004 16:43:33

This mail was first sent to dev-portage so it's written for that
audience, but it should be understandable for normal devs as well ;)

Short summary: current portage versions won't be able to handle any
modification to the digest format so we have to find a different way if
we want support for SHA1 or other algorithms.

And now the more detailed mail:

As was discussed again on -dev recently, we need more digest algorithms
for file verification. One way that would be halfway compatible would be
to add additional lines using the same syntax as the current md5
checksums to the digests and Manifests. However, that means a lot of
redundancy, as the filename and filesize would be duplicated for each
additional algorithm. It's also not trivial to do, as there are several
functions dealing with digests and they all parse them a bit differently
(I tried to add SHA1 support for digests and Manifests, took me about an
hour before I gave up). Also, as soon as we add non-MD5 lines to digests,
all currently released portage versions will blow up (as they will treat
the provided hash as an MD5 value, call it a bug if you want).

Instead I suggest we completely reorganize the digest system from
scratch by unifying the digests and the manifest files. As you all know
our tree is getting bigger and bigger with no end in sight. That
combined with the usual filesystem overhead causes a lot of wasted space
on many systems. By unifying the digests with the Manifests we could
kill more than 15,000 very small files at once (in the long run, as this
would require compatible portage versions for all users).
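
To get a feeling for the wasted space, here is a rough back-of-the-envelope sketch. The 4096-byte block size and the 70-byte digest size are assumptions picked for illustration, not measurements, and filesystems that pack small files together will behave differently. The function name is hypothetical.

```python
def slack_bytes(file_sizes, block_size=4096):
    """Bytes allocated but unused when every file occupies whole blocks.

    block_size is an assumed value; check your own filesystem.
    """
    total = 0
    for size in file_sizes:
        blocks = -(-size // block_size)  # ceiling division
        total += blocks * block_size - size
    return total


# Assumed workload: 15,000 digest files of 70 bytes each.
wasted = slack_bytes([70] * 15000)
print(wasted)
```

With those assumed numbers the slack is about 60 MB, and that does not count the inodes themselves.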

As for the new syntax, it should allow us to add new digest algorithms
to portage without changing the syntax. My current idea would be that
for each file in the tree and in SRC_URI we have a line specifying:
- the filename
- the filesize
- n digests (consisting of algorithmname and the checksum)
To maintain compatibility and support future enhancements each of these
lines has to be prefixed with a (set of) keyword(s) (FILE or DIGEST or
Example lines could be:

SRC_URI portage-2.0.51_rc7.tar.bz2 274572 MD5 1234 SHA1 abcd RMD160 9876
EBUILD portage-2.0.51_rc7.ebuild 11806 MD5 xyz SHA1 fifteen

(using fake checksums for readability).
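
A minimal sketch of how such lines could be parsed and written, just to show that the format is easy to handle. This is not existing portage code: `ManifestEntry`, `parse_line` and `format_line` are hypothetical names, and the sketch assumes whitespace-separated fields and filenames without spaces.

```python
from typing import NamedTuple


class ManifestEntry(NamedTuple):
    kind: str      # leading keyword, e.g. "SRC_URI" or "EBUILD"
    name: str      # filename
    size: int      # filesize in bytes
    digests: dict  # algorithm name -> checksum string


def parse_line(line: str) -> ManifestEntry:
    """Parse one line of the proposed format (hypothetical helper)."""
    fields = line.split()
    # Need keyword, filename, filesize and at least one (algorithm, checksum) pair.
    if len(fields) < 5 or (len(fields) - 3) % 2 != 0:
        raise ValueError(f"malformed line: {line!r}")
    kind, name, size = fields[0], fields[1], int(fields[2])
    # The remaining fields alternate: algorithm name, checksum.
    digests = dict(zip(fields[3::2], fields[4::2]))
    return ManifestEntry(kind, name, size, digests)


def format_line(entry: ManifestEntry) -> str:
    """Write an entry back out in the same layout."""
    pairs = " ".join(f"{algo} {value}" for algo, value in entry.digests.items())
    return f"{entry.kind} {entry.name} {entry.size} {pairs}"


entry = parse_line(
    "SRC_URI portage-2.0.51_rc7.tar.bz2 274572 MD5 1234 SHA1 abcd RMD160 9876"
)
print(entry.digests["SHA1"])
```

Since the digests are just name/value pairs after the filesize, a new algorithm only adds two more fields to a line and the parser does not need to change.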

Maybe the system can also be extended to incorporate GLEP 25 without
adding a ton of new files, I'd need some input from Brian on that issue.

The biggest problem for this proposal is of course compatibility. A rough
transition plan could be:
- keep digests as they are now
- add the new format to Manifests (additional to the current MD5 lines)
- support the new format in 2.0.52 (use it optionally for verification)
- use it for verification in 2.1 by default (and drop support for the 
old system)
- exclude the old digests from `emerge --sync` in 2.1
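
For the verification steps, a version that supports the new format could check every algorithm it knows and skip the ones it doesn't, instead of treating every hash as MD5. The following is only a sketch under that assumption: the `KNOWN` table and the `verify` function are hypothetical, and the rule that at least one known digest must match is my own choice for the example.

```python
import hashlib

# Algorithms this (hypothetical) portage version understands.
KNOWN = {"MD5": hashlib.md5, "SHA1": hashlib.sha1}


def verify(data: bytes, size: int, digests: dict) -> bool:
    """Check file contents against the size and all digests we understand."""
    if len(data) != size:
        return False
    checked = 0
    for algo, expected in digests.items():
        hasher = KNOWN.get(algo)
        if hasher is None:
            continue  # unknown algorithm: skip it, never misread it as MD5
        if hasher(data).hexdigest() != expected.lower():
            return False
        checked += 1
    # Assumed policy: refuse to pass a file if nothing could be checked.
    return checked > 0


data = b"example distfile contents"
digests = {
    "MD5": hashlib.md5(data).hexdigest(),
    "SHA1": hashlib.sha1(data).hexdigest(),
    "FUTUREHASH": "not-understood-by-this-version",
}
print(verify(data, len(data), digests))
```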

And finally a summarizing list of reasons for the format:
- keep all checksums of a package in one place
- removes one level of indirection for signing
- digest generation currently recreates the Manifest anyway
- removes a lot of very small files from the tree
- allows for easy addition of new digest algorithms
- any syntax modification to the current digest files brings compatibility
problems with all currently existing portage versions while Manifest
changes do not
- potential to discover file collisions more easily (currently you can
have the same file in two digests with different checksums, not a real
problem yet though)
- removes redundancy for common files
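
To illustrate the collision point: once all checksums of a package are in one place, finding a file that is listed with two different checksums becomes a simple grouping exercise. A small sketch, assuming the records have already been read into (source, filename, checksum) tuples; `find_collisions` is a hypothetical name and the checksums are fake.

```python
from collections import defaultdict


def find_collisions(records):
    """Return files that appear with more than one distinct checksum.

    records: iterable of (source, filename, checksum) tuples, where
    source names the digest or Manifest the entry came from.
    """
    seen = defaultdict(dict)  # filename -> {checksum: [sources]}
    for source, filename, checksum in records:
        seen[filename].setdefault(checksum, []).append(source)
    return {name: sums for name, sums in seen.items() if len(sums) > 1}


records = [
    ("digest-foo-1.0", "foo-1.0.tar.gz", "1234"),
    ("digest-foo-1.0-r1", "foo-1.0.tar.gz", "abcd"),  # same file, other checksum
    ("digest-foo-1.1", "foo-1.1.tar.gz", "9876"),
]
print(find_collisions(records))
```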

Let the discussions begin.