Gentoo Archives: gentoo-dev

From: "Tiziano Müller" <dev-zero@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation
Date: Sun, 08 Feb 2009 21:49:10
Message-Id: 1234129737.18160.191.camel@localhost
In Reply to: Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation by Zac Medico
1 On Sunday, 08.02.2009, at 12:36 -0800, Zac Medico wrote:
2 > -----BEGIN PGP SIGNED MESSAGE-----
3 > Hash: SHA1
4 >
5 > Tiziano Müller wrote:
6 > > On Sunday, 08.02.2009, at 00:59 -0800, Zac Medico wrote:
7 > >> -----BEGIN PGP SIGNED MESSAGE-----
8 > >> Hash: SHA1
9 > >>
10 > >> Tiziano Müller wrote:
11 > >>> On Saturday, 07.02.2009, at 15:23 -0800, Zac Medico wrote:
12 > >>>> -----BEGIN PGP SIGNED MESSAGE-----
13 > >>>> Hash: SHA1
14 > >>>>
15 > >>>> Tiziano Müller wrote:
16 > >>>>> On Monday, 02.02.2009, at 12:34 -0800, Zac Medico wrote:
17 > >>>> I like that idea. That way it's not necessary to bump the EAPI in
18 > >>>> order to change the hash function. So, a typical DIGESTS value might
19 > >>>> look like this:
20 > > You still have to bump the EAPI in case you want to use a new hash not
21 > > already available now (like SHA-3). The advantage of noting the used
22 > > hash is that new PMs can handle old metadata cache.
23 >
24 > That's true.
25 >
26 > >>>> SHA1 02021be38b a28b191904 3992945426 6ec21b29a3
27 > >>> Having slept on it, I don't think that truncating a hash is a good
28 > >>> idea (truncating it from 40 to 10 digits makes the possibility of
29 > >>> collisions much much higher).
30 > >> The probability of collision is much higher, but it's still
31 > >> relatively small. Given the "avalanche effect" that is typical of
32 > >> cryptographic hash functions, it's extremely unlikely that collision
33 > >> will occur in such a way that it will cause a problem for cache
34 > >> validation.
35 > > The "avalanche effect" as I understood it is required for a hash
36 > > function to avoid simple calculations of collisions (what the diffusion
37 > > is for crypto algorithms). So, small changes should affect as many
38 > > numbers in the hash as possible. But you don't have only small changes
39 > > here in case somebody patches an eclass, so, the only thing which counts
40 > > is the probability of a collision.
41 >
42 > Well, the avalanche effect helps in the sense that the leftmost 10
43 > digits would serve approximately as well as any other 10 digits out
44 > of all of them. But you're right about the probability of a
45 > collision being what really matters. With 10 hex digits, we've got a
46 > space of 16^10 = 1.1e12 possible combinations. Given a space that
47 > large, the probability of a collision is pretty small.
48 >
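The arithmetic quoted above can be checked with a short Python sketch. It assumes the truncated digests behave like uniformly random 10-digit hex strings; the function name and the sample count of 1000 inputs are illustrative and do not come from the thread.

```python
import math

HEX_DIGITS = 10  # length of the truncated digest discussed above


def collision_probability(n_items: int, hex_digits: int = HEX_DIGITS) -> float:
    """Birthday-bound approximation: chance that at least two of n_items
    distinct inputs share the same truncated digest."""
    space = 16 ** hex_digits
    # 1 - exp(-n(n-1) / (2N)), computed with expm1 for accuracy near zero
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))


space = 16 ** HEX_DIGITS   # 1099511627776, roughly 1.1e12
p_single = 1 / space       # a changed file matching one specific old digest
p_any = collision_probability(1000)  # any pair among 1000 distinct inputs
```

For cache validation the relevant figure is probably `p_single`, since an entry is only wrongly kept if the new content of a file hashes to the same truncated value as its old content.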
49 > >>> But if you want to go this way, I'd say you should use something like
50 > >>> SHA1t (t for truncated) to make sure we can use full hashes once we feel
51 > >>> it's appropriate.
52 > >> We could, but I think SHA1 would also be fine since one can infer
53 > >> from the length of the string that it's been truncated.
54 > > No, guessing is a bad thing here because it could be truncated because
55 > > of faulty metadata. But the main motivation is that if you write SHA1
56 > > everyone reading it expects it to be a full SHA1 hash, which it isn't.
57 >
58 > Well, if the metadata is faulty then the digests are unlikely to
59 > match and the cache will be discarded anyway as invalid. However, I
60 > think your point is still somewhat valid, so SHA1t is fine with me
61 > if that makes more people happy. Does anyone else have a preference
62 > here?
63 >
64 > > But if your target is to reduce the size of the metadata cache, why
65 > > store the hashes of the eclasses in the ebuild's metadata and not in a
66 > > separate dir? They have to be the same for every ebuild, don't they?
67 > > In case you have an average number of eclasses which is bigger than 4,
68 > > you can even store the full hash with less space used than with
69 > > truncated hashes for all eclasses.
70 >
71 > The problem with having eclass integrity data shared in a separate
72 > file is that it creates a requirement for all cache entries which
73 > reference the same eclasses to be consistent with one another. This
74 > means that a single cache entry can no longer be updated atomically.
75 > For example, before updating the shared eclass integrity data, you'd
76 > want to make sure that you first discard all of the cache entries
77 > which reference it. Although it can be done this way, I think it's
78 > much more convenient to have all of the integrity data encapsulated
79 > within each individual cache entry.
80 Ok, let me see if I get this: Since parts of the content of a
81 metadata-entry (like the DEPEND/RDEPEND vars) depend on the contents of
82 the eclass used by the time a cache entry got generated, you want to
83 store the eclass' hash in the ebuild entry to make sure the entry gets
84 invalidated once the eclass changes. Is that correct?
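
As a minimal sketch of that reading, the following Python shows a cache entry that carries its own truncated digests and is discarded as soon as any input changes. The function names, the "hash name followed by digests" layout, and the ordering of the inputs (ebuild first, then eclasses) are assumptions for illustration and do not describe an actual package manager implementation.

```python
import hashlib

DIGEST_LEN = 10  # hex digits kept, matching the truncated example in this thread


def short_sha1(data: bytes, length: int = DIGEST_LEN) -> str:
    """Return the leftmost `length` hex digits of the SHA-1 of data."""
    return hashlib.sha1(data).hexdigest()[:length]


def make_digests(sources: list[bytes]) -> str:
    """Build a DIGESTS-style value: hash name, then one digest per input."""
    return " ".join(["SHA1t"] + [short_sha1(s) for s in sources])


def cache_entry_valid(digests: str, sources: list[bytes]) -> bool:
    """A cache entry is valid only if every stored digest still matches."""
    name, *stored = digests.split()
    if name != "SHA1t":
        return False  # unknown hash name: treat the entry as stale
    return stored == [short_sha1(s) for s in sources]


ebuild = b"inherit example\nDEPEND=\"dev-lang/python\"\n"   # hypothetical content
eclass = b"RDEPEND=\"${DEPEND}\"\n"                         # hypothetical content
entry_digests = make_digests([ebuild, eclass])
```

Because the digests live inside the entry, each entry can be checked and regenerated on its own, without consulting shared eclass data.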
85
86
87 --
88 -------------------------------------------------------
89 Tiziano Müller
90 Gentoo Linux Developer, Council Member
91 Areas of responsibility:
92 Samba, PostgreSQL, CPP, Python, sysadmin
93 E-Mail : dev-zero@g.o
94 GnuPG FP : F327 283A E769 2E36 18D5 4DE2 1B05 6A63 AE9C 1E30
