Gentoo Archives: gentoo-dev

From: Mike Gilbert <floppym@g.o>
To: Gentoo Dev <gentoo-dev@l.g.o>
Subject: Re: [gentoo-dev] proposal: use only one hash function in manifest files
Date: Tue, 12 Apr 2022 12:45:39
Message-Id: CAJ0EP42JzSuCwMjLA_SdXh+ORAkWfeDRcVF0LD8urHjowEtGJg@mail.gmail.com
In Reply to: Re: [gentoo-dev] proposal: use only one hash function in manifest files by Joshua Kinard
1 On Mon, Apr 11, 2022 at 7:14 PM Joshua Kinard <kumba@g.o> wrote:
2 >
3 > On 4/5/2022 17:49, Jason A. Donenfeld wrote:
4 > > Hi Matt,
5 > >
6 > > On Tue, Apr 5, 2022 at 10:38 PM Matt Turner <mattst88@g.o> wrote:
7 > >>
8 > >> On Tue, Apr 5, 2022 at 12:30 PM Jason A. Donenfeld <zx2c4@g.o> wrote:
9 > >>> By the way, we're not currently _checking_ two hash functions during
10 > >>> src_prepare(), are we?
11 > >>
12 > >> I don't know, but the hash-checking is definitely checked before src_prepare().
13 > >
14 > > Er, during the builtin fetch phase. Anyway, you know what I meant. :)
15 > >
16 > > Anyway, looking at the portage source code, to answer my own question,
17 > > it looks like the file is actually being read twice and both hashes
18 > > computed. I would have at least expected an optimization like:
19 > >
20 > > hash1_init(&hash1);
21 > > hash2_init(&hash2);
22 > > for chunks in file:
23 > > hash1_update(&hash1, chunk);
24 > > hash2_update(&hash2, chunk);
25 > > hash1_final(&hash1, out1);
26 > > hash2_final(&hash2, out2);
27 > >
28 > > But actually what's happening is the even less efficient:
29 > >
30 > > hash1_init(&hash1);
31 > > for chunks in file:
32 > > hash1_update(&hash1, chunk);
33 > > hash1_final(&hash1, out1);
34 > > hash2_init(&hash2);
35 > > for chunks in file:
36 > > hash2_update(&hash2, chunk);
37 > > hash1_final(&hash2, out2);
38 > >
39 > > So the file winds up being open and read twice. For huge tarballs like
40 > > chromium or libreoffice...
41 > >
42 > > But either way you do it - the missed optimization above or the
43 > > unoptimized reality below - there's still twice as much work being
44 > > done. This is all unless I've misread the source code, which is
45 > > possible, so if somebody knows this code well and I'm wrong here,
46 > > please do speak up.
47 >
48 > Not to go off-topic, but where in Portage's source is this logic at? It
49 > seems like an easy fix for a slightly more efficient Portage.
50
51 I believe it's the portage.checksum.verify_all() function.
52
53 https://gitweb.gentoo.org/proj/portage.git/tree/lib/portage/checksum.py?h=portage-3.0.30#n471