Gentoo Archives: gentoo-dev

From: Madhu <enometh@××××.net>
To: gentoo-dev@l.g.o
Subject: [gentoo-dev] Re: About EGO_SUM
Date: Thu, 09 Jun 2022 06:18:30
Message-Id: 20220609.114841.133036565564988149.enometh@meer.net
In Reply to: Re: [gentoo-dev] About EGO_SUM by "Robin H. Johnson"
1 * "Robin H. Johnson" <robbat2-20220608T184338-394361540Z @orbis-terrarum.net> :
2 Wrote on Wed, 8 Jun 2022 20:42:48 +0000:
3 > EGO_SUM vs dependency tarballs:
4 > - bloats ebuilds
5 > - bloats Manifests
6 > - bloats metadata/md5-cache/ (SRC_URI etc)
7 > - doesn't bloat mirrors with gentoo-unique distfiles
8 > - EGO_SUM is verifiable/reproducible from Upstream Go systems
9 > - less downloads on upgrades (only changed Go deps, not entire dep tarballs)
10 >
11 > EGO_SUM data right now adds, to every user's system:
12 > - 2.6MB of text to ebuilds (340k after de-dupe)
13 > - 7MB of text to Manifests (2M after de-dupe)
14 > - 6.4MB+ of text to metadata/md5-cache (I don't have an easy way to
15 > calc deduped amount here)
16 > On the server side:
17 > - The sum total of Go distfiles mirrored on Gentoo mirrors right now
18 > is only 3.4GB.
19 > - less downloads
20 >
21 > Dependency tarballs:
22 > - Right now ~15GiB on each mirror, plus storage of the primary copy
23 > somewhere (dev.g.o right now, but not great)
24 > - Conservatively if the remaining EGO_SUM packages converted to Dep
25 > tarballs, it would need another 8GB each of primary location and
26 > mirrors.
27 > - larger downloads for users who DO want to upgrade a Go package (all
28 > new deps tarball even if only one or two deps changed)
29 > - must be preserved much longer, unless we can introduce a guaranteed
30 > way to regenerate them for any prior ebuild.
31 >
32 > I was trying to introduce a third option, but I haven't had the time to
33 > write an entire GLEP.
34 >
35 > The TL;DR is introducing a 2nd-level Manifest+metadata file, that tries
36 > to move just the metadata out of the tree, in a way that can be
37 > regenerated (specifically, a 1:1 reproducible creation from a given go.sum).
38 > It DOES need to contain slightly more data than the present Manifest,
39 > specifically a full SRC_URI entry for each file (upstream URI plus what
40 > to rename it to on Gentoo side)
41 >
42 > The 2nd-level Manifest would be listed as SRC_URI, and be handled in
43 > src_fetch/src_unpack. Download & verify the extra distfiles, against the
44 > Manifest checksum data (and for Golang against go.sum checksums).
45 >
46 > The Portage mirrordist code needs the most work in this case, as it
47 > would need to fetch the 2nd-level Manifests so it can populate Gentoo
48 > mastermirror with the distfiles mirrored from upstream.
49 >
50 > The storage costs for the proposed idea:
51 > - same 1:1 base distfile storage as EGO_SUM (e.g. upstream distfiles are
52 > mirrored 1:1 content, just different naming)
53 > - Probably 1 Metadata-Manifest file per ebuild $PVR (conceptually it
54 > could be split more or shared between some ebuilds/packages)
55 > - Main tree Manifests: 1 DIST entry per Metadata-Manifest in a given package
56 > - Main tree ebuilds: 1 line for the Metadata-Manifest in the ebuild.
57 > - metadata/md5-cache: 1 src_uri line!
58 > - mirrors: add the Metadata-Manifest
59
60 Without claiming to have fully understood the proposal above: around
61 15 April 2022 I suggested to WilliamH on IRC that perhaps Portage
62 should implement the dirhash approach that Go took to solve the
63 problem of upstream sources when go.sum was invented.
64
65 From hash.go in the Go sources
66 (go/src/cmd/vendor/golang.org/x/mod/sumdb/dirhash/hash.go):
67
68 // Hash1 is "h1:" followed by the base64-encoded SHA-256 hash of a
69 // summary prepared as if by the Unix command:
70 //     find . -type f | sort | sha256sum
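
The scheme in that comment can be sketched in a few lines of Python (the language Portage is written in). This is only an illustration under stated assumptions: as far as I can tell from hash.go, each file contributes one summary line made of the hex SHA-256 of its contents, two spaces, the file name and a newline, with the lines ordered by file name; the summary is then hashed again and base64-encoded. The name hash1 and the dict-based input are hypothetical choices for this sketch, not Portage or Go API.

```python
import base64
import hashlib


def hash1(files):
    """Return an "h1:" style directory hash.

    `files` maps a file name (str) to its contents (bytes). Only names
    and contents enter the hash; timestamps, owners and permissions are
    never looked at.
    """
    summary = hashlib.sha256()
    for name in sorted(files):
        if "\n" in name:
            raise ValueError("file names with newlines are not supported")
        digest = hashlib.sha256(files[name]).hexdigest()
        # One summary line per file: "<hex sha256>  <name>\n"
        summary.update(f"{digest}  {name}\n".encode("utf-8"))
    return "h1:" + base64.b64encode(summary.digest()).decode("ascii")
```

Because the input is just names and contents, the same function could in principle be fed from an unpacked directory, a tarball or a zip file and would give the same result for the same tree.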
71
72 Loosely speaking, the "manifest" could publish this dirhash of the
73 contents of go-mod/cache (which would have been bundled in the
73 -deps.tar.xz).
74
75 The immediate motivation was to avoid the network when I already had the
76 sources locally: instead of downloading a -deps.tar.xz I could create it
77 locally and drop it into DISTDIR. Portage would check the (hypothetically)
78 published dirhash and let it through. Differences in timestamps and uid
79 between my tarball and the upstream tarball wouldn't upset it.
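
A small Python sketch of that property, under the assumption that the published value is an h1-style hash computed over the regular files inside the tarball. The helpers tar_dirhash and make_tar are hypothetical names for this illustration and do not exist in Portage.

```python
import base64
import hashlib
import io
import tarfile


def tar_dirhash(tar_bytes):
    """h1-style hash over the regular files in a tar archive.

    Only member names and contents are hashed, so archives that differ
    in mtime, uid or compression yield the same value.
    """
    lines = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            lines.append((member.name, hashlib.sha256(data).hexdigest()))
    summary = hashlib.sha256()
    for name, digest in sorted(lines):
        summary.update(f"{digest}  {name}\n".encode("utf-8"))
    return "h1:" + base64.b64encode(summary.digest()).decode("ascii")


def make_tar(files, mtime, uid):
    """Build an in-memory tar with fixed metadata (demo helper only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            info.uid = uid
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

Two archives built from the same files with make_tar(files, mtime=0, uid=0) and make_tar(files, mtime=1650000000, uid=1000) differ byte for byte, so a conventional Manifest checksum would reject one of them, while tar_dirhash returns the same string for both.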
80
81 One unchecked assumption is that go-mod/cache can be recreated by
82 unpacking sources. If so, then with the notion of a "second level
83 manifest" (the equivalent of go.sum) the contents can be assembled
84 without having to store or download the actual -deps tarball.
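
To illustrate what "the equivalent of go.sum" might carry, here is a hedged Python sketch that turns go.sum text into a deterministic list of entries (upstream URI, local distfile name, h1 hash). It assumes the usual go.sum line format ("module version h1:..." and "module version/go.mod h1:..."), the Go module proxy layout with case-escaped paths, and a distfile name formed by replacing "/" with "%2F", which is my understanding of what go-module.eclass does for EGO_SUM. The function names and the entry layout are hypothetical; this is not the format of the proposed Metadata-Manifest.

```python
def escape(path):
    """Go proxy case-encoding: ASCII uppercase X becomes "!x"."""
    return "".join("!" + c.lower() if "A" <= c <= "Z" else c for c in path)


def gosum_to_entries(gosum_text, proxy="https://proxy.golang.org"):
    """Map go.sum lines to sorted (uri, distfile, h1) entries.

    The output depends only on the go.sum contents, not on line order,
    so it can be regenerated 1:1 from the same go.sum.
    """
    entries = []
    for line in gosum_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        module, version, h1 = fields
        ext = "zip"
        if version.endswith("/go.mod"):
            version = version[: -len("/go.mod")]
            ext = "mod"
        rel = f"{escape(module)}/@v/{escape(version)}.{ext}"
        entries.append({
            "uri": f"{proxy}/{rel}",
            # Assumed renaming rule, flattening the path into one name.
            "distfile": rel.replace("/", "%2F"),
            "h1": h1,
        })
    return sorted(entries, key=lambda e: e["distfile"])
```

Each entry then holds what the quoted proposal asks for: where to fetch the file upstream, what to call it on the Gentoo side, and a checksum to verify it against.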
85
86 I didn't get very far in convincing WilliamH of my need, so I dropped
87 the idea. (I'm not sure I'm being any clearer here; if I'm missing
88 something, do let me know.)