Gentoo Archives: gentoo-dev

From: "Robin H. Johnson" <robbat2@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4)
Date: Mon, 29 Jan 2018 07:21:24
Message-Id: robbat2-20180129T070612-080997113Z@orbis-terrarum.net
In Reply to: Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4) by Andrew Barchuk
1 On Sun, Jan 28, 2018 at 09:30:31PM +0100, Andrew Barchuk wrote:
2 > Hi everyone,
3 >
4 > > three possible solutions for splitting distfiles were listed:
5 > There's another option to use character ranges for each directory
6 > computed in a way to have the files distributed evenly. One way to do
7 > that is to use filename prefix of dynamic length so that each range
8 > holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
9 > texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
10 > but simpler option is to use file names as range bounds (the same way
11 > dictionaries use words to demarcate page bounds): each directory will
12 > have a name of the first file located inside. This way files will be
13 > distributed evenly and it's still easy to pick a correct directory where
14 > a file will be located manually.
15 This was discussed early on, but thank you for the reminder, as it got
16 dropped from later discussions.
17
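For readers following along, the dictionary-style lookup can be sketched roughly as below. This is a minimal illustration, not the code snipped from the original message: `pick_directory`, the `bounds` list and the file names in it are all hypothetical, and the sketch assumes each directory is named after the first file it holds.

```python
import bisect

def pick_directory(bounds, filename):
    """Return the directory that should hold `filename`.

    `bounds` is the sorted list of directory names, where each directory
    is named after the first file stored in it (an assumption of this
    sketch). Names sorting before the first bound go into the first
    directory.
    """
    index = bisect.bisect_right(bounds, filename) - 1
    return bounds[max(index, 0)]

# Hypothetical directory list; real names would come from the mirror layout.
bounds = ["a2ps-4.14.tar.gz", "gcc-7.2.0.tar.xz", "portage-2.3.19.tar.bz2"]
directory = pick_directory(bounds, "git-2.16.1.tar.xz")
```

Note that the lookup takes `bounds` as an argument: the directory cannot be computed from the file name alone.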
18 > [snip code]
19 > Using the approach above the files will be distributed evenly among the
20 > directories keeping the possibility to determine the directory for a
21 > specific file by hand. It's possible if necessary to keep the directory
22 > structure unchanged for very long time and it will likely stay
23 > well-balanced. Picking a directory for a file is very cheap. The only
24 > obvious downside I see is that it's necessary to know list of
25 > directories to pick the correct one (can be mitigated by caching the
26 > list of directories if important). If it's desirable to make directory
27 > names shorter or to look less like file names it's fairly easy to
28 > achieve by keeping only unique prefixes of directories. For example:
29 As for the problem you describe, one of the requirements in the
30 discussion is that given ONLY the file or filename, and NOTHING ELSE, it
31 should be possible to determine where in a hierarchy it should go. No
32 prior knowledge about the hierarchy was permitted. Some parties might
33 answer that you just need an index file then, but that means you have to
34 keep the index file in sync often.
35
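To make the requirement concrete, here is a sketch of one kind of mapping that needs nothing but the file name. It is only an illustration: the choice of SHA-256, the two-character prefix and the name `stateless_directory` are assumptions of this sketch, not something decided in this thread.

```python
import hashlib

def stateless_directory(filename, width=2):
    """Derive a directory name from the file name alone.

    Uses the first `width` hex digits of a SHA-256 digest of the name.
    Both the hash and the width are arbitrary choices for illustration.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return digest[:width]

# No directory list or index file is consulted here.
directory = stateless_directory("git-2.16.1.tar.xz")
```

The range-based proposal cannot be computed this way, because it needs the current list of directory names as a second input.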
36 It's a superbly readable result (in the general class of perfect hashes
37 based on lots of well-known input). The class of solution suffers
38 another problem in addition to the one you noted: if input changes
39 sufficiently, then rebalancing is expensive/hard.
40
41 As a concrete example, say we add a new category for something
42 with lots of common prefixes in distfiles.
43 dev-scratch/ as an example, where all distfiles start with 'scratch-'.
44 Unless we know up-front that we're going to add a thousand distfiles
45 here (not unreasonable, dev-python is ~1800 packages), they might start
46 by going into the 'sc' directory, but later we want them to be in
47 'scratch', as the tree is unbalanced otherwise.
48
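The rebalancing cost can be estimated with a toy model like the one below. Everything in it is made up for illustration: the file names, the 100-files-per-directory target, and the helpers `make_bounds`, `assign` and `moved_after_rebalance`. It assumes directories are rebuilt as equal-sized slices of the sorted file list.

```python
import bisect
import string

def make_bounds(files, per_dir):
    """Name each directory after the first file of every `per_dir`-sized slice."""
    ordered = sorted(files)
    return [ordered[i] for i in range(0, len(ordered), per_dir)]

def assign(bounds, name):
    """Find the directory for `name` given the sorted directory names."""
    return bounds[max(bisect.bisect_right(bounds, name) - 1, 0)]

def moved_after_rebalance(old_files, new_files, per_dir):
    """Count existing files whose directory changes once bounds are recomputed."""
    old_bounds = make_bounds(old_files, per_dir)
    new_bounds = make_bounds(list(old_files) + list(new_files), per_dir)
    return sum(
        1 for f in old_files
        if assign(old_bounds, f) != assign(new_bounds, f)
    )

# Toy tree: 40 files per leading letter, then a burst of 'scratch-' distfiles.
old_files = [f"{c}{i:02d}.tar.gz" for c in string.ascii_lowercase for i in range(40)]
new_files = [f"scratch-{i:04d}.tar.gz" for i in range(1050)]
moved = moved_after_rebalance(old_files, new_files, per_dir=100)
```

In this model, files sorting before 'scratch-' keep their directory, while those sorting after it can end up under a different directory name, so mirrors would have to move files that did not change.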
49 --
50 Robin Hugh Johnson
51 Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
52 E-Mail : robbat2@g.o
53 GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
54 GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
