Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
Date: Mon, 29 Jan 2018 05:37:04
Message-Id: 1517204212.867.5.camel@gentoo.org
In Reply to: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure by Andrew Barchuk
1 On Sun, 28 Jan 2018 at 21:43 +0100, Andrew Barchuk
2 wrote:
3 > [my apologies for posting the message to a wrong thread before]
4 >
5 > Hi everyone,
6 >
7 > > three possible solutions for splitting distfiles were listed:
8 > >
9 > > a. using initial portion of filename,
10 > >
11 > > b. using initial portion of file hash,
12 > >
13 > > c. using initial portion of filename hash.
14 > >
15 > > The significant advantage of the filename option was simplicity. With
16 > > that solution, the users could easily determine the correct subdirectory
17 > > themselves. However, its significant disadvantage was very uneven
18 > > shuffling of data. In particular, the TeΧ Live packages alone count
19 > > almost 23500 distfiles and all use a common prefix, making it impossible
20 > > to split them further.
21 > >
22 > > The alternate option of using file hash has the advantage of having
23 > > a more balanced split.
24 >
25 >
26 > There's another option to use character ranges for each directory
27 > computed in a way to have the files distributed evenly. One way to do
28 > that is to use filename prefix of dynamic length so that each range
29 > holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
30 > texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
31 > but simpler option is to use file names as range bounds (the same way
32 > dictionaries use words to demarcate page bounds): each directory will
33 > have a name of the first file located inside. This way files will be
34 > distributed evenly and it's still easy to pick a correct directory where
35 > a file will be located manually.
36
37 What you're talking about is pretty much an adaptive algorithm. It may
38 look like a good idea at first, but it's really hard to tell how it'll
39 work in the long run because you can't predict what will happen to
40 distfiles in the future.
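
To make the scheme under discussion concrete, here is a minimal sketch of the dictionary-style variant, where each directory is named after the first file it holds. The helper names (build_bounds, directory_for) and the sample filenames are hypothetical and only for illustration; nothing here is an existing mirror tool.

```python
import bisect

def build_bounds(filenames, per_dir):
    # Hypothetical helper: every per_dir-th name in sorted order
    # becomes a directory name (the lower bound of its range).
    return sorted(filenames)[::per_dir]

def directory_for(filename, bounds):
    # Pick the last bound that sorts at or before the filename.
    # Names sorting before the first bound fall into the first directory.
    idx = bisect.bisect_right(bounds, filename) - 1
    return bounds[max(idx, 0)]

# Made-up sample data, not real distfile names.
files = [
    "Abc-1.0.tar.gz",
    "Apple-2.1.tar.gz",
    "Archive-0.3.zip",
    "texlive-module-tex-2017.tar.xz",
    "texlive-module-thumb-2017.tar.xz",
    "texlive-module-tikz-2017.tar.xz",
]
bounds = build_bounds(files, per_dir=2)
```

Note that the bounds are derived from the current set of files, which is exactly what makes the layout depend on data that can change later.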
41
42 A few major events that could result in it going completely off:
43
44 a. we stop using split texlive packages and distribute a few big
45 tarballs instead,
46
47 b. texlive packages are renamed to use date before subpackage name,
48
49 c. someone adds another big package set.
50
51 That said, you don't need a big event for that. Many small events may
52 (or may not) cause it to gradually go off. Whenever that happens, we
53 would have to have a contingency plan -- and I don't really like
54 the idea of having to reshuffle all the mirrors all of a sudden.
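
As a rough illustration of event (c), the sketch below fixes range bounds for an evenly split set of files and then adds a new package set sharing one prefix. All filenames and the bucket_sizes helper are made up for the example; the point is only that bounds chosen for yesterday's data stay where they are unless the mirrors are reshuffled.

```python
import bisect
from collections import Counter

def bucket_sizes(filenames, bounds):
    # Count how many files land in each range, with bounds kept fixed.
    counts = Counter()
    for name in filenames:
        idx = max(bisect.bisect_right(bounds, name) - 1, 0)
        counts[bounds[idx]] += 1
    return counts

# Hypothetical initial state: 100 files split into 4 ranges of 25.
initial = [f"pkg{i:03d}.tar.gz" for i in range(100)]
bounds = sorted(initial)[::25]
before = bucket_sizes(initial, bounds)

# Hypothetical event: a new set of 200 files with a common prefix.
added = [f"pkg010-module-{i:03d}.tar.gz" for i in range(200)]
after = bucket_sizes(initial + added, bounds)

print(sorted(before.values()))  # all ranges equal
print(sorted(after.values()))   # one range holds every new file
```

With these inputs the first range grows from 25 to 225 files while the other three stay at 25, so the split is only even again after the bounds are recomputed and the files moved.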
55
56 I think the cryptographic hash algorithms are a better choice. They may
57 not be perfect but they can cope with a lot of very different data
58 by design. Yes, we could technically accidentally hit a data set that is
59 completely uneven. But it is rather unlikely, compared to home-made
60 algorithms.
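
For comparison, a minimal sketch of a hash-based split, assuming option (c) from the quoted list (a hash of the filename) and assuming SHA-256 with a two-hex-digit prefix. Both choices are placeholders for the example, not something this thread has settled on, and the filenames are invented.

```python
import hashlib
from collections import Counter

def hashed_dir(filename, digits=2):
    # Directory name = first `digits` hex characters of the filename hash.
    # SHA-256 and digits=2 are assumptions made for this sketch.
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return digest[:digits]

# Made-up names that all share one long prefix, roughly the size of
# the TeX Live set mentioned in the quoted text.
names = [f"texlive-module-{i:05d}.tar.xz" for i in range(23500)]
counts = Counter(hashed_dir(name) for name in names)

print(len(counts), min(counts.values()), max(counts.values()))
```

The directory for a file depends only on its own name, so adding or removing other files never moves it, and a shared prefix does not concentrate files in one directory.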
61
62 --
63 Best regards,
64 Michał Górny
