On Sun, 28.01.2018 at 21:43 +0100, Andrew Barchuk wrote:
> [my apologies for posting the message to a wrong thread before]
>
> Hi everyone,
>
> > three possible solutions for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash.
> >
> > The significant advantage of the filename option was simplicity. With
> > that solution, the users could easily determine the correct subdirectory
> > themselves. However, its significant disadvantage was very uneven
> > shuffling of data. In particular, the TeΧ Live packages alone count
> > almost 23500 distfiles and all use a common prefix, making it impossible
> > to split them further.
> >
> > The alternate option of using file hash has the advantage of having
> > a more balanced split.
>
>
> There's another option to use character ranges for each directory
> computed in a way to have the files distributed evenly. One way to do
> that is to use filename prefix of dynamic length so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory will
> have a name of the first file located inside. This way files will be
> distributed evenly and it's still easy to pick a correct directory where
> a file will be located manually.

What you're describing is pretty much an adaptive algorithm. It may
look like a good idea at first, but it's really hard to tell how well
it will keep working because we can't predict what will happen to
distfiles in the future.

A few major events that could throw it completely off balance:

a. we stop using split texlive packages and distribute a few big
tarballs instead,

b. texlive packages are renamed to use date before subpackage name,

c. someone adds another big package set.

That said, you don't need a big event for that. Many small events may
(or may not) cause it to drift off gradually. Whenever that happens, we
would need a contingency plan -- and I don't really like the idea of
having to reshuffle all the mirrors all of a sudden.
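
To make the reshuffling concern a bit more concrete, here is a rough
sketch of the dictionary-style variant (each directory named after the
first file it holds). The file names, the number of directories and the
helper names pick_bounds/find_dir/count_moved are all made up for
illustration; this is not a proposal for actual tooling.

```python
import bisect


def pick_bounds(names, num_dirs):
    """Pick directory names so each holds roughly the same number of files.

    Each directory is named after the first file it contains.
    """
    ordered = sorted(names)
    step = -(-len(ordered) // num_dirs)  # ceiling division
    return ordered[::step]


def find_dir(bounds, name):
    """Return the directory for `name`: the last bound sorting <= name."""
    idx = bisect.bisect_right(bounds, name) - 1
    return bounds[max(idx, 0)]


def count_moved(names, old_bounds, new_bounds):
    """Count files whose directory differs between two sets of bounds."""
    return sum(find_dir(old_bounds, n) != find_dir(new_bounds, n) for n in names)


# Hypothetical data set: 400 unrelated files plus 100 from one package set.
before = [f"pkg-{i:04d}.tar.gz" for i in range(400)]
before += [f"texlive-module-{i:04d}.tar.xz" for i in range(100)]
old_bounds = pick_bounds(before, 10)

# Event (c): someone adds another big package set, bounds get recomputed.
after = before + [f"newset-{i:04d}.tar.gz" for i in range(500)]
new_bounds = pick_bounds(after, 10)

moved = count_moved(before, old_bounds, new_bounds)
print(f"{moved} of {len(before)} existing files change directory")
```

Every file counted here would have to be moved on every mirror once the
bounds are recomputed, and keeping the old bounds instead means living
with an uneven split.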

I think cryptographic hash algorithms are a better choice. They may
not be perfect but they are designed to cope with a lot of very
different data. Yes, we could technically hit a data set that ends up
split completely unevenly by accident. But that is rather unlikely
compared to home-made algorithms.
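
As a rough sketch of the filename hash variant (option c. above): the
choice of SHA-256, the two-character prefix and the generated file names
below are assumptions made purely for illustration, nothing has been
decided on any of them.

```python
import hashlib
from collections import Counter


def hashed_dir(filename, hex_chars=2):
    """Return the subdirectory for `filename`: leading hex digits of its hash.

    SHA-256 and a two-character prefix are assumptions for illustration.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return digest[:hex_chars]


# Hypothetical names sharing one long prefix, similar in number to the
# TeX Live distfiles mentioned above.
names = [f"texlive-module-{i:05d}.tar.xz" for i in range(23500)]
per_dir = Counter(hashed_dir(n) for n in names)

print(f"directories used: {len(per_dir)} of 256")
print(f"fewest files: {min(per_dir.values())}, most: {max(per_dir.values())}")
```

The common prefix does not matter here because the whole filename goes
into the hash, and adding or renaming packages never changes where the
other files live.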

-- 
Best regards,
Michał Górny