Hi everyone,

> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity. With
> that solution, the users could easily determine the correct subdirectory
> themselves. However, it's significant disadvantage was very uneven
> shuffling of data. In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
>
> The alternate option of using file hash has the advantage of having
> a more balanced split.

Another option is to use character ranges for the directories, computed
so that the files are distributed evenly. One way to do that is to use
a filename prefix of dynamic length so that each range holds the same
number of files. E.g. we would have Ab/, Ap/, Ar/ but
texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
but simpler option is to use file names as range bounds (the same way
dictionaries use words to mark page bounds): each directory is named
after the first file located inside it. This way the files are
distributed evenly, and it is still easy to pick the correct directory
for a file by hand.

I have implemented a sketch of distfile splitting that uses file names
as bounds, in Python, to demonstrate the idea (excuse the possibly
non-idiomatic code, I'm not very well versed in Python):

$ cat distfile-dirs.py
#!/usr/bin/env python3

"""
Builds the list of directories to split the input files into evenly.
Each directory is named after the first file located in it. Takes the
number of directories as an argument and reads the sorted list of files
from stdin. The resulting list of directories is printed to stdout.
"""

import sys

dir_num = int(sys.argv[1])
distfiles = sys.stdin.read().splitlines()
distfile_num = len(distfiles)
if distfile_num == 0:
    sys.exit("no file names given on stdin")
# "0" as the first directory allows adding files at the beginning
# without repartitioning
dirs = ["0"]
for i in range(1, dir_num):
    # compute each bound from i instead of accumulating floats, and
    # clamp it so that rounding can never index past the last file
    bound = min(round(i * distfile_num / dir_num), distfile_num - 1)
    dirs.append(distfiles[bound])
print("/\n".join(dirs) + "/")

$ cat pick-distfiles-dir.py
#!/usr/bin/env python3

"""
Picks the directory for a given file name. Takes a distfile name as an
argument. Reads the sorted list of directories from stdin; the name of
each directory is assumed to be the name of the first file located
inside.
"""

import bisect
import sys

distfile = sys.argv[1]
# strip the trailing "/" so that directory names compare as file names
dirs = [line.rstrip("/") for line in sys.stdin.read().splitlines()]
# binary search for the last directory whose name is <= distfile; a name
# that sorts before every directory falls into the first one
index = max(bisect.bisect_right(dirs, distfile) - 1, 0)
print(dirs[index] + "/")

$ # distfiles.txt contains all the distfile names
$ head -n5 distfiles.txt
0CD9CDDE3F56BB5250D87C54592F04CBC24F03BF-wagon-provider-api-2.10.jar
0CE1EDB914C94EBC388F086C6827E8BDEEC71AC2-commons-lang-2.6.jar
0DCC973606CBD9737541AA5F3E76DED6E3F4D0D0-iri.jar
0ad-0.0.22-alpha-unix-build.tar.xz
0ad-0.0.22-alpha-unix-data.tar.xz

$ # calculate 500 directories to split distfiles into evenly
$ cat distfiles.txt | ./distfile-dirs.py 500 > dirs.txt
$ tail -n5 dirs.txt
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

$ # pick a directory for xvinfo-1.0.1.tar.bz2
$ cat dirs.txt | ./pick-distfiles-dir.py xvinfo-1.0.1.tar.bz2
xview-3.2p1.4-18c.tar.gz/

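One way to check how even such a split is would be to count the files
that land in each directory. The following is only a rough sketch:
files_per_dir is a hypothetical helper (not part of the two scripts),
and it assumes the directory list is sorted in the same plain string
order that Python uses for comparisons.

```python
import bisect


def files_per_dir(dirs, distfiles):
    """Count how many of the given file names fall into each directory.

    dirs is the sorted list of directory names (a trailing "/" is
    ignored); each directory holds the names from its own name up to,
    but not including, the name of the next directory.
    """
    bounds = [d.rstrip("/") for d in dirs]
    counts = {bound: 0 for bound in bounds}
    for name in distfiles:
        # last bound that is <= name; names sorting before every bound
        # are counted in the first directory
        index = max(bisect.bisect_right(bounds, name) - 1, 0)
        counts[bounds[index]] += 1
    return counts
```

Comparing min() and max() of the returned counts gives a quick idea of
how balanced the directories are for a given list of distfiles.
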
With the approach above the files are distributed evenly among the
directories, while it remains possible to determine the directory for a
specific file by hand. If necessary, the directory structure can be
kept unchanged for a very long time, and it will likely stay
well-balanced. Picking a directory for a file is very cheap (a binary
search over the list of directories). The only obvious downside I see
is that the list of directories must be known to pick the correct one
(this can be mitigated by caching the list if it matters). If it's
desirable to make the directory names shorter, or to make them look
less like file names, that is fairly easy to achieve by keeping only
the unique prefixes of the directory names. For example:

xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

will become

xr/
xv/
ya/
yu/
z/

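The shortening could be sketched roughly as follows. This is only an
illustration: unique_prefixes is a hypothetical helper, and it assumes
the directory names are sorted, distinct and given without the trailing
"/".

```python
import os.path


def unique_prefixes(names):
    """Shorten each name in a sorted list of distinct names to its
    shortest prefix that no other name in the list starts with."""
    prefixes = []
    for i, name in enumerate(names):
        # in a sorted list, the longest prefix shared with any other
        # name is always shared with one of the two neighbours
        neighbours = names[max(i - 1, 0):i] + names[i + 1:i + 2]
        shared = max(
            (len(os.path.commonprefix([name, other])) for other in neighbours),
            default=0,
        )
        # keep one character past the shared part; a name that is itself
        # a prefix of a neighbour is kept whole
        prefixes.append(name[:shared + 1])
    return prefixes
```

Note that shortening moves each bound slightly earlier (xv sorts before
xview-3.2p1.4-18c.tar.gz), so files would have to be placed using the
shortened list rather than the original one.
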
Thanks for taking the time to consider the suggestion.

---
Andrew