[my apologies for posting this message to the wrong thread earlier]

Hi everyone,

> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity. With
> that solution, the users could easily determine the correct subdirectory
> themselves. However, its significant disadvantage was very uneven
> shuffling of data. In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
>
> The alternate option of using file hash has the advantage of having
> a more balanced split.

Another option is to give each directory a character range, with the
ranges chosen so that files are distributed evenly. One way to do that
is to use a filename prefix of dynamic length so that each range holds
the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
but simpler option is to use file names as range bounds (the same way
dictionaries use words to mark page bounds): each directory is named
after the first file located inside it. This way files are distributed
evenly, and it is still easy to work out by hand which directory a
given file belongs in.

I have implemented a sketch of distfile splitting that uses file names
as bounds, in Python, to demonstrate the idea (excuse the possibly
non-idiomatic code, I'm not very well versed in Python):

$ cat distfile-dirs.py
#!/usr/bin/env python3

"""
Builds a list of dictionary-style directories that split the input files
evenly. Each directory is named after the first file located in it.
Takes the number of directories as an argument and reads the list of
files from stdin. The resulting list of directories is printed to
stdout.
"""

import sys

dir_num = int(sys.argv[1])
# sort by code point so the order matches the comparisons used for lookup
distfiles = sorted(sys.stdin.read().splitlines())
distfile_num = len(distfiles)
dir_size = distfile_num / dir_num
# "0" as the first bound allows adding files at the beginning without
# repartitioning
dirs = ["0"]
for i in range(1, dir_num):
    # multiply instead of accumulating to avoid floating-point drift, and
    # clamp so the index can never run past the end of the list
    dirs.append(distfiles[min(round(i * dir_size), distfile_num - 1)])
print("/\n".join(dirs) + "/")

$ cat pick-distfiles-dir.py
#!/usr/bin/env python3

"""
Picks the directory for a given file name. Takes a distfile name as an
argument. Reads a sorted list of directories from stdin; the name of
each directory is assumed to be the name of the first file located
inside it.
"""

import bisect
import sys

distfile = sys.argv[1]
# drop the trailing "/" so that a file named exactly like a directory
# compares equal to it and lands inside that directory
dirs = [line.rstrip("/") for line in sys.stdin.read().splitlines()]
# index of the last directory whose name sorts at or before distfile;
# names sorting before the first directory fall back to it
index = max(bisect.bisect_right(dirs, distfile) - 1, 0)
print(dirs[index] + "/")

$ # distfiles.txt contains all the distfile names
$ head -n5 distfiles.txt
0CD9CDDE3F56BB5250D87C54592F04CBC24F03BF-wagon-provider-api-2.10.jar
0CE1EDB914C94EBC388F086C6827E8BDEEC71AC2-commons-lang-2.6.jar
0DCC973606CBD9737541AA5F3E76DED6E3F4D0D0-iri.jar
0ad-0.0.22-alpha-unix-build.tar.xz
0ad-0.0.22-alpha-unix-data.tar.xz

$ # calculate 500 directories to split distfiles into evenly
$ cat distfiles.txt | ./distfile-dirs.py 500 > dirs.txt
$ tail -n5 dirs.txt
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

$ # pick a directory for xvinfo-1.0.1.tar.bz2
$ cat dirs.txt | ./pick-distfiles-dir.py xvinfo-1.0.1.tar.bz2
xview-3.2p1.4-18c.tar.gz/
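
To sanity-check how evenly a given list of directories splits the files,
something like the following could be used. It is only a sketch:
pick_dir and files_per_dir are hypothetical helper names, the directory
names are assumed to be sorted by code point with the trailing "/"
already removed, and the sample data is made up for illustration.

```python
import bisect
from collections import Counter


def pick_dir(bounds, distfile):
    """Index of the last bound sorting at or before distfile (0 if none)."""
    return max(bisect.bisect_right(bounds, distfile) - 1, 0)


def files_per_dir(bounds, distfiles):
    """Number of files that land in each directory, in bound order."""
    counts = Counter(pick_dir(bounds, name) for name in distfiles)
    return [counts.get(i, 0) for i in range(len(bounds))]


# made-up sample data: three directories, six files
bounds = ["0", "d", "m"]
files = ["a", "b", "d", "e", "x", "z"]
print(files_per_dir(bounds, files))  # prints [2, 2, 2]
```

Comparing the smallest and largest counts gives a rough idea of how
balanced the split is.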

With the approach above, files are distributed evenly among the
directories while it remains possible to determine the directory for a
specific file by hand. If necessary, the directory structure can be
kept unchanged for a very long time, and it will likely stay
well balanced. Picking a directory for a file is very cheap (a binary
search over the directory names). The only obvious downside I see is
that the list of directories must be known in order to pick the correct
one (this can be mitigated by caching the list if it matters). If it's
desirable to make directory names shorter, or to look less like file
names, that is fairly easy to achieve by keeping only the unique
prefixes of the directory names. For example:

xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

will become

xr/
xv/
ya/
yu/
z/
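
The shortening above could be sketched roughly as follows. This is one
possible reading of "unique prefixes", not a worked-out design: each
name is cut to the shortest prefix that differs from both of its
neighbours. The function name shorten is hypothetical, and the input is
assumed to be sorted, free of duplicates, and without the trailing "/".

```python
from os.path import commonprefix


def shorten(names):
    """Cut each name to the shortest prefix differing from both neighbours."""
    short = []
    for i, name in enumerate(names):
        keep = 1
        if i > 0:
            # one character more than what is shared with the previous name
            keep = max(keep, len(commonprefix([names[i - 1], name])) + 1)
        if i + 1 < len(names):
            # one character more than what is shared with the next name
            keep = max(keep, len(commonprefix([name, names[i + 1]])) + 1)
        short.append(name[:keep])
    return short


names = [
    "xrmap-2.29.tar.bz2",
    "xview-3.2p1.4-18c.tar.gz",
    "yasat-700.tar.gz",
    "yubikey-manager-qt-0.4.0.tar.gz",
    "zimg-2.5.1.tar.gz",
]
print("/\n".join(shorten(names)) + "/")
```

The shortened names keep the same sort order, so picking a directory
works as before. Note that a file sorting between a shortened name and
the original bound (e.g. between xv and xview-3.2p1.4-18c.tar.gz) would
move to the later directory, so the split may shift slightly.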

Thanks for taking the time to consider the suggestion.

---
Andrew