Gentoo Archives: gentoo-dev

From: Andrew Barchuk <andrew@×××××××.io>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
Date: Sun, 28 Jan 2018 20:43:55
Message-Id: 1517172228.2114973.1251027256.0A9C8F3C@webmail.messagingengine.com
In Reply to: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure by "Michał Górny"
[my apologies for posting this message to the wrong thread earlier]

Hi everyone,

> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity. With
> that solution, the users could easily determine the correct subdirectory
> themselves. However, its significant disadvantage was very uneven
> shuffling of data. In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
>
> The alternate option of using file hash has the advantage of having
> a more balanced split.


There is another option: use a character range for each directory,
computed so that the files are distributed evenly. One way to do that
is to use a filename prefix of dynamic length so that each range holds
the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
but simpler option is to use file names as range bounds (the same way
dictionaries use words to demarcate page bounds): each directory is
named after the first file located inside it. This way the files are
distributed evenly, and it is still easy to pick the correct directory
for a file by hand.

I have implemented a sketch of distfile splitting that uses file names
as bounds, in Python, to demonstrate the idea (excuse possibly
non-idiomatic code, I'm not very versed in Python):

$ cat distfile-dirs.py
#!/usr/bin/env python3

"""
Builds a list of dictionary-style directories to split the list of input
files into evenly. Each directory is named after the first file located
in it. Takes the number of directories as an argument and reads the list
of files from stdin. The resulting list of directories is printed to
stdout.
"""

import sys

dir_num = int(sys.argv[1])
# sort by code point so the order matches Python's string comparison
distfiles = sorted(sys.stdin.read().splitlines())
distfile_num = len(distfiles)
if dir_num > distfile_num:
    sys.exit("more directories requested than there are files")
# "0" as the first bound allows adding files in the beginning without
# repartitioning
dirs = ["0"]
# computing each bound from its index avoids accumulating float error,
# which could push the last index past the end of the list
for i in range(1, dir_num):
    dirs.append(distfiles[round(i * distfile_num / dir_num)])
print("/\n".join(dirs) + "/")

$ cat pick-distfiles-dir.py
#!/usr/bin/env python3

"""
Picks the directory for a given file name. Takes a distfile name as an
argument. Reads the sorted list of directories from stdin; each directory
is assumed to be named after the first file located inside it.
"""

import sys
from bisect import bisect_right

distfile = sys.argv[1]
# drop the trailing "/" so that a directory's first file compares equal
# to the directory name and lands inside that directory
dirs = [line.rstrip("/") for line in sys.stdin.read().splitlines()]
# binary search for the last directory whose name is <= distfile; names
# sorting before the first directory fall into the first one
index = max(bisect_right(dirs, distfile) - 1, 0)
print(dirs[index] + "/")

$ # distfiles.txt contains all the distfile names
$ head -n5 distfiles.txt
0CD9CDDE3F56BB5250D87C54592F04CBC24F03BF-wagon-provider-api-2.10.jar
0CE1EDB914C94EBC388F086C6827E8BDEEC71AC2-commons-lang-2.6.jar
0DCC973606CBD9737541AA5F3E76DED6E3F4D0D0-iri.jar
0ad-0.0.22-alpha-unix-build.tar.xz
0ad-0.0.22-alpha-unix-data.tar.xz

$ # calculate 500 directories to split distfiles into evenly
$ cat distfiles.txt | ./distfile-dirs.py 500 > dirs.txt
$ tail -n5 dirs.txt
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz

$ # pick a directory for xvinfo-1.0.1.tar.bz2
$ cat dirs.txt | ./pick-distfiles-dir.py xvinfo-1.0.1.tar.bz2
xview-3.2p1.4-18c.tar.gz/

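How even a given split is can be checked by counting the files that fall
into each directory. The following is only a minimal sketch of such a
check: the function name count_per_dir and the sample data are made up
for illustration and are not part of the scripts above.

```python
from bisect import bisect_right
from collections import Counter


def count_per_dir(dirs, distfiles):
    """Count how many distfiles fall into each directory.

    dirs are directory names as printed by distfile-dirs.py (a trailing
    "/" is ignored); each name is the lower bound of its range.
    """
    bounds = sorted(d.rstrip("/") for d in dirs)
    counts = Counter({bound: 0 for bound in bounds})
    for name in distfiles:
        # last bound that is <= name; names sorting before the first
        # bound are counted in the first directory
        index = max(bisect_right(bounds, name) - 1, 0)
        counts[bounds[index]] += 1
    return counts


# hypothetical sample data, not taken from the real mirror
sample_dirs = ["0/", "c/", "f/"]
sample_files = ["a", "b", "c", "d", "e", "f", "g"]
for directory, count in sorted(count_per_dir(sample_dirs, sample_files).items()):
    print(f"{directory}/: {count}")
```

A well-balanced split should show roughly the same count for every
directory.
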
With the approach above, the files will be distributed evenly among the
directories while keeping it possible to determine the directory for a
specific file by hand. If necessary, the directory structure can be kept
unchanged for a very long time and it will likely stay well-balanced.
Picking a directory for a file is very cheap. The only obvious downside
I see is that it's necessary to know the list of directories to pick the
correct one (this can be mitigated by caching the list of directories if
important). If it's desirable to make directory names shorter or look
less like file names, that's fairly easy to achieve by keeping only the
unique prefixes of the directory names. For example:

xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

will become

xr/
xv/
ya/
yu/
z/

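One possible way to compute such shortened names is sketched below. It
is an illustration only: the helper name shorten_dirs is made up, and it
assumes that all directory names are distinct.

```python
from os.path import commonprefix


def shorten_dirs(dirs):
    """Shorten each directory name to its shortest unique prefix."""
    names = sorted(d.rstrip("/") for d in dirs)
    short = []
    for i, name in enumerate(names):
        # in a sorted list the longest shared prefix is always with a
        # direct neighbour, so only those two need to be checked
        neighbours = names[max(i - 1, 0):i] + names[i + 1:i + 2]
        shared = max((len(commonprefix([name, n])) for n in neighbours), default=0)
        # one character more than the longest shared prefix is unique
        short.append(name[:shared + 1] + "/")
    return short


print("\n".join(shorten_dirs([
    "xrmap-2.29.tar.bz2/",
    "xview-3.2p1.4-18c.tar.gz/",
    "yasat-700.tar.gz/",
    "yubikey-manager-qt-0.4.0.tar.gz/",
    "zimg-2.5.1.tar.gz/",
])))
```

Note that a shortened name is a slightly lower bound than the full file
name: a file that sorts between xv and xview-3.2p1.4-18c.tar.gz would
move from xr/ to xv/, so the shortening is best done before any files
are placed.
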
Thanks for taking the time to consider the suggestion.

---
Andrew
