Gentoo Archives: gentoo-dev

From:	Andrew Barchuk <andrew@×××××××.io>
To:	gentoo-dev@l.g.o
Subject:	Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4)
Date:	Sun, 28 Jan 2018 20:30:38
Message-Id:	`1517171431.2109764.1251018832.6F16557B@webmail.messagingengine.com`
In Reply to:	Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4) by "Michał Górny"

Hi everyone,

> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity. With
> that solution, the users could easily determine the correct subdirectory
> themselves. However, its significant disadvantage was very uneven
> shuffling of data. In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
>
> The alternate option of using file hash has the advantage of having
> a more balanced split.

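As a rough illustration of options (a) and (c) quoted above, here is a
minimal sketch. The function names, the two-character directory length,
the choice of SHA-256 and the sample file names are all assumptions made
for this example; the quoted proposal does not specify any of them.

```python
import hashlib
from collections import Counter


def dir_by_name_prefix(filename, length=2):
    """Option (a): subdirectory = first `length` characters of the name."""
    return filename[:length]


def dir_by_name_hash(filename, length=2):
    """Option (c): subdirectory = first `length` hex digits of a hash of
    the name. SHA-256 is an arbitrary choice made for this sketch."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical file names sharing one long prefix, like the TeX Live modules.
names = ["texlive-module-%03d.tar.xz" % i for i in range(100)]

by_prefix = Counter(dir_by_name_prefix(n) for n in names)
by_hash = Counter(dir_by_name_hash(n) for n in names)

print(len(by_prefix))  # number of directories used by option (a)
print(len(by_hash))    # number of directories used by option (c)
```

With a shared prefix, option (a) puts every name into the single
directory te/, while the hash spreads the same names over many
directories, at the cost of not being computable by eye.
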
There's another option: use character ranges for each directory,
computed so that the files are distributed evenly. One way to do that is
to use a filename prefix of dynamic length so that each range holds the
same number of files. E.g. we would have Ab/, Ap/, Ar/ but
texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
but simpler option is to use file names as range bounds (the same way
dictionaries use words to demarcate page bounds): each directory is
named after the first file located inside it. This way the files are
distributed evenly, and it's still easy to pick the correct directory
for a file by hand.

I have implemented a sketch of distfiles splitting that uses file names
as bounds in Python to demonstrate the idea (excuse possibly
non-idiomatic code, I'm not very versed in Python):

$ cat distfile-dirs.py
#!/usr/bin/env python3

"""
Builds a list of dictionary-style directories to split the input files
into evenly. Each directory is named after the first file located in it.
Takes the number of directories as an argument and reads the sorted list
of files from stdin. The resulting list of directories is printed to
stdout.
"""

import sys

dir_num = int(sys.argv[1])
# the input is expected to be sorted already (plain byte order)
distfiles = sys.stdin.read().splitlines()
distfile_num = len(distfiles)
if not 0 < dir_num <= distfile_num:
    sys.exit("directory count must be between 1 and the number of files")
# "0" as the first bound allows adding files in the beginning without
# repartitioning
dirs = ["0"]
# compute each bound from its index instead of accumulating a float, so
# rounding errors cannot push the index past the end of the list
for i in range(1, dir_num):
    dirs.append(distfiles[round(i * distfile_num / dir_num)])
print("/\n".join(dirs) + "/")

$ cat pick-distfiles-dir.py
#!/usr/bin/env python3

"""
Picks the directory for a given file name. Takes a distfile name as an
argument. Reads the sorted list of directories from stdin; the name of
each directory is assumed to be the name of the first file located
inside.
"""

import sys

distfile = sys.argv[1]
# strip the trailing "/" so that bounds compare as plain file names
dirs = [line.rstrip("/") for line in sys.stdin.read().splitlines()]
left = 0
right = len(dirs) - 1
# binary search for the last directory whose name is <= distfile
while left < right:
    pivot = (left + right + 1) // 2
    if dirs[pivot] <= distfile:
        left = pivot
    else:
        right = pivot - 1

# a name that sorts before every bound falls into the first directory
print(dirs[left] + "/")

$ # distfiles.txt contains all the distfile names
$ head -n5 distfiles.txt
0CD9CDDE3F56BB5250D87C54592F04CBC24F03BF-wagon-provider-api-2.10.jar
0CE1EDB914C94EBC388F086C6827E8BDEEC71AC2-commons-lang-2.6.jar
0DCC973606CBD9737541AA5F3E76DED6E3F4D0D0-iri.jar
0ad-0.0.22-alpha-unix-build.tar.xz
0ad-0.0.22-alpha-unix-data.tar.xz

$ # calculate 500 directories to split distfiles into evenly
$ cat distfiles.txt | ./distfile-dirs.py 500 > dirs.txt
$ tail -n5 dirs.txt
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

$ # pick a directory for xvinfo-1.0.1.tar.bz2
$ cat dirs.txt | ./pick-distfiles-dir.py xvinfo-1.0.1.tar.bz2
xview-3.2p1.4-18c.tar.gz/

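The same lookup can also be sketched with Python's standard bisect
module instead of a hand-written binary search. The function name
pick_dir and the shortened list of bounds are assumptions made for this
example; the bounds are taken from the tail of dirs.txt shown above,
without the trailing slashes.

```python
import bisect


def pick_dir(dirs, distfile):
    """Return the directory for distfile.

    dirs is a sorted list of directory names (no trailing "/"), each
    being the name of the first file stored in that directory.
    """
    # bisect_right counts the bounds that are <= distfile; the last of
    # them is the directory we want. Names that sort before every bound
    # are clamped into the first directory.
    index = bisect.bisect_right(dirs, distfile) - 1
    return dirs[max(index, 0)]


bounds = [
    "0",
    "xrmap-2.29.tar.bz2",
    "xview-3.2p1.4-18c.tar.gz",
    "yasat-700.tar.gz",
    "yubikey-manager-qt-0.4.0.tar.gz",
    "zimg-2.5.1.tar.gz",
]

print(pick_dir(bounds, "xvinfo-1.0.1.tar.bz2") + "/")
```

A file whose name equals a bound lands in the directory named after it,
which matches the rule that each directory is named after its first
file.
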
With the approach above the files will be distributed evenly among the
directories, while it remains possible to determine the directory for a
specific file by hand. If necessary, the directory structure can be kept
unchanged for a very long time, and it will likely stay well-balanced.
Picking a directory for a file is very cheap. The only obvious downside
I see is that it's necessary to know the list of directories to pick the
correct one (this can be mitigated by caching the list of directories if
important). If it's desirable to make directory names shorter or to look
less like file names, that is fairly easy to achieve by keeping only the
unique prefixes of the directory names. For example:

xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

will become

xr/
xv/
ya/
yu/
z/

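One possible way to compute such shortened names is sketched below. The
function name shortest_unique_prefixes is an assumption made for this
example. It relies on the fact that, in a sorted list, the longest
prefix a name shares with any other name is shared with one of its two
neighbours.

```python
import os.path


def shortest_unique_prefixes(names):
    """Shorten each name to the shortest prefix no other name starts with."""
    names = sorted(names)
    result = []
    for i, name in enumerate(names):
        shared = 0
        # length of the common prefix with the previous name, if any
        if i > 0:
            shared = max(shared, len(os.path.commonprefix([names[i - 1], name])))
        # length of the common prefix with the next name, if any
        if i + 1 < len(names):
            shared = max(shared, len(os.path.commonprefix([name, names[i + 1]])))
        # one character more than the longest shared prefix is enough;
        # slicing caps the result at the full name
        result.append(name[:shared + 1])
    return result


dirs = [
    "xrmap-2.29.tar.bz2",
    "xview-3.2p1.4-18c.tar.gz",
    "yasat-700.tar.gz",
    "yubikey-manager-qt-0.4.0.tar.gz",
    "zimg-2.5.1.tar.gz",
]

for prefix in shortest_unique_prefixes(dirs):
    print(prefix + "/")
```

If one name is a complete prefix of another, it is kept in full and
stays ambiguous. Shortening also moves the range bounds slightly (a
file starting with "xr" but sorting before "xrmap" now belongs to xr/),
so the directories should be shortened before files are placed in them.
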
Thanks for taking time to consider the suggestion.

---
Andrew

Replies

Subject	Author
Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4)	"Robin H. Johnson" <robbat2@g.o>
