Hi,

The topic of rebooting gentoostats comes up here from time to time.
Unless I'm mistaken, all the efforts so far have been superficial,
lacking a clear plan and any real research into the problems. I'd like
to start a serious discussion focused on the issues we need to solve,
and propose some ideas for how we could solve them.

I can't promise I'll find the time to implement it. However, I'd like
to have a clear plan for how it should be done, in case someone
actually does it.


The big questions
=================
The way I see it, the primary goal of the project would be to gather
statistics on the popularity of packages, in order to help us
prioritize our attention and make decisions on what to keep and what
to remove. Unlike Debian's popcon, I don't think we really want to
investigate which files are actually used; we should focus on what's
installed.

There are a few important questions that need to be answered first:

1. Which data do we need to collect?

a. list of installed packages?
b. versions (or slots?) of installed packages?
c. USE flags on installed packages?
d. world and world_sets files?
e. system profile?
f. enabled repositories? (possibly filtered to the official list)
g. distribution?

I think d. is the most important, as it tells us what users really
want. a. alone is somewhat redundant if we have d. c. might have some
value when deciding whether to mask a particular flag (and it implies
a.).

e. would be valuable if we wanted to determine the future of
particular profiles, as well as e.g. estimate the transition to new
versions.

f. would be valuable for determining which repositories are used, but
we'd need to filter private repos out of the output for privacy
reasons.

g. could be valuable in correlation with other data, but I'm not sure
there's much direct value in it alone.


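To make the data points concrete, here is a rough Python sketch of what a collection tool might gather for a. and d., assuming Portage's usual on-disk layout (/var/db/pkg for installed packages, /var/lib/portage/world for the world file); the function name and output shape are made up for illustration:

```python
import os

def gather_submission(root="/"):
    """Collect a minimal subset of the proposed data points.

    A rough sketch only: paths follow Portage's usual layout, and
    error handling plus the remaining items (USE flags, profile,
    repositories) are left out.
    """
    vardb = os.path.join(root, "var/db/pkg")
    packages = []
    if os.path.isdir(vardb):
        for category in os.listdir(vardb):
            catdir = os.path.join(vardb, category)
            if os.path.isdir(catdir):
                for pkg in os.listdir(catdir):
                    packages.append(f"{category}/{pkg}")
    world_path = os.path.join(root, "var/lib/portage/world")
    world = []
    if os.path.isfile(world_path):
        with open(world_path) as f:
            world = [line.strip() for line in f if line.strip()]
    return {"packages": sorted(packages), "world": world}
```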
2. How do we handle Gentoo derivatives? Some of them could provide
meaningful data, but some could provide false data (e.g. when
derivatives override Gentoo packages). One possible option would be
to filter a.-e. to stuff coming from ::gentoo.


3. How do we keep the data up to date? After all, if we just
accumulate a lot of old data, we will soon stop getting meaningful
results. I suppose we'll need to timestamp all data and remove old
entries.


4. How do we avoid duplication? If some users submit their results
more often than others, they will bias the results. 3. might be
related.


5. How do we handle clusters? Things are simple if we can assume that
people will submit data for a few distinct systems. But what about
companies that run 50 Gentoo machines with the same or similar setup?
What about clusters of 1000 almost identical containers? Big entities
could easily bias the results, but we should also make it possible
for them to participate somehow.


6. Security. We don't want to expose information that could be
correlated with specific systems, as it could disclose their
vulnerabilities.


7. Privacy. Besides the above, our sysadmins would appreciate it if
the data they submitted couldn't be easily correlated with them. If
we don't respect the privacy of our users, we won't get them to
submit data.


8. Spam protection. Finally, the service needs to be resilient to
being spammed with fake data, both from users who want to make their
packages look more important and from script kiddies who want to
prove a point.


My (partial) implementation idea
================================
I think our approach should put privacy and security first, and
attempt to make the best of the data we can get while respecting this
principle. This means no correlation and no tracking.

Once the tool is installed, the user needs to opt in to using it.
This involves accepting a privacy policy and setting up a cronjob.
The tool would suggest a (random?) time for the submission to take
place periodically (say, every week).

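For instance, the suggested cronjob could look something like this sketch (the `gentoostats-submit` command name is hypothetical; the random weekly slot spreads submissions out so the server doesn't see a load spike):

```python
import random

def suggest_cron_entry(command="gentoostats-submit"):
    """Suggest a crontab line that fires once a week at a random time.

    Purely illustrative: minute, hour, and day-of-week are drawn at
    random; the command name is a placeholder for the real tool.
    """
    minute = random.randrange(60)
    hour = random.randrange(24)
    day_of_week = random.randrange(7)  # 0 = Sunday in cron
    return f"{minute} {hour} * * {day_of_week} {command}"
```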
The submission would contain only raw data, without any identifying
information. It would be encrypted using our public key. Once
uploaded, it would be put into our input queue as-is.

Periodically, the input queue would be processed in bulk. The
individual statistics would be updated and the input would be
discarded. This should prevent people from trying to correlate
changes in statistics with individual uploads.

Each counted item would have a timestamp associated with it, and we'd
discard old items each resubmission period. This should ensure that
we keep fresh data, and that people can update their earlier
submissions without us storing identification data.

For example, N users submit their data containing a list of packages
every week. This data is used in bulk to update the counts of
individual packages (technically, to append timestamps to the lists
corresponding to these packages). Data older than one week is
discarded, so we have rough counts of package use during the last
week.

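The bookkeeping described above could be sketched roughly like this in Python (names and structure are illustrative, not a design):

```python
import time

WEEK = 7 * 24 * 3600  # retention window, matching the weekly cadence

class PackageCounts:
    """Per-package submission timestamps, pruned to the last week."""

    def __init__(self):
        self.timestamps = {}  # package name -> list of submission times

    def record_bulk(self, packages, now=None):
        """Process one batch of submissions; the raw input is then discarded."""
        now = time.time() if now is None else now
        for pkg in packages:
            self.timestamps.setdefault(pkg, []).append(now)

    def count(self, pkg, now=None):
        """Rough count of submissions mentioning pkg during the last week."""
        now = time.time() if now is None else now
        fresh = [t for t in self.timestamps.get(pkg, []) if now - t < WEEK]
        self.timestamps[pkg] = fresh  # drop data older than one week
        return len(fresh)
```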
I think this addresses problems 3./6./7.


The other major problem is spam protection. The best semi-anonymous
approach I see is to use submitters' IPv4 addresses (can we support
IPv6 then?). We could set a limit of, say, 10 submissions per IPv4
address per week. If an address exceeded that limit, we could require
CAPTCHA authorization.

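As a sketch, the per-address limit could be a simple sliding window (the numbers mirror the suggestion above; everything else is illustrative):

```python
from collections import defaultdict

WEEK = 7 * 24 * 3600
LIMIT = 10  # submissions per IPv4 address per week, as suggested above

class SubmissionLimiter:
    """Sliding-window limit per address; a sketch of the idea only.

    Addresses over the limit would be sent through a CAPTCHA instead
    of being rejected outright.
    """

    def __init__(self):
        self.seen = defaultdict(list)  # address -> submission times

    def allow(self, address, now):
        # Keep only submissions from the last week, then check the cap.
        window = [t for t in self.seen[address] if now - t < WEEK]
        self.seen[address] = window
        if len(window) >= LIMIT:
            return False  # over the limit: require CAPTCHA
        window.append(now)
        return True
```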
I think this would make spamming a bit harder while keeping
submissions easy for most users, and a little harder but still
possible for those of us behind ISP NATs.

This should address problems 4./8., and maybe 5. to some degree.


A proper solution to the cluster problem would probably involve some
way to internally collect and combine data before submission. If you
have large clusters of similar systems, I think you'd want to have
all packages used on the different systems reported as one entry.


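A minimal sketch of such pre-submission aggregation, assuming each host contributes a plain list of packages (the helper name is made up):

```python
def combine_cluster_data(host_package_lists):
    """Merge per-host package lists into one deduplicated entry.

    Hypothetical helper: a site running many near-identical machines
    would submit the union of their packages once, instead of one
    submission per machine, so the cluster counts as a single entry.
    """
    combined = set()
    for packages in host_package_lists:
        combined.update(packages)
    return sorted(combined)
```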
I think we should collect data from users running all Gentoo
derivatives, as long as they are using Gentoo packages. The simplest
solution I can think of would be to filter the results to packages
(or profiles) installed from ::gentoo. This will only work for
distros that expose ::gentoo explicitly (as opposed to copying our
ebuilds into their own repositories), though.


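The filtering could be as simple as this sketch, assuming the submission records the originating repository for each package (on a real system that name could be read from the vardb, e.g. /var/db/pkg/&lt;category&gt;/&lt;package&gt;/repository in Portage's layout):

```python
def filter_gentoo_packages(installed):
    """Keep only packages whose originating repository is ::gentoo.

    `installed` is a list of (package, repository) pairs; packages
    from derivative-specific overlays are dropped from the statistics.
    """
    return [pkg for pkg, repo in installed if repo == "gentoo"]
```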
What do you think? Do you foresee other problems? Do you have other
needs? Can you think of better solutions?

--
Best regards,
Michał Górny