Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev <gentoo-dev@l.g.o>
Subject: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Date: Sun, 26 Apr 2020 08:09:18
Message-Id: d91a622e923a79e0fcd107c2f9ed85a9a1be2148.camel@gentoo.org
Hi,

The topic of rebooting gentoostats comes up here from time to time.
Unless I'm mistaken, all the efforts so far have been superficial,
lacking a clear plan and unwilling to research the problems. I'd like
to start a serious discussion focused on the issues we need to solve,
and propose some ideas for how we could solve them.

I can't promise I'll find time to implement it. However, I'd like to
get a clear plan on how it should be done if someone actually does it.


The big questions
=================
The way I see it, the primary goal of the project would be to gather
statistics on the popularity of packages, in order to help us
prioritize our attention and make decisions on what to keep and what
to remove. Unlike Debian's popcon, I don't think we really want to
investigate which files are actually used; we should focus on what is
installed.

There are a few important questions that need to be answered first:

1. Which data do we need to collect?

a. list of installed packages?
b. versions (or slots?) of installed packages?
c. USE flags on installed packages?
d. world and world_sets files?
e. system profile?
f. enabled repositories? (possibly filtered to the official list)
g. distribution?

I think d. is the most important, as it gives us information on what
users really want. a. alone is kinda redundant if we have d. c. might
have some value when deciding whether to mask a particular flag (and
it implies a.).

e. would be valuable if we wanted to determine the future of
particular profiles, as well as to e.g. estimate the transition to
new versions.

f. would be valuable for determining which repositories are used, but
we would need to filter private repos from the output for privacy
reasons.

g. could be valuable in correlation with other data, but I'm not sure
there's much direct value in it alone.
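
To make the shape of the data more concrete, here's a rough sketch of
how a.-f. could be gathered on a single system. This assumes Python
and the Portage API; the output format and file locations are purely
illustrative, not a proposal for the final format:

    # Rough collection sketch -- output format is illustrative only.
    import json
    import os

    import portage

    vardb = portage.db[portage.root]["vartree"].dbapi

    def collect():
        packages = []
        for cpv in vardb.cpv_all():
            slot, use, repo = vardb.aux_get(
                cpv, ["SLOT", "USE", "repository"])
            packages.append({
                "cpv": cpv,                  # a. + b.
                "slot": slot,
                "use": sorted(use.split()),  # c.
                "repo": repo,                # f. (per-package origin)
            })
        with open("/var/lib/portage/world") as f:
            world = f.read().split()         # d.
        # e.; make.profile may be a directory rather than a symlink on
        # some setups, so treat this as a sketch only.
        profile = os.readlink("/etc/portage/make.profile")
        return {"packages": packages, "world": world, "profile": profile}

    print(json.dumps(collect(), indent=2))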


2. How to handle Gentoo derivatives? Some of them could provide
meaningful data, but some could provide false data (e.g. when a
derivative overrides Gentoo packages). One option would be to filter
a.-e. to packages coming from ::gentoo.


3. How to keep the data up-to-date? After all, if we just keep piling
up old data, we will soon stop getting meaningful results. I suppose
we'll need to timestamp all data and remove old entries.


4. How to avoid duplication? If some users submit their data more
often than others, they will bias the results. This is related to 3.


5. How to handle clusters? Things are simple if we can assume that
people will submit data for a few distinct systems. But what about
companies that run 50 Gentoo machines with the same or similar setup?
What about clusters of 1000 almost identical containers? Big entities
could easily bias the results, but we should also make it possible for
them to participate somehow.


6. Security. We don't want to expose information that could be
correlated with specific systems, as it could disclose their
vulnerabilities.


7. Privacy. Besides the above, our sysadmins would appreciate it if
the data they submitted couldn't be easily correlated with them. If we
don't respect the privacy of our users, we won't get them to submit
data.


8. Spam protection. Finally, the service needs to be resilient to
being spammed with fake data, both by users who want to make their
packages look more important and by script kiddies who want to prove
a point.


My (partial) implementation idea
================================
I think our approach should put privacy and security first,
and attempt to make the best of the data we can get while respecting
this principle. This means no correlation and no tracking.

Once the tool is installed, the user needs to opt in to using it. This
involves accepting a privacy policy and setting up a cronjob. The tool
would suggest a (random?) time for the submission to take place
periodically (say, every week).

The submission would contain only raw data, without any identifying
information. It would be encrypted using our public key. Once
uploaded, it would be put into our input queue as-is.
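
A minimal sketch of the client-side submission step, reusing the
collect() helper from the sketch above; the recipient key and URL are
made up for illustration:

    # Hypothetical submission step; key ID and URL are placeholders.
    import json
    import subprocess
    import urllib.request

    payload = json.dumps(collect()).encode()
    encrypted = subprocess.run(
        ["gpg", "--encrypt", "--armor",
         "--recipient", "gentoostats@gentoo.org"],
        input=payload, stdout=subprocess.PIPE, check=True).stdout
    urllib.request.urlopen("https://stats.gentoo.org/submit",
                           data=encrypted)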

Periodically, the input queue would be processed in bulk. The
individual statistics would be updated and the input would be
discarded. This should prevent people from correlating changes in
the statistics with individual uploads.

Each counted item would have an associated timestamp, and we'd discard
items older than the resubmission period. This should ensure that we
keep fresh data and that people can update their earlier submissions
without us storing any identifying data.

For example, N users submit their data containing a list of packages
every week. This data is used in bulk to update the counts of
individual packages (technically, to append timestamps to the lists
corresponding to these packages). Data older than one week is
discarded, so we have rough counts of package use during the last
week.
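
A rough server-side sketch of that bookkeeping (in-memory only and
with the package list simplified to plain names; a real service would
persist this):

    # Per-package timestamp lists; anything older than the window is
    # pruned on the next bulk run, so no per-submitter state is kept.
    import time

    WINDOW = 7 * 24 * 3600        # one week, matching the cron period
    counts = {}                   # "cat/pkg" -> [timestamps]

    def process(batch):
        now = time.time()
        for submission in batch:
            for pkg in submission["packages"]:
                counts.setdefault(pkg, []).append(now)
        for pkg, stamps in counts.items():
            counts[pkg] = [t for t in stamps if now - t < WINDOW]

    def popularity(pkg):
        return len(counts.get(pkg, []))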

I think this addresses problems 3./6./7.


The other major problem is spam protection. The best semi-anonymous
way I can see is to use the submitter's IPv4 address (can we support
IPv6 then?). We could set a limit of, say, 10 submissions per IPv4
address per week. If an address exceeded that limit, we could require
solving a CAPTCHA.
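
A naive sketch of that limit (illustrative only; a real service would
keep this in a database rather than in memory):

    # Per-address rate limit: 10 submissions per week, as suggested
    # above.  Addresses over the limit get a CAPTCHA instead.
    import time
    from collections import defaultdict

    LIMIT = 10
    PERIOD = 7 * 24 * 3600
    recent = defaultdict(list)    # address -> submission timestamps

    def allow(addr):
        now = time.time()
        recent[addr] = [t for t in recent[addr] if now - t < PERIOD]
        if len(recent[addr]) >= LIMIT:
            return False          # over the limit: require a CAPTCHA
        recent[addr].append(now)
        return True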

I think this would make spamming a bit harder while keeping
submissions easy for most users, and a little harder but still
possible for those of us behind ISP NATs.

This should address problems 4./8. and maybe 5. to some degree.


A proper solution to the cluster problem would probably involve some
way to internally collect and combine data before submission. If you
have large clusters of similar systems, I think you'd want to have all
packages used across the different systems reported as one entry.
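
For instance, a cluster-side aggregation step could look roughly like
this (merge() and the per-host report format are made up for
illustration):

    # Merge per-host package lists into a single submission so that
    # a 1000-node cluster counts as one entry rather than 1000.
    def merge(host_reports):
        merged = set()
        for report in host_reports:
            merged.update(report["packages"])
        return {"packages": sorted(merged)}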


I think we should collect data from users of all Gentoo derivatives,
as long as they are using Gentoo packages. The simplest solution
I can think of would be to filter the results to packages (or
profiles) installed from ::gentoo. This will only work for distros
that expose ::gentoo explicitly (as opposed to copying our ebuilds
into their own repositories), though.
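
That filter could be as simple as checking the per-package origin
repository recorded in the vdb, e.g. reusing the "repo" field from the
collection sketch above:

    # Keep only packages whose recorded origin repository is ::gentoo.
    def gentoo_only(report):
        report["packages"] = [p for p in report["packages"]
                              if p.get("repo") == "gentoo"]
        return report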


What do you think? Do you foresee other problems? Do you have other
needs? Can you think of better solutions?

--
Best regards,
Michał Górny

Attachments

File name       MIME type
signature.asc   application/pgp-signature
