Hi,

The topic of rebooting gentoostats comes up here from time to time.
Unless I'm mistaken, all the efforts so far have been superficial,
lacking a clear plan and any real research into the problems. I'd like
to start a serious discussion focused on the issues we need to solve,
and propose some ideas for how we could solve them.

I can't promise I'll find the time to implement it. However, I'd like
to have a clear plan for how it should be done, in case someone
actually does it.


The big questions
=================
The way I see it, the primary goal of the project would be to gather
statistics on the popularity of packages, in order to help us
prioritize our attention and make decisions on what to keep and what
to remove. Unlike Debian's popcon, I don't think we really want to
investigate which files are actually used; we should focus on what's
installed.

There are a few important questions that need to be answered first:

1. Which data do we need to collect?

a. list of installed packages?
b. versions (or slots?) of installed packages?
c. USE flags on installed packages?
d. world and world_sets files?
e. system profile?
f. enabled repositories? (possibly filtered to the official list)
g. distribution?

I think d. is the most important, as it tells us what users really
want. a. alone is somewhat redundant if we have d. c. might have some
value when deciding whether to mask a particular flag (and it implies
a.).

e. would be valuable if we wanted to determine the future of
particular profiles, as well as e.g. estimate the transition to new
versions.

f. would be valuable for determining which repositories are used, but
we'd need to filter private repos out of the output for privacy
reasons.

g. could be valuable in correlation with other data, but I'm not sure
there's much direct value in it alone.


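To make the data points concrete, here is a rough Python sketch of what a collection tool might gather for a. and d., assuming Portage's usual on-disk layout (/var/db/pkg for installed packages, /var/lib/portage/world for the world file); the function name and output shape are made up for illustration:

```python
import os

def gather_submission(root="/"):
    """Collect a minimal subset of the proposed data points.

    A rough sketch only: paths follow Portage's usual layout, and
    error handling plus the remaining items (USE flags, profile,
    repositories) are left out.
    """
    vardb = os.path.join(root, "var/db/pkg")
    packages = []
    if os.path.isdir(vardb):
        for category in os.listdir(vardb):
            catdir = os.path.join(vardb, category)
            if os.path.isdir(catdir):
                for pkg in os.listdir(catdir):
                    packages.append(f"{category}/{pkg}")
    world_path = os.path.join(root, "var/lib/portage/world")
    world = []
    if os.path.isfile(world_path):
        with open(world_path) as f:
            world = [line.strip() for line in f if line.strip()]
    return {"packages": sorted(packages), "world": world}
```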
2. How do we handle Gentoo derivatives? Some of them could provide
meaningful data, but some could provide false data (e.g. when
derivatives override Gentoo packages). One possible option would be
to filter a.-e. to stuff coming from ::gentoo.


3. How do we keep the data up to date? After all, if we just
accumulate a lot of old data, we will soon stop getting meaningful
results. I suppose we'll need to timestamp all data and remove old
entries.


4. How do we avoid duplication? If some users submit their results
more often than others, they will bias the results. 3. might be
related.


5. How do we handle clusters? Things are simple if we can assume that
people will submit data for a few distinct systems. But what about
companies that run 50 Gentoo machines with the same or similar setup?
What about clusters of 1000 almost identical containers? Big entities
could easily bias the results, but we should also make it possible
for them to participate somehow.


6. Security. We don't want to expose information that could be
correlated with specific systems, as it could disclose their
vulnerabilities.


7. Privacy. Besides the above, our sysadmins would appreciate it if
the data they submitted couldn't be easily correlated with them. If
we don't respect the privacy of our users, we won't get them to
submit data.


8. Spam protection. Finally, the service needs to be resilient to
being spammed with fake data, both from users who want to make their
packages look more important and from script kiddies who want to
prove a point.


My (partial) implementation idea
================================
I think our approach should put privacy and security first, and
attempt to make the best of the data we can get while respecting this
principle. This means no correlation and no tracking.

Once the tool is installed, the user needs to opt in to using it.
This involves accepting a privacy policy and setting up a cronjob.
The tool would suggest a (random?) time for the submission to take
place periodically (say, every week).

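For instance, the suggested cronjob could look something like this sketch (the `gentoostats-submit` command name is hypothetical; the random weekly slot spreads submissions out so the server doesn't see a load spike):

```python
import random

def suggest_cron_entry(command="gentoostats-submit"):
    """Suggest a crontab line that fires once a week at a random time.

    Purely illustrative: minute, hour, and day-of-week are drawn at
    random; the command name is a placeholder for the real tool.
    """
    minute = random.randrange(60)
    hour = random.randrange(24)
    day_of_week = random.randrange(7)  # 0 = Sunday in cron
    return f"{minute} {hour} * * {day_of_week} {command}"
```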
The submission would contain only raw data, without any identifying
information. It would be encrypted using our public key. Once
uploaded, it would be put into our input queue as-is.

Periodically, the input queue would be processed in bulk. The
individual statistics would be updated and the input would be
discarded. This should prevent people from trying to correlate
changes in statistics with individual uploads.

Each counted item would have a timestamp associated with it, and we'd
discard old items each resubmission period. This should ensure that
we keep fresh data, and that people can update their earlier
submissions without us storing identification data.

For example, N users submit their data containing a list of packages
every week. This data is used in bulk to update the counts of
individual packages (technically, to append timestamps to the lists
corresponding to these packages). Data older than one week is
discarded, so we have rough counts of package use during the last
week.

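The bookkeeping described above could be sketched roughly like this in Python (names and structure are illustrative, not a design):

```python
import time

WEEK = 7 * 24 * 3600  # retention window, matching the weekly cadence

class PackageCounts:
    """Per-package submission timestamps, pruned to the last week."""

    def __init__(self):
        self.timestamps = {}  # package name -> list of submission times

    def record_bulk(self, packages, now=None):
        """Process one batch of submissions; the raw input is then discarded."""
        now = time.time() if now is None else now
        for pkg in packages:
            self.timestamps.setdefault(pkg, []).append(now)

    def count(self, pkg, now=None):
        """Rough count of submissions mentioning pkg during the last week."""
        now = time.time() if now is None else now
        fresh = [t for t in self.timestamps.get(pkg, []) if now - t < WEEK]
        self.timestamps[pkg] = fresh  # drop data older than one week
        return len(fresh)
```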
I think this addresses problems 3./6./7.


The other major problem is spam protection. The best semi-anonymous
approach I see is to use submitters' IPv4 addresses (can we support
IPv6 then?). We could set a limit of, say, 10 submissions per IPv4
address per week. If an address exceeded that limit, we could require
CAPTCHA authorization.

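As a sketch, the per-address limit could be a simple sliding window (the numbers mirror the suggestion above; everything else is illustrative):

```python
from collections import defaultdict

WEEK = 7 * 24 * 3600
LIMIT = 10  # submissions per IPv4 address per week, as suggested above

class SubmissionLimiter:
    """Sliding-window limit per address; a sketch of the idea only.

    Addresses over the limit would be sent through a CAPTCHA instead
    of being rejected outright.
    """

    def __init__(self):
        self.seen = defaultdict(list)  # address -> submission times

    def allow(self, address, now):
        # Keep only submissions from the last week, then check the cap.
        window = [t for t in self.seen[address] if now - t < WEEK]
        self.seen[address] = window
        if len(window) >= LIMIT:
            return False  # over the limit: require CAPTCHA
        window.append(now)
        return True
```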
I think this would make spamming a bit harder while keeping
submissions easy for most users, and a little harder but still
possible for those of us behind ISP NATs.

This should address problems 4./8., and maybe 5. to some degree.


A proper solution to the cluster problem would probably involve some
way to internally collect and combine data before submission. If you
have large clusters of similar systems, I think you'd want to have
all packages used on the different systems reported as one entry.


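A minimal sketch of such pre-submission aggregation, assuming each host contributes a plain list of packages (the helper name is made up):

```python
def combine_cluster_data(host_package_lists):
    """Merge per-host package lists into one deduplicated entry.

    Hypothetical helper: a site running many near-identical machines
    would submit the union of their packages once, instead of one
    submission per machine, so the cluster counts as a single entry.
    """
    combined = set()
    for packages in host_package_lists:
        combined.update(packages)
    return sorted(combined)
```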
I think we should collect data from users running all Gentoo
derivatives, as long as they are using Gentoo packages. The simplest
solution I can think of would be to filter the results to packages
(or profiles) installed from ::gentoo. This will only work for
distros that expose ::gentoo explicitly (as opposed to copying our
ebuilds into their own repositories), though.


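The filtering could be as simple as this sketch, assuming the submission records the originating repository for each package (on a real system that name could be read from the vardb, e.g. /var/db/pkg/&lt;category&gt;/&lt;package&gt;/repository in Portage's layout):

```python
def filter_gentoo_packages(installed):
    """Keep only packages whose originating repository is ::gentoo.

    `installed` is a list of (package, repository) pairs; packages
    from derivative-specific overlays are dropped from the statistics.
    """
    return [pkg for pkg, repo in installed if repo == "gentoo"]
```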
What do you think? Do you foresee other problems? Do you have other
needs? Can you think of better solutions?

--
Best regards,
Michał Górny