Hi everyone,

gentoostats is new to me and I'm not aware of previous discussions or
implementations. But from what I could gather from the comments and
Michał Górny's explanation, I would like to draw your attention to the
Octoverse[1] initiative.
Maybe the collected statistics could come through a platform that
gathers additional metadata from user contributions. What I mean is a
broker that collects all statistics within an organization internally
and then publishes them in aggregate. Such a solution would add value
for enterprise statistics while also contributing the data back to
Gentoo.
Each broker could then use git authentication to publish its results as
a merge request, which would run the necessary hooks on the Gentoo
side. All we would need is a specification document for parsing the
data.
Sorry if my comment is completely out of context, but such an Octoverse
for Gentoo would be very interesting from my perspective.

Best,

Samuel

[1] https://octoverse.github.com/
On 4/26/20 9:08 AM, Michał Górny wrote:
> Hi,
>
> The topic of rebooting gentoostats comes up here from time to time.
> Unless I'm mistaken, all the efforts so far were superficial, lacking
> a clear plan and unwilling to research the problems. I'd like to start
> a serious discussion focused on the issues we need to solve, and
> propose some ideas for how we could solve them.
>
> I can't promise I'll find time to implement it. However, I'd like to
> get a clear plan on how it should be done if someone actually does it.
>
>
> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on the popularity of packages, in order to help us
> prioritize our attention and make decisions on what to keep and what
> to remove. Unlike Debian's popcon, I don't think we really want to
> investigate which files are actually used; we should focus on what's
> installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
> a. list of installed packages?
> b. versions (or slots?) of installed packages?
> c. USE flags on installed packages?
> d. world and world_sets files?
> e. system profile?
> f. enabled repositories? (possibly filtered to the official list)
> g. distribution?
>
> I think d. is the most important, as it gives us information on what
> users really want. a. alone is somewhat redundant if we have d.
> c. might have some value when deciding whether to mask a particular
> flag (and implies a.).
>
> e. would be valuable if we wanted to determine the future of
> particular profiles, as well as e.g. estimate the transition to new
> versions.
>
> f. would be valuable to determine which repositories are used, but we
> would need to filter private repos from the output for privacy
> reasons.
>
> g. could be valuable in correlation with other data, but I'm not sure
> there's much direct value in it alone.
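To make the data points above concrete: a submission could be a small
JSON document along these lines. This is only a sketch; every field
name here is an assumption on my part, not a proposed spec.

```python
import json

# Illustrative submission payload covering points a./c./d./e./f./g.
# All field names are hypothetical, not an agreed format.
submission = {
    "world": ["app-editors/vim", "www-client/firefox"],      # d.
    "world_sets": ["@kde"],                                  # d.
    "packages": {                                            # a. + c.
        "app-editors/vim": {"use": ["acl", "-gpm"]},
        "www-client/firefox": {"use": ["pulseaudio"]},
    },
    "profile": "default/linux/amd64/17.1",                   # e.
    "repositories": ["gentoo", "guru"],                      # f.
    "distribution": "gentoo",                                # g.
}

payload = json.dumps(submission, sort_keys=True)
print(payload)
```

A document specification like Samuel mentions would mostly consist of
pinning down such a schema and its allowed values.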
>
>
> 2. How to handle Gentoo derivatives? Some of them could provide
> meaningful data, but some could provide false data (e.g. when
> derivatives override Gentoo packages). One possible option would be
> to filter a.-e. down to stuff coming from ::gentoo.
>
>
> 3. How to keep the data up-to-date? After all, if we just stack up a
> lot of old data, we will soon stop getting meaningful results. I
> suppose we'll need to timestamp all data and remove old entries.
>
>
> 4. How to avoid duplication? If some users submit their results more
> often than others, they would bias the results. 3. might be related.
>
>
> 5. How to handle clusters? Things are simple if we can assume that
> people will submit data for a few distinct systems. But what about
> companies that run 50 Gentoo machines with the same or similar setup?
> What about clusters of 1000 almost identical containers? Big entities
> could easily bias the results, but we should also make it possible
> for them to participate somehow.
>
>
> 6. Security. We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.
>
>
> 7. Privacy. Besides the above, our sysadmins would appreciate it if
> the data they submitted couldn't be easily correlated to them. If we
> don't respect the privacy of our users, we won't get them to submit
> data.
>
>
> 8. Spam protection. Finally, the service needs to be resilient to
> being spammed with fake data, both by users who want to make their
> packages look more important and by script kiddies who want to prove
> a point.
>
>
> My (partial) implementation idea
> ================================
> I think our approach should be privacy- and security-oriented first,
> and attempt to make the best of the data we can get while respecting
> this principle. This means no correlation and no tracking.
>
> Once the tool is installed, the user needs to opt in to using it.
> This involves accepting a privacy policy and setting up a cronjob.
> The tool would suggest a (random?) time for the submission to take
> place periodically (say, every week).
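The "suggest a random time" step could be as simple as picking a random
weekly slot and formatting a crontab line. A sketch; the command name
`gentoostats-submit` is made up here, not an existing tool:

```python
import random

def suggest_cron_entry(command="gentoostats-submit"):
    """Pick a random weekly slot and format it as a crontab line.

    The command name is a placeholder. Spreading submissions over
    random times also avoids a weekly thundering herd on the server.
    """
    minute = random.randrange(60)
    hour = random.randrange(24)
    weekday = random.randrange(7)  # 0 = Sunday in most cron dialects
    return f"{minute} {hour} * * {weekday} {command}"

print(suggest_cron_entry())
```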
>
> The submission would contain only raw data, without any identifying
> information. It would be encrypted using our public key. Once
> uploaded, it would be put into our input queue as-is.
>
> Periodically, the input queue would be processed in bulk. The
> individual statistics would be updated and the input would be
> discarded. This should prevent people from trying to correlate
> changes in statistics with individual uploads.
>
> Each counted item would have a timestamp associated with it, and we'd
> discard old items once per resubmission period. This should ensure
> that we keep fresh data and that people can update their earlier
> submissions without us storing identification data.
>
> For example, N users submit their data containing a list of packages
> every week. This data is used in bulk to update the counts of
> individual packages (technically, to append timestamps to the lists
> corresponding to these packages). Data older than one week is
> discarded, so we have rough counts of package use during the last
> week.
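The per-package timestamp bookkeeping described above could look
roughly like this. A sketch only: it uses in-memory dicts where a real
service would use a database, and the class name is invented.

```python
import time
from collections import defaultdict

WEEK = 7 * 24 * 3600

class PackageStats:
    """Store one timestamp per (submission, package); a package's count
    is the number of timestamps younger than the resubmission period."""

    def __init__(self):
        self._hits = defaultdict(list)  # package -> [timestamps]

    def process_batch(self, submissions, now=None):
        # submissions: iterable of package lists, processed in bulk so
        # individual uploads can't be told apart afterwards.
        now = time.time() if now is None else now
        for packages in submissions:
            for pkg in packages:
                self._hits[pkg].append(now)
        self._expire(now)

    def _expire(self, now):
        # Drop data older than one resubmission period.
        for pkg in list(self._hits):
            fresh = [t for t in self._hits[pkg] if now - t < WEEK]
            if fresh:
                self._hits[pkg] = fresh
            else:
                del self._hits[pkg]

    def count(self, pkg):
        return len(self._hits.get(pkg, ()))

stats = PackageStats()
stats.process_batch([["app-editors/vim"],
                     ["app-editors/vim", "sys-apps/ripgrep"]], now=0)
stats.process_batch([["app-editors/vim"]], now=WEEK + 1)
print(stats.count("app-editors/vim"))  # → 1 (week-old entries expired)
```

Resubmitting then simply refreshes a system's contribution without the
server ever knowing which rows belonged to which system.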
>
> I think this addresses problems 3./6./7.
>
>
> The other major problem is spam protection. The best semi-anonymous
> way I can see is to use the submitters' IPv4 addresses (can we
> support IPv6 then?). We could set a limit of, say, 10 submissions per
> IPv4 address per week. If some address exceeded that limit, we could
> require CAPTCHA authorization.
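A sliding-window limit like that is a few lines of bookkeeping. A
sketch; a real deployment would persist the state and sit behind the
upload endpoint rather than in memory:

```python
import time
from collections import defaultdict, deque

WEEK = 7 * 24 * 3600
LIMIT = 10  # submissions per address per week, as proposed above

class RateLimiter:
    def __init__(self, limit=LIMIT, window=WEEK):
        self.limit = limit
        self.window = window
        self._seen = defaultdict(deque)  # address -> recent timestamps

    def allow(self, address, now=None):
        """Return True if this address may submit now; False means
        'fall back to a CAPTCHA' rather than a hard reject."""
        now = time.time() if now is None else now
        q = self._seen[address]
        while q and now - q[0] >= self.window:
            q.popleft()  # forget submissions older than the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

limiter = RateLimiter()
results = [limiter.allow("192.0.2.7", now=i) for i in range(12)]
print(results.count(True))  # → 10: the 11th and 12th hit the limit
```

Note the only state kept per address is a handful of timestamps, which
fits the no-tracking principle better than storing whole submissions.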
>
> I think this would make spamming a bit harder while keeping
> submissions easy for most people, and a little harder but still
> possible for those of us behind ISP NATs.
>
> This should address problems 4./8. and maybe 5. to some degree.
>
>
> A proper solution to the cluster problem would probably involve some
> way to internally collect and combine data before submission. If you
> have large clusters of similar systems, I think you'd want to have
> all packages used on the different systems reported as one entry.
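Such an internal collector could deduplicate package lists across hosts
before anything leaves the organization, e.g. as below. A sketch;
whether the combined entry should carry a host count at all is exactly
the kind of thing the spec would have to decide:

```python
def combine_cluster(host_reports):
    """Merge per-host package lists into a single submission entry.

    host_reports: {hostname: [package, ...]}. Taking the union means
    1000 near-identical containers count once per package rather than
    1000 times, and hostnames never leave the organization.
    """
    combined = sorted({pkg for pkgs in host_reports.values()
                       for pkg in pkgs})
    return {"packages": combined, "hosts": len(host_reports)}

reports = {
    "node01": ["app-editors/vim", "net-misc/curl"],
    "node02": ["app-editors/vim", "net-misc/curl"],
    "node03": ["app-editors/vim", "dev-db/postgresql"],
}
print(combine_cluster(reports))
```

This is also where Samuel's broker idea would slot in naturally.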
>
>
> I think we should collect data from users running all Gentoo
> derivatives, as long as they are using Gentoo packages. The simplest
> solution I can think of would be to filter the results to packages
> (or profiles) installed from ::gentoo. This will only work for
> distros that expose ::gentoo explicitly (vs. copying our ebuilds to
> their repositories), though.
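Since the installed-package database records the originating repository
per package, the client-side filter reduces to something like the
following. A sketch over an in-memory mapping; the real tool would read
the data via Portage's API rather than take a dict:

```python
def filter_to_repo(installed, repo="gentoo"):
    """Keep only packages whose recorded origin is the given repo.

    installed: {package: repository-name}, as the client would obtain
    it from the installed-package database. Packages from private or
    unknown repos are dropped before anything is submitted, which also
    helps with the privacy concern in point f.
    """
    return sorted(pkg for pkg, origin in installed.items()
                  if origin == repo)

installed = {
    "app-editors/vim": "gentoo",
    "dev-util/mytool": "my-private-overlay",  # hypothetical private repo
    "net-misc/curl": "gentoo",
}
print(filter_to_repo(installed))  # → ['app-editors/vim', 'net-misc/curl']
```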
>
>
> What do you think? Do you foresee other problems? Do you have other
> needs? Can you think of better solutions?
>