Gentoo Archives: gentoo-dev

From: Samuel Bernardo <samuelbernardo.mail@×××××.com>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Date: Sun, 26 Apr 2020 17:24:18
Message-Id: 9a298a89-2fb9-7299-5617-9a3b0cb97c7c@gmail.com
In Reply to: [gentoo-dev] [RFC] Ideas for gentoostats implementation by "Michał Górny"
Hi everyone,

gentoostats is new to me and I'm not aware of previous
discussions or implementations. But from what I could understand from the
comments and Michał Górny's explanation, I would like to draw your
attention to the Octoverse[1] initiative.

Maybe the statistics could be collected through a platform that also
gathers additional metadata for the stats from user contributions. What I
mean is a broker that collects all statistics from an organization
internally and then publishes them. Such a solution would add value
for enterprise statistics and also contribute to Gentoo in the end.

Each broker could use git authentication to publish the results
through a merge request that would run the necessary hooks on the
Gentoo side. We would only need a specification document for parsing
the data.

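To make the "specification for data parsing" idea a bit more concrete,
here is a minimal sketch of what a broker report and a validation hook
could look like. Everything in it is an assumption made for
illustration: the JSON layout, the field names (format_version, systems,
packages) and the parse_report function are hypothetical and not an
existing gentoostats format.

```python
import json

# Hypothetical report layout (illustration only, not an agreed format):
# {"format_version": 1, "systems": 50, "packages": ["dev-lang/python", ...]}
REQUIRED_FIELDS = {"format_version": int, "systems": int, "packages": list}


def parse_report(text):
    """Parse a broker report and reject anything that does not match the layout."""
    report = json.loads(text)
    if not isinstance(report, dict):
        raise ValueError("report must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        value = report.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"missing or invalid field: {field}")
    if report["systems"] < 1:
        raise ValueError("systems must be at least 1")
    for name in report["packages"]:
        # Expect plain category/package names, e.g. "sys-apps/portage"
        if not isinstance(name, str) or name.count("/") != 1:
            raise ValueError(f"invalid package name: {name!r}")
    # Sort and de-duplicate so identical reports always parse to the same result
    report["packages"] = sorted(set(report["packages"]))
    return report
```

A hook on the receiving side could call parse_report() on each file in
the merge request and refuse the request when it raises ValueError.
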
Sorry if my comment is completely out of context, but such an Octoverse
for Gentoo would be very interesting from my perspective.

Best,

Samuel

[1] https://octoverse.github.com/

On 4/26/20 9:08 AM, Michał Górny wrote:
> Hi,
>
> The topic of rebooting gentoostats comes up here from time to time. Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems. I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it. However, I'd like to
> get a clear plan on how it should be done if someone actually does it.
>
>
> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove. Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
> a. list of installed packages?
> b. versions (or slots?) of installed packages?
> c. USE flags on installed packages?
> d. world and world_sets files
> e. system profile?
> f. enabled repositories? (possibly filtered to official list)
> g. distribution?
>
> I think d. is most important as it gives us information on what users
> really want. a. alone is kinda redundant if we have d. c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
>
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
>
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.
>
> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.
>
>
> 2. How to handle Gentoo derivatives? Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages). One possible option would be to filter a.-e.
> to stuff coming from ::gentoo.
>
>
> 3. How to keep the data up-to-date? After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results. I suppose
> we'll need to timestamp all data and remove old entries.
>
>
> 4. How to avoid duplication? If some users submit their results more
> often than others, they would bias the results. 3. might be related.
>
>
> 5. How to handle clusters? Things are simple if we can assume that
> people will submit data for a few distinct systems. But what about
> companies that run 50 Gentoo machines with the same or similar setup?
> What about clusters of 1000 almost identical containers? Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.
>
>
> 6. Security. We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.
>
>
> 7. Privacy. Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them. If we
> don't respect privacy of our users, we won't get them to submit data.
>
>
> 8. Spam protection. Finally, the service needs to be resilient to being
> spammed with fake data. Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.
>
>
> My (partial) implementation idea
> ================================
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle. This means no correlation and no tracking.
>
> Once the tool is installed, the user needs to opt-in to using it. This
> involves accepting a privacy policy and setting up a cronjob. The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).
>
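
A minimal sketch of how the tool could suggest a random weekly time,
expressed as the five time fields of a crontab entry. The function name
suggest_cron_schedule is hypothetical, and the command to run is left
out on purpose since no client exists yet.

```python
import random


def suggest_cron_schedule(rng=random):
    """Return crontab time fields (minute hour day-of-month month day-of-week)
    for one randomly chosen moment per week."""
    minute = rng.randrange(60)   # 0-59
    hour = rng.randrange(24)     # 0-23
    weekday = rng.randrange(7)   # 0-6, where 0 is Sunday
    # "*" for day-of-month and month means "every", so this fires once a week
    return f"{minute} {hour} * * {weekday}"
```

Passing a seeded random.Random instance as rng makes the suggestion
reproducible, which is handy for testing.
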
> The submission would contain only raw data, without any identification
> information. It would be encrypted using our public key. Once
> uploaded, it would be put into our input queue as-is.
>
> Periodically the input queue would be processed in bulk. The individual
> statistics would be updated and the input would be discarded. This
> should prevent people trying to correlate changes in statistics with
> individual uploads.
>
> Each counted item would have a timestamp associated, and we'd discard
> old items per resubmission period. This should ensure that we keep
> fresh data and people can update their earlier submissions without
> storing identification data.
>
> For example, N users submit their data containing a list of packages
> every week. This data is used in bulk to update counts of individual
> packages (technically, to append timestamps to list corresponding to
> these packages). Data older than one week is discarded, so we have
> rough counts of package use during the last week.
>
> I think this addresses problems 3./6./7.
>
>
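
The bulk processing and expiry described above can be sketched roughly
as follows. This is only an illustration under assumptions: the class
and method names are made up, timestamps are plain seconds, and every
item in a batch gets the time the batch was processed rather than the
time of upload.

```python
from collections import defaultdict

WEEK = 7 * 24 * 3600  # resubmission period in seconds (assumed: one week)


class PackageCounts:
    """Keeps one timestamp per counted submission, per package."""

    def __init__(self, period=WEEK):
        self.period = period
        self.stamps = defaultdict(list)  # package name -> list of timestamps

    def process_queue(self, queue, now):
        """Fold a batch of anonymous submissions into the counts, then drop it."""
        for packages in queue:
            for package in set(packages):  # count each package once per submission
                self.stamps[package].append(now)
        queue.clear()  # the raw input is discarded after processing

    def expire(self, now):
        """Discard timestamps older than one resubmission period."""
        cutoff = now - self.period
        for package in list(self.stamps):
            fresh = [t for t in self.stamps[package] if t > cutoff]
            if fresh:
                self.stamps[package] = fresh
            else:
                del self.stamps[package]

    def count(self, package):
        """Number of submissions from the current period that listed the package."""
        return len(self.stamps.get(package, []))
```

Nothing here links a timestamp to a submitter, so a resubmission after a
week simply replaces the expired entries.
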
> The other major problem is spam protection. The best semi-anonymous way
> I see is to use submitter's IPv4 addresses (can we support IPv6 then?).
> We could set a limit of, say, 10 submissions per IPv4 address per week.
> If some address would exceed that limit, we could require CAPTCHA
> authorization.
>
> I think this would make spamming a bit harder while keeping submissions
> easy for most, and a little harder but possible for those of us
> behind ISP NATs.
>
> This should address problems 4./8. and maybe 5. to some degree.
>
>
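
A minimal sketch of the per-address limit, using the figure of 10
submissions per week from the proposal. The class name, the
"accept"/"captcha" return values and the in-memory storage are
assumptions for illustration. The limiter has to remember addresses for
the length of the window, but it never sees the submission contents.

```python
import time
from collections import defaultdict, deque

LIMIT = 10              # submissions allowed per address (figure from the proposal)
WINDOW = 7 * 24 * 3600  # one week, in seconds


class SubmissionLimiter:
    def __init__(self, limit=LIMIT, window=WINDOW):
        self.limit = limit
        self.window = window
        self.seen = defaultdict(deque)  # address -> timestamps of recent submissions

    def check(self, address, now=None):
        """Return "accept" or "captcha" for a submission from this address."""
        now = time.time() if now is None else now
        recent = self.seen[address]
        while recent and recent[0] <= now - self.window:
            recent.popleft()  # forget submissions outside the window
        if len(recent) >= self.limit:
            return "captcha"
        recent.append(now)
        return "accept"
```

Because addresses are treated as opaque strings, the same code would
work for IPv6, although a per-prefix limit would probably be needed
there.
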
> A proper solution to cluster problem would probably involve some way to
> internally collect and combine data before submission. If you have
> large clusters of similar systems, I think you'd want to have all
> packages used on different systems reported as one entry.
>
>
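
Combining a cluster into one entry could be as simple as taking the
union of the per-host package lists. A minimal sketch, assuming the
broker already holds a mapping from host name to package list; the
function name and the input shape are hypothetical.

```python
def combine_cluster(host_reports):
    """Merge per-host package lists into one sorted, de-duplicated entry."""
    combined = set()
    for packages in host_reports.values():
        combined.update(packages)
    # Host names are dropped here, so only the package list leaves the organization
    return sorted(combined)


cluster = {
    "node1": ["sys-apps/portage", "dev-lang/python"],
    "node2": ["sys-apps/portage", "www-servers/nginx"],
}
entry = combine_cluster(cluster)
```

With this approach 1000 identical containers count the same as one
machine, which is the behaviour described above.
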
> I think we should collect data from users running all Gentoo
> derivatives, as long as they are using Gentoo packages. The simplest
> solution I can think of would be to filter the results on packages (or
> profiles) installed from ::gentoo. This will work only for distros that
> expose ::gentoo explicitly (vs copying our ebuilds to their
> repositories) though.
>
>
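
The ::gentoo filter can be sketched on the client side like this. It
assumes the tool already has the installed packages as strings of the
form "category/package::repository"; how that list is obtained is left
open, and the function name is hypothetical.

```python
def only_from_gentoo(installed, repo="gentoo"):
    """Keep packages whose '::repository' suffix names the given repository."""
    kept = []
    for entry in installed:
        name, sep, origin = entry.rpartition("::")
        # Entries without a '::' suffix have an unknown origin and are skipped
        if sep and origin == repo:
            kept.append(name)
    return kept


installed = [
    "sys-apps/portage::gentoo",
    "app-misc/foo::my-overlay",
    "dev-lang/python",
]
reported = only_from_gentoo(installed)
```

Filtering before submission also keeps the names of private
repositories from ever leaving the machine.
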
> What do you think? Do you foresee other problems? Do you have other
> needs? Can you think of better solutions?
>
