Hi Michał, and the rest of the Gentoo devs,

I've been patiently sitting and watching this discussion.

I raised some ideas with another developer (not Michał) just days before
he raised this thread to the ML.

I believe all points raised so far are valid. I'll try to summarise:

1. This must be completely *opt in*.
2. Anonymity was discussed by various parties (privacy).
3. "Spam" protection (ie, preventing bogus data from entering).
4. Trustworthiness of data.
5. Acceptance of some form of privacy policy.

In my opinion, points 2 and 3 work against each other: if registration
is compulsory for anyone who would like to submit stats, then we can
control the spam more easily (not foolproof), but requiring
registration also raises the entry barrier. I'd be completely willing
to provide at least an email address as part of a submission.

All of the replies seem to have focused purely on yes/no, do it or
don't. Not many have addressed the benefits to end users/system
administrators. The focus seems to be on what we as developers can get
out of this.

Regarding the above points:

1. I fully agree. This should not be forced on anyone.
2. Happy to concede that some people may wish to submit anonymously.
Let them.
3. I'll address this below.
4. A lot of the discussion has been around the usefulness of the data,
and I concede to Thomas that this may (or may not) generate "decision
blind spots" or "artificially increase decision certainty". I
don't see how this is worse than what we've got now.
5. We have the infrastructure for this already by way of licenses. So
we ship with "GPLv2/3/whatever + GentooPrivacy", and users have to first
take explicit action to accept GentooPrivacy.

I have some other ideas around this, which will tread even further on
privacy, but again, all of this should be a kind of opt-in. Building
on the ideas by Kent, where he suggested a form of submission proxy
(STATS_SERVER), we could potentially give the full benefit of the code
to such entities, but then still allow them to submit "upstream" in a
more filtered manner.

Bottom line, in my opinion: Any data is better than no data!

Whilst we can't say "no one is using xyz", we will at least be able to
say "hey, some people are using xyz", and whilst this may generate some
blind spots, it at least enables us to test known use cases during
test-builds, eg, we know for a fact a thousand users are using package X
with USE flags "-* a b c", so we should definitely run that as a compile
test. Your build breaks frequently? Would you mind submitting stats?
Great, thank you. If you're not willing to do that, then my stance
becomes one of "ok, I'll help where I can, but really, please consider
helping us to help you: if you submit stats we can pre-emptively at
least include build tests for your specific USE flags." And again, this
means we can actually have our tooling use these stats to generate build
tests for the "known popular" configs.

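To make the "known popular" configs idea a bit more concrete, here is a
rough Python sketch of how tooling might pick the most common USE flag
combinations for a package. The data layout (one dict per submission,
mapping package name to a list of USE flags) and the function name are
purely my assumptions for illustration, nothing like this exists yet.

```python
from collections import Counter


def popular_use_configs(submissions, package, top=3):
    """Return the most common USE flag sets reported for a package.

    `submissions` is a hypothetical list of dicts mapping a package name
    to the list of USE flags enabled on the submitting system.
    """
    counts = Counter()
    for sub in submissions:
        flags = sub.get(package)
        if flags is not None:
            # frozenset so that flag order does not matter
            counts[frozenset(flags)] += 1
    return [(sorted(flags), n) for flags, n in counts.most_common(top)]


# Made-up example data
submissions = [
    {"net-misc/asterisk": ["a", "b", "c"]},
    {"net-misc/asterisk": ["c", "b", "a"]},
    {"net-misc/asterisk": ["a"]},
    {"dev-lang/python": ["ssl"]},
]
print(popular_use_configs(submissions, "net-misc/asterisk"))
```

Each returned flag set could then be fed to whatever runs the compile
tests.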
I point you to RHEL - why are people willing to pay for RHEL? What
do they get for that buck? Because I promise you, the support I get
from fellow Gentoo'ers FAR outweighs the support I have ever gotten from
(paid for) RHEL. Most of the time.

I myself used to run 500+ Gentoo hosts more than 15 years back. It was
fun. I was also a student back then so had much more time on my hands
than I do now. It was challenging, and fun to try and get things to
work exactly the way we envisioned they should. I promise you, if what
Michał proposes had been available to me back then, firstly to keep
track of my own internal assets, and secondly to submit stats upstream
to help improve Gentoo, I would not have hesitated for 10 seconds.

And there I touch on a point I'm trying to make - this should be
something that not only helps devs, but brings benefit to users. I'll
say more on this at the end of the email (possibly force users to run
some of their own infra for this at least, but these stats form the
framework for a multi-system management system too, potentially). First
I'd like to pay more attention to the individual points raised by Michał.

On 2020/04/26 10:08, Michał Górny wrote:

> Hi,
>
> The topic of rebooting gentoostats comes here from time to time. Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems. I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it. However, I'd like to
> get a clear plan on how it should be done if someone actually does it.

My time is also limited, but I would love to be involved in some way or
another.

> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove. Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
> a. list of installed packages?
> b. versions (or slots?) of installed packages?
> c. USE flags on installed packages?
> d. world and world_sets files
> e. system profile?
> f. enabled repositories? (possibly filtered to official list)
All of the above. Including exact versions and USE flags for each
package. Also, I'm sure there are others, but I sometimes have systems
that fall behind on certain packages, either by no longer being included
from world or for other reasons (eg, a specific SLOT that no longer
updates for some reason, although this situation has improved).
> g. distribution?
/etc/gentoo-release?

Yes, I think so, that partially deals with your "derivative distributions".

h. date+time of last successful emerge --sync (probably individually
for each repository).
i. /var/log/emerge.log
j. hardware data, eg, amount of RAM, CPU clock speed/cores, disks.
k. hostname + other network info (IP address).

i - build failures might be helpful. Might be useful to get exact
merge times assuming that users want some extra features for user
benefit, not gentoo dev benefit.
j,k - definitely not of use to devs, but possibly to users as a form of
"hardware inventory".

Much of this is definitely not data that we want/need, but if the data
gets proxied, then we and our users can use this as a form of inventory
management system too.

> I think d. is most important as it gives us information on what users
> really want. a. alone is kinda redundant if we have d. c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
>
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
>
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.
I agree with all of this.
> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.
Don't think so, but see your own point 2.
>
>
> 2. How to handle Gentoo derivatives? Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages). One possible option would be to filter a.-e.
> to stuff coming from ::gentoo.

It may be of benefit to know which ::gentoo packages they are using, and
if we make the code available to those distributions as a form of
proxy/peer, then any hosts that submit directly to Gentoo we could
dispatch to that distribution's infra, or if we're really nice, just
keep it and strip out the packages we don't maintain (ie, not ::gentoo
or official repositories).

>
>
> 3. How to keep the data up-to-date? After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results. I suppose
> we'll need to timestamp all data and remove old entries.

My opinion on this: an automated cron job that dispatches daily, or at
least weekly. Daily provides better granularity for some other ideas
aimed at system administrators. Eg, when did what change? I shove /etc
into git for this reason alone, with a nightly cron to commit everything
and push it to a remote server, which also serves as a form of
configuration backup.

>
>
> 4. How to avoid duplication? If some users submit their results more
> often than others, they would bias the results. 3. might be related.

I think this directly relates to spam. So I fully agree with the UUID
per installation concept. But then systems get cloned (our labs used to
be updated on a single machine, then we utilized udpcast to update the
rest of the systems, so they would all end up with the same UUID). So
the primary purpose of the UUID is to identify the origin of the
installation, but it can be trivially manipulated, either by
force-generating a new UUID or by copying one from another machine.

I think we need to add a secondary, hardware based identifier.

Digium (now Sangoma) checks for all MAC addresses for ethX, starting
from 0 until the ioctl gets a failure. If eth0 fails, it basically does
"ip ad sh" and ends up including the same MAC multiple times, and in
arbitrary order since the NICs aren't guaranteed to be detected in the
same order on every boot. This (or a related) method could work: so
generate some unique hardware-based identifier, then hash it using say
SHA-256 or BLAKE2 to generate something which can't be trivially
reversed back to the original identifier? Why ... well, anonymity :).
We could even include the configured or dhcp obtained hostname into this.

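A minimal Python sketch of what I have in mind, assuming the MAC
addresses have already been collected somehow (how to collect them is
left open here). The function name and the choice to sort and
de-duplicate are my own assumptions, intended to avoid the
duplicate/ordering problem described above.

```python
import hashlib


def hardware_hash(macs, hostname=""):
    """Derive a stable, hashed identifier from a list of MAC addresses.

    Hypothetical helper: MACs are normalised, de-duplicated and sorted so
    that detection order and repeated entries do not change the result.
    """
    unique = sorted({m.strip().lower() for m in macs if m.strip()})
    material = "\n".join(unique + [hostname])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Made-up addresses; same hardware reported in a different order
a = hardware_hash(["AA:BB:CC:00:11:22", "aa:bb:cc:00:11:22", "00:11:22:33:44:55"])
b = hardware_hash(["00:11:22:33:44:55", "aa:bb:cc:00:11:22"])
print(a == b)
```

Note that MAC addresses are fairly guessable, so a plain hash is
probably only "not trivially reversible" rather than truly anonymous;
mixing in something like the hostname or a local secret may help.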
> 5. How to handle clusters? Things are simple if we can assume that
> people will submit data for a few distinct systems. But what about
> companies that run 50 Gentoo machines with the same or similar setup?
> What about clusters of 1000 almost identical containers? Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.
Assuming they do what we did ... they'd probably (hopefully) all end up
with the same (installation time?) UUID but different hardware
identifiers. So we'd be able to identify them ... and, as an enterprise
idea, report back to those admins (assuming they registered these
systems to their profile) that their clusters have discrepancies.

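
As a rough illustration of the cluster idea, the Python sketch below
groups submissions by UUID and, where one UUID shows up with several
hardware hashes, lists what each host has that is not common to the
whole group. The field names ("uuid", "hw_hash", "packages") are
hypothetical.

```python
from collections import defaultdict


def find_clusters(submissions):
    """Group submissions by UUID and report per-host discrepancies."""
    by_uuid = defaultdict(dict)
    for sub in submissions:
        by_uuid[sub["uuid"]][sub["hw_hash"]] = frozenset(sub["packages"])

    clusters = {}
    for uuid, hosts in by_uuid.items():
        if len(hosts) < 2:
            continue  # a single system, not a cluster
        common = frozenset.intersection(*hosts.values())
        clusters[uuid] = {
            hw: sorted(pkgs - common)
            for hw, pkgs in hosts.items()
            if pkgs - common
        }
    return clusters


# Made-up example: three cloned hosts, one of which has drifted
subs = [
    {"uuid": "u1", "hw_hash": "h1", "packages": ["a-1", "b-1"]},
    {"uuid": "u1", "hw_hash": "h2", "packages": ["a-1", "b-1"]},
    {"uuid": "u1", "hw_hash": "h3", "packages": ["a-1", "b-2"]},
    {"uuid": "u2", "hw_hash": "h4", "packages": ["a-1"]},
]
print(find_clusters(subs))
```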
>
>
> 6. Security. We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.

Agreed. But some of this may have particular benefit for system
administrators, so perhaps a secondary level of opt-in for providing
data that is "potentially sensitive" should the Gentoo infra get
compromised. We could perhaps store a raw blob for these users that
only gets decrypted by some key that only they should have/possess.

Or, we could proxy the data, let the sensitive stuff travel to the
proxy/aggregator, and strip that from going higher up. And they simply
generate those reports locally on their proxy/aggregator.

>
>
> 7. Privacy. Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them. If we
> don't respect privacy of our users, we won't get them to submit data.

I'm happy with blind UUID + HW-related-hash submission, without
any further data, but would really appreciate it if users were willing
to register. This would have the following benefits IMHO:

They could subscribe for news items that affect them.
They could subscribe for receiving GLSAs for packages that affect their
systems.
They could get a view of all their systems from a central "management"
interface.

I have a need to be able to ask the asterisk users on Gentoo what they
need/want. As it stands, I'm suffering from "user blindness". Again, I
have my own needs, and I scratch those, but helping others to get their
needs scratched is a good thing. If you don't want to participate,
that's fine, but if you do, you get to reap the benefit. Towards this
end, a further future step may be to enable users to anonymously submit
requests via the system. Or we could get anonymous feedback from users
from whom we'd normally not get any. So if the core infra on this has
email addresses for all users, it could send out the email on-behalf-of
the package maintainer, and feedback could then be submitted via some
anonymous mechanism (eg, link in email that takes the user to a
submissions page, and we explicitly don't encode per-recipient
cookie-style data into the link). An idea.

>
>
> 8. Spam protection. Finally, the service needs to be resilient to being
> spammed with fake data. Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.

Data only gets included after being kept up to date for a period of at
least X days. This is based on the generated UUID + HW-Hash. UUID is
(optionally but ideally) linked to a user profile. HW-Hash is just to
identify unique systems.

Data that doesn't get kept up to date could be filtered out after Y
days, where Y <= X. That way a spammer would at least need to take the
effort of keeping his spamming effort going for X number of days with
however many unique (trivially spoofable) identifiers. So we don't deny
that it can be done, I'm just not sure we care?

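A small Python sketch of the X/Y filtering, under the simplifying
assumption that "kept up to date for X days" can be approximated by
"first seen at least X days ago and last seen within the past Y days".
The record layout and names are made up for illustration.

```python
from datetime import date, timedelta


def counted_systems(records, today, min_age_days, max_stale_days):
    """Return the (uuid, hw_hash) pairs whose data should be counted.

    `records` is a hypothetical mapping of (uuid, hw_hash) to a
    (first_seen, last_seen) pair of dates.
    """
    result = []
    for ident, (first_seen, last_seen) in records.items():
        old_enough = today - first_seen >= timedelta(days=min_age_days)   # X
        fresh = today - last_seen <= timedelta(days=max_stale_days)       # Y
        if old_enough and fresh:
            result.append(ident)
    return sorted(result)


today = date(2020, 5, 1)
records = {
    ("u1", "h1"): (date(2020, 4, 1), date(2020, 4, 30)),   # established, fresh
    ("u2", "h2"): (date(2020, 4, 28), date(2020, 4, 30)),  # too new
    ("u3", "h3"): (date(2020, 3, 1), date(2020, 4, 10)),   # gone stale
}
print(counted_systems(records, today, min_age_days=14, max_stale_days=7))
```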
Other than me, who would benefit from spoofing stats for asterisk, for
example? Perhaps someone with a grudge? But they have my email address
anyway ... so can do far worse than generate a few spoofed submissions.

> My (partial) implementation idea
> ================================
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle. This means no correlation and no tracking.

I both agree and disagree. The most basic premise should be no
tracking/correlation unless the user specifically requests it for
specific functionality (eg, emailing of relevant GLSAs/news items, or a
single platform for viewing my hosts and what their status is).

> Once the tool is installed, the user needs to opt-in to using it. This
> involves accepting a privacy policy and setting up a cronjob. The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).

As above, I'd do this as part of accepting a license that states that
by accepting it you accept the most basic submission of stats in an
anonymous manner, including only the most basic identifier information
needed to identify unique systems.

> The submission would contain only raw data, without any identification
> information. It would be encrypted using our public key. Once
> uploaded, it would be put into our input queue as-is.

Correct. Explicit action required to register UUID to user profile. If
that is an option.

Eg, gentoo-stat --link-to jaco@×××××××.za

Then prompt for my password, which I then need to enter in order to link
the UUID of the current system to my registered profile.

So completely anonymous, with minimum data, unless specifically
configured otherwise.

>
> Periodically the input queue would be processed in bulk. The individual
> statistics would be updated and the input would be discarded. This
> should prevent people trying to correlate changes in statistics with
> individual uploads.

Ok. This makes sense. As a sysadmin I'd like that data to be
available for say 30 to 60 or even 90 days, or at least "what changed
from submission X to X+1 spanning the period", because then if something
breaks, I can ask "when did it break?" and then I can ask the stats
system "what changed on the related systems around that time?". At
Gentoo core infra level, we can potentially discard as soon as
processed, but depending on the algorithm we may need to keep at least
the latest submitted copy for Y number of days (as defined above).

Ok, yes, I can do that by working through /var/log/emerge.log as well,
or genlop -l, but I need to do that system by system. If I have an
environment of 500 hosts this gets tedious. Or what if I'd like to find
what differs between a set of hosts where a feature X works, and others
where it doesn't?

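The "what changed from submission X to X+1" question boils down to a
simple diff. A minimal Python sketch, assuming each submission can be
reduced to a dict of package name to version (the versions below are
made up); the same function would work for comparing two hosts.

```python
def diff_packages(old, new):
    """Compare two {package: version} snapshots.

    Returns (added, removed, changed); `changed` maps a package to its
    (old_version, new_version) pair.
    """
    added = {p: new[p] for p in new.keys() - old.keys()}
    removed = {p: old[p] for p in old.keys() - new.keys()}
    changed = {
        p: (old[p], new[p])
        for p in old.keys() & new.keys()
        if old[p] != new[p]
    }
    return added, removed, changed


# Hypothetical snapshots of one host on two consecutive days
before = {"net-misc/asterisk": "16.9.0", "dev-libs/openssl": "1.1.1f",
          "app-misc/screen": "4.8.0"}
after = {"net-misc/asterisk": "16.10.0", "dev-libs/openssl": "1.1.1f",
         "app-misc/tmux": "3.1"}
added, removed, changed = diff_packages(before, after)
print(added, removed, changed)
```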
>
> What do you think? Do you foresee other problems? Do you have other
> needs? Can you think of better solutions?
>
I think we should build a hierarchy. So Gentoo-infra at the top. End
users may submit only certain types of data there; all other data, which
we as devs don't care about, gets discarded, and if we allow users to
register there directly we limit the functionality thereof in order to
maintain the requirements of the developers here first and foremost.

As such, the submitted package should be based on "data sets" in my
opinion, where the most basic sets could be:

core:
a) package list including versions and use flags
b) world and world_sets
c) uuid
d) hash(hardware ident)

hardware:
a) RAM
b) ...

network:
a) ...

At the Gentoo-infra layer we can then have a policy that we ONLY accept
"core" sets. It should be easy to define your own sets at the
proxy/aggregator level, and to provide mechanisms to obtain the data (or
as plugins on the hosts themselves, eg, USE="hardware network"
gentoo-stats-plugins style), with the main package only containing what
the devs need. Just ideas.

Further down the hierarchy additional sets could be defined, and
proxy/aggregator hosts could define what information they allow higher
up the hierarchy.

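Sketching the filtering step in Python, on the assumption that a
submission is simply a dict keyed by data set name (the keys and field
names below are invented for the example): each level of the hierarchy
would forward only the sets it is configured to allow.

```python
def filter_sets(submission, allowed=("core",)):
    """Keep only the data sets that may travel further up the hierarchy."""
    return {name: data for name, data in submission.items() if name in allowed}


# Hypothetical submission as seen by a proxy/aggregator
submission = {
    "core": {"uuid": "u1", "hw_hash": "h1", "packages": ["a-1"]},
    "hardware": {"ram_mb": 16384},
    "network": {"hostname": "host1"},
}

# What the proxy would pass on to Gentoo-infra under a "core only" policy
print(filter_sets(submission))
```

A proxy/aggregator would keep the full submission locally for its own
reports and only send the filtered result upstream.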
If we receive information for a Gentoo derivative we redirect it to that
distribution. Although for such a case we really should provide a way
for derivatives to specify their own "default" infra.

Other projects can then build on top of, or as plug-ins of, the core
stats project to provide the more enterprise-like features. One
could potentially even go as far as automated updating driven from a
central control server in a networked environment where the
proxy/aggregator is able to connect back to the individual hosts to
execute commands on them.

I sincerely hope my ramblings haven't been completely off point. I
believe the above shows that this can be of benefit to users and
developers alike, and hopefully in a way that does not infringe on
users' rights or privacy.

One thing could be for aggregators to submit aggregated stats instead of
individual systems. Again, the same X and Y stuff would apply; however,
I think for aggregated submissions the data skew risk becomes even
larger. So perhaps we should provide two sets of stats, "excluding
aggregated stats" and including, or possibly we can mark some
aggregators as trusted. I dunno.

Kind Regards,
Jaco