Gentoo Archives: gentoo-dev

From: Jaco Kroon <jaco@××××××.za>
To: gentoo-dev@l.g.o, "Michał Górny" <mgorny@g.o>
Subject: Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Date: Tue, 05 May 2020 18:04:36
Message-Id: 93fdada0-857e-67f4-553b-e4aa81cf3373@uls.co.za
In Reply to: [gentoo-dev] [RFC] Ideas for gentoostats implementation by "Michał Górny"
1 Hi Michał, and the rest of the Gentoo devs,
2
3 I've been patiently sitting and watching this discussion.
4
5 I raised some ideas with another developer (Not Michał) just days before
6 he raised this thread to the ML.
7
8 I believe all points raised so far are valid, I'll try to summarise:
9
10 1.  This must be completely *opt in*.
11 2.  Anonymity was discussed by various parties (privacy).
12 3.  "spam" protection (ie, preventing bogus data from entering).
13 4.  Trustworthiness of data.
14 5.  Acceptance of some form of privacy policy.
15
16 In my opinion, points 2 and 3 work against each other, in that if
17 registration is compulsory if you would like to submit stats, then we
18 can control the spam more easily (not foolproof), but requiring
19 registration also raises the entry barrier.  I'd be completely willing
20 to provide at least an email address as part of a submission.
21
22 All of the replies seem to have focused purely on yes/no, do it or
23 don't.  Not many have addressed the benefits to end users/system
24 administrators.  It seems the focus is on what we as developers can get
25 out of this.
26
27 Regarding the above points:
28
29 1.  I fully agree.  This should not be forced on anyone.
30 2.  Happy to concede that some people may wish to submit anonymously. 
31 Let them.
32 3.  I'll address this below.
33 4.  A lot of the discussion has been around the usefulness of the data,
34 and I concede to Thomas that this may (or may not) generate "decision
35 blind spots" or as per "artificially increase decision certainty".  I
36 don't see how this is worse than what we've got now.
37 5.  We have the infrastructure for this already by way of licenses.  So
38 we ship with "GPLv2/3/whatever + GentooPrivacy", and users have to first
39 take explicit action to accept GentooPrivacy.
40
41 I have some other ideas around this, which will tread even further on
42 privacy, but again, all of this should be a kind of opt-in, and building
43 on the ideas by Kent where he suggested a form of submission proxy
44 (STATS_SERVER), we could potentially give the full benefit of the code
45 to such entities, but then still allow them to submit "upstream" in a
46 more filtered manner.
47
48 Bottom line, in my opinion:  Any data is better than no data!
49
50 Whilst we can't say "no one is using xyz", we will at least be able to
51 say "hey, some people are using xyz", and whilst this may generate some
52 blind spots, it at least enables us to test known use cases during
53 test-builds, eg, we know for a fact a thousand users are using package X
54 with USE flags "-* a b c", so we should definitely run that as a compile
55 test.  Your build breaks frequently?  Would you mind submitting stats? 
56 Great, thank you.  You're not willing to do that?  Then my stance becomes one
57 of "ok, I'll help where I can, but really, please consider letting us help
58 you, if you submit stats we can pre-emptively at least include build
59 tests for your specific USE flags." - and again, this means we can
60 actually have our tooling use these stats to generate build tests for
61 the "known popular" configs.
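
As a rough sketch of how tooling might pick the "known popular" configs
to build-test: the submission format below (a dict mapping package name
to its enabled USE flags) and the function name are assumptions made
purely for illustration, nothing like this exists yet.

```python
from collections import Counter


def popular_use_configs(submissions, package, top=3):
    """Count the distinct USE-flag sets reported for one package.

    `submissions` is a hypothetical list of dicts mapping a package name
    to an iterable of enabled USE flags.
    """
    counts = Counter(
        frozenset(sub[package]) for sub in submissions if package in sub
    )
    # Most frequently reported flag sets first.
    return [(sorted(flags), n) for flags, n in counts.most_common(top)]


submissions = [
    {"net-misc/asterisk": ["a", "b", "c"]},
    {"net-misc/asterisk": ["c", "b", "a"]},  # same set, different order
    {"net-misc/asterisk": ["a"]},
    {"app-editors/vim": ["python"]},
]
print(popular_use_configs(submissions, "net-misc/asterisk"))
```

Each returned flag set could then seed one compile test for that package.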
62
63 I point you to RHEL - why are people willing to pay for RHEL?  What
64 do they get for that buck?  Because I promise you, the support I get
65 from fellow Gentoo'ers FAR outweighs the support I have ever gotten from
66 (paid for) RHEL.  Most of the time.
67
68 I myself used to run 500+ Gentoo hosts more than 15 years back.  It was
69 fun.  I was also a student back then so had much more time on my hands
70 than I do now.  It was challenging, and fun to try and get things to
71 work exactly the way we envisioned it should.  I promise you, if what
72 Michał proposes was available for me back then to firstly keep track of
73 my own internal assets, and to submit stats upstream to help improve
74 Gentoo I would not have hesitated for 10 seconds.
75
76 And there I touch on a point I'm trying to make - this should be
77 something that not only helps devs, but brings benefit to users.  I'll
78 say more on this at the end of the email (possibly force users to run
79 some of their own infra for this at least, but these stats form the
80 framework for a multi-system management system too, potentially).  First
81 I'd like to pay more attention to the individual points raised by Michał.
82
83 On 2020/04/26 10:08, Michał Górny wrote:
84
85 > Hi,
86 >
87 > The topic of rebooting gentoostats comes here from time to time. Unless
88 > I'm mistaken, all the efforts so far were superficial, lacking a clear
89 > plan and unwilling to research the problems. I'd like to start
90 > a serious discussion focused on the issues we need to solve, and propose
91 > some ideas how we could solve them.
92 >
93 > I can't promise I'll find time to implement it. However, I'd like to
94 > get a clear plan on how it should be done if someone actually does it.
95
96 My time is also limited, but I would love to be involved in some way or
97 another.
98
99 > The big questions
100 > =================
101 > The way I see it, the primary goal of the project would be to gather
102 > statistics on popularity of packages, in order to help us prioritize our
103 > attention and make decisions on what to keep and what to remove. Unlike
104 > Debian's popcon, I don't think we really want to try to investigate
105 > which files are actually used but focus on what's installed.
106 >
107 > There are a few important questions that need to be answered first:
108 >
109 > 1. Which data do we need to collect?
110 >
111 > a. list of installed packages?
112 > b. versions (or slots?) of installed packages?
113 > c. USE flags on installed packages?
114 > d. world and world_sets files
115 > e. system profile?
116 > f. enabled repositories? (possibly filtered to official list)
117 All of the above.  Including exact versions and USE flags for each
118 package.  Also, I'm sure there are others, but I sometimes have systems
119 that fall behind on certain packages, either by no longer being included
120 from world or for other reasons (eg, a specific SLOT that no longer
121 updates for some reason, although this situation has improved).
122 > g. distribution?
123 /etc/gentoo-release?
124
125 Yes, I think so, that partially deals with your "derivative distributions".
126
127 h.  date+time of last successful emerge --sync (probably individually
128 for each repository).
129 i.  /var/log/emerge.log
130 j.  hardware data, eg, amount of RAM, CPU clock speed/cores, disks.
131 k.  hostname + other network info (IP address).
132
133 i - build failures might be helpful.  Might be useful  to get exact
134 merge times assuming that users want some extra features for user
135 benefit, not gentoo dev benefit.
136 j,k - definitely not of use to devs, but possibly to users as a form of
137 "hardware inventory".
138
139 Much of this is definitely not data that we want/need, but if the data
140 gets proxied, then we and our users can use this as a form of inventory
141 management system too.
142
143 > I think d. is most important as it gives us information on what users
144 > really want. a. alone is kinda redundant if we have d. c. might have
145 > some value when deciding whether to mask a particular flag (and implies
146 > a.).
147 >
148 > e. would be valuable if we wanted to determine the future of particular
149 > profiles, as well as e.g. estimate the transition to new versions.
150 >
151 > f. would be valuable to determine which repositories are used but we
152 > need to filter private repos from the output for privacy reasons.
153 I agree with all of this.
154 > g. could be valuable in correlation with other data but not sure if
155 > there's much direct value alone.
156 Don't think so, but see your own point 2.
157 >
158 >
159 > 2. How to handle Gentoo derivatives? Some of them could provide
160 > meaningful data but some could provide false data (e.g. when derivatives
161 > override Gentoo packages). One possible option would be to filter a.-e.
162 > to stuff coming from ::gentoo.
163
164 It may be of benefit to know which ::gentoo packages they are using, and
165 if we make the code available to those distributions as a form of
166 proxy/peer, then any hosts that submit directly to Gentoo we could
167 dispatch to that distribution's infra, or if we're really nice, just
168 keep it and strip out the packages we don't maintain (ie, not ::gentoo
169 or official repositories).
170
171 >
172 >
173 > 3. How to keep the data up-to-date? After all, if we just stack a lot
174 > of old data, we will soon stop getting meaningful results. I suppose
175 > we'll need to timestamp all data and remove old entries.
176
177 My opinion on this: an automated cron job that dispatches daily.  At least
178 weekly.  Daily provides better granularity for some other ideas aimed at
179 system administrators.  Eg, when did what change?  I shove /etc into git
180 for this reason alone with a nightly cron to commit everything and push
181 it to a remote server, also serves as a form of configuration backup.
182
183 >
184 >
185 > 4. How to avoid duplication? If some users submit their results more
186 > often than others, they would bias the results. 3. might be related.
187
188 I think this directly relates to SPAM.  So I fully agree with the UUID
189 per installation concept.  But then systems get cloned (our labs used to
190 be updated on a single machine, then we utilized udpcast to update the
191 rest of the systems, so they would all end up with the same UUID).  So
192 the primary purpose of this is to find the origin of the installation,
193 but can be trivially bypassed either by force generating a new UUID, or
194 copying from other machines, so this can be trivially manipulated.
195
196 I think we need to add a secondary, hardware based identifier.
197
198 Digium (now Sangoma) checks all MAC addresses for ethX, starting
199 from 0 until the ioctl fails.  If eth0 fails, it basically does
200 "ip ad sh" and ends up including the same MAC multiple times, and in
201 arbitrary order since the NICs aren't guaranteed to be detected in the
202 same order on every boot.  This (or a related) method could work, so
203 generate some unique hardware-based identifier, then hash it using say
204 SHA-256 or BLAKE2 to generate something which can't be trivially
205 reversed back to the original identifier?  Why ... well, anonymity :). 
206 We could even include the configured or dhcp obtained hostname into this.
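
A minimal sketch of such a hardware-based identifier, assuming the MAC
addresses have already been collected somehow (on Linux,
/sys/class/net/<iface>/address is one possible source).  The function
name and the choice of BLAKE2b with a 32-byte digest are mine, not an
agreed design.  Sorting and de-duplicating first avoids the ordering
and repeated-MAC problems described above.

```python
import hashlib


def hardware_hash(macs, hostname=""):
    """Hash a collection of MAC addresses, optionally with a hostname.

    Duplicates are dropped and the remainder sorted, so NIC detection
    order and repeated entries do not change the result.
    """
    normalised = sorted({m.strip().lower() for m in macs if m.strip()})
    material = "\n".join(normalised + [hostname]).encode("utf-8")
    return hashlib.blake2b(material, digest_size=32).hexdigest()


# Same NICs, different order and a duplicate: the identifier is stable.
first = hardware_hash(["AA:BB:CC:00:11:22", "00:11:22:33:44:55"])
second = hardware_hash(
    ["00:11:22:33:44:55", "aa:bb:cc:00:11:22", "aa:bb:cc:00:11:22"]
)
print(first == second)
```

One caveat: the input space of a single MAC is small enough to brute
force, so a plain hash only hides the identifier from casual inspection.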
207
208 > 5. How to handle clusters? Things are simple if we can assume that
209 > people will submit data for a few distinct systems. But what about
210 > companies that run 50 Gentoo machines with the same or similar setup?
211 > What about clusters of 1000 almost identical containers? Big entities
212 > could easily bias the results but we should also make it possible for
213 > them to participate somehow.
214 Assuming they do what we did ... they'd probably (hopefully) all end up
215 with the same (installation time?) UUID but different hardware
216 identifiers.  So we'd be able to identify them ... an enterprise idea:
217 report back to those admins (assuming they registered these systems to
218 their profile) that their clusters have discrepancies.
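
To illustrate the discrepancy report: a sketch that groups submissions
by installation UUID and lists packages whose versions differ between
hosts sharing that UUID.  The submission layout ("uuid", "hw_hash",
"packages") is hypothetical, and a package that is missing entirely on
some hosts is not flagged by this simple version.

```python
from collections import defaultdict


def cluster_discrepancies(submissions):
    """Return {uuid: {package: [versions]}} for packages that differ
    between hosts sharing the same installation UUID."""
    clusters = defaultdict(lambda: defaultdict(set))
    for sub in submissions:
        for pkg, version in sub["packages"].items():
            clusters[sub["uuid"]][pkg].add(version)

    report = {}
    for uuid, pkgs in clusters.items():
        differing = {p: sorted(v) for p, v in pkgs.items() if len(v) > 1}
        if differing:  # only report clusters that actually diverge
            report[uuid] = differing
    return report


submissions = [
    {"uuid": "u1", "hw_hash": "h1",
     "packages": {"dev-libs/openssl": "1.1.1g", "sys-apps/portage": "2.3.99"}},
    {"uuid": "u1", "hw_hash": "h2",
     "packages": {"dev-libs/openssl": "1.1.1d", "sys-apps/portage": "2.3.99"}},
]
print(cluster_discrepancies(submissions))
```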
219 >
220 >
221 > 6. Security. We don't want to expose information that could be
222 > correlated to specific systems, as it could disclose their
223 > vulnerabilities.
224
225 Agreed.  But some of this may have particular benefit for system
226 administrators, so perhaps a secondary level of opt-in for providing
227 "potentially sensitive data" if the Gentoo infra gets compromised.  We
228 could perhaps store a raw blob for these users that only gets decrypted
229 by some key that only they should have/possess.
230
231 Or, we could proxy the data, let the sensitive stuff travel to the
232 proxy/aggregator, and strip that from going higher up.  And they simply
233 generate those reports locally on their proxy/aggregator.
234
235 >
236 >
237 > 7. Privacy. Besides the above, our sysadmins would appreciate if
238 > the data they submitted couldn't be easily correlated to them. If we
239 > don't respect privacy of our users, we won't get them to submit data.
240
241 I'm happy with either blind UUID + HW-related-hash submission, without
242 any further data, but would really appreciate if users are willing to
243 register.  This would have the following benefits IMHO:
244
245 They could subscribe for news items that affect them.
246 They could subscribe for receiving GLSAs for packages that affect their
247 systems.
248 They could get a view of all their systems from a central "management"
249 interface.
250
251 I have a need to be able to ask the asterisk users on Gentoo what they
252 need/want.  As it stands, I'm suffering from "user blindness".  Again, I
253 have my own needs, and I scratch those, but helping others to get their
254 needs scratched is a good thing.  If you don't want to participate,
255 that's fine, but if you do, you get to reap the benefit.  Towards this
256 end, and perhaps enabling some users to provide some feedback a further
257 future step may be to enable users to anonymously submit requests via
258 the system.  Or we could get anonymous feedback from users from whom
259 we'd normally not get any.  So if the core infra on this has email
260 addresses for all users, it could send out the email on-behalf-of the
261 package maintainer, and feedback could then be submitted via some
262 anonymous mechanism (eg, link in email that takes the user to a
263 submissions page, and we explicitly don't encode per-recipient
264 cookie-style data into the link).  An idea.
265
266 >
267 >
268 > 8. Spam protection. Finally, the service needs to be resilient to being
269 > spammed with fake data. Both to users who want to make their packages
270 > look more important, and to script kiddies that want to prove a point.
271
272 Data only gets included after being kept up to date for a period of at
273 least X days.  Based on generated UUID + HW-Hash.  UUID is (optionally
274 but ideally) linked to a user profile.  HW-Hash is just to identify
275 unique systems.
276
277 Data that doesn't get kept up to date could be filtered out after Y
278 days, where Y <= X.  That way a spammer would at least need to take the
279 effort of keeping his spamming effort going for X number of days with X
280 number of unique (trivially spoofable) identifiers.  So we don't deny
281 that it can be done, I'm just not sure we care?
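
The X/Y rule could look roughly like this.  It is a simplification
under my own assumptions: the server only remembers the first and the
latest submission date per (UUID, HW-Hash) pair, so gaps in between
are not detected, and the function name is made up.

```python
from datetime import date, timedelta


def is_counted(first_seen, last_seen, today, x_days, y_days):
    """Decide whether a (UUID, HW-Hash) pair contributes to the stats.

    It counts once it has been submitting for at least `x_days`, and
    stops counting when its latest submission is older than `y_days`.
    """
    kept_up_long_enough = (last_seen - first_seen) >= timedelta(days=x_days)
    still_fresh = (today - last_seen) <= timedelta(days=y_days)
    return kept_up_long_enough and still_fresh


today = date(2020, 5, 5)
# Submitting since March, last seen yesterday: counted.
print(is_counted(date(2020, 3, 1), date(2020, 5, 4), today, 30, 7))
# First seen four days ago: not counted yet.
print(is_counted(date(2020, 5, 1), date(2020, 5, 4), today, 30, 7))
# Long history, but silent for over a month: filtered out.
print(is_counted(date(2020, 1, 1), date(2020, 4, 1), today, 30, 7))
```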
282
283 Other than me, who would benefit from spoofing stats for asterisk for
284 example?  Perhaps someone with a grudge?  But they have my email address
285 anyway ... so can do far worse than generate a few spoofed submissions.
286
287 > My (partial) implementation idea
288 > ================================
289 > I think our approach should be oriented on privacy/security first,
290 > and attempt to make the best of the data we can get while respecting
291 > this principle. This means no correlation and no tracking.
292
293 I both agree and disagree.  The most basic premise should be no
294 tracking/correlation unless the user specifically requests it towards
295 specific functionality (eg, emailing of affecting GLSAs/news items,
296 single-platform for viewing my hosts and what their status is).
297
298 > Once the tool is installed, the user needs to opt-in to using it. This
299 > involves accepting a privacy policy and setting up a cronjob. The tool
300 > would suggest a (random?) time for submission to take place periodically
301 > (say, every week).
302
303 As above, I'd do this as part of accepting a license that states by
304 accepting this license you accept the most basic submission of stats in
305 an anonymous manner including only the most basic of identifier
306 information to identify unique systems.
307
308 > The submission would contain only raw data, without any identification
309 > information. It would be encrypted using our public key. Once
310 > uploaded, it would be put into our input queue as-is.
311
312 Correct.  Explicit action required to register UUID to user profile.  If
313 that is an option.
314
315 Eg, gentoo-stat --link-to jaco@×××××××.za
316
317 Then prompt for my password, which I then need to enter in order to link
318 the UUID of the current system to my registered profile.
319
320 So completely anonymous, with minimum data, unless specifically
321 configured otherwise.
322
323 >
324 > Periodically the input queue would be processed in bulk. The individual
325 > statistics would be updated and the input would be discarded. This
326 > should prevent people trying to correlate changes in statistics with
327 > individual uploads.
328
329 Ok.  This makes sense.  As a sysadmin I'd like that data to be
330 available for say 30 to 60 or even 90 days, or at least "what changed
331 from submission X to X+1 spanning the period", because then if something
332 breaks, I can ask "when did it break?" and then I can ask the stats
333 system "what changed on the related systems around that time?".  At
334 Gentoo core infra level, we can potentially discard as soon as
335 processed, but depending on the algorithm we may need to keep at least
336 the latest submitted copy for Y number of days (as defined above).
337
338 Ok, yes, I can do that by working through /var/log/emerge.log as well,
339 or genlop -l, but I need to do that system by system.  If I have an
340 environment of 500 hosts this gets tedious.  Or what if I'd like to find
341 what differs between a set of hosts where a feature X works, and others
342 that don't?
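
For the "what changed from submission X to X+1" question, a sketch of
the comparison a proxy/aggregator could run.  It assumes each stored
submission boils down to a dict of package name to version, which is
my own simplification.

```python
def diff_snapshots(old, new):
    """Compare two package snapshots (package name -> version).

    Returns (added, removed, changed), where `changed` maps a package
    to its (old_version, new_version) pair.
    """
    added = {p: v for p, v in new.items() if p not in old}
    removed = {p: v for p, v in old.items() if p not in new}
    changed = {
        p: (old[p], new[p])
        for p in old.keys() & new.keys()
        if old[p] != new[p]
    }
    return added, removed, changed


monday = {"net-misc/asterisk": "16.9.0", "dev-libs/openssl": "1.1.1d"}
tuesday = {"net-misc/asterisk": "16.10.0", "app-misc/screen": "4.8.0"}
print(diff_snapshots(monday, tuesday))
```

The same function answers the second question too: pass one host where
feature X works as `old` and one where it doesn't as `new`.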
343
344 >
345 > What do you think? Do you foresee other problems? Do you have other
346 > needs? Can you think of better solutions?
347 >
348 I think we should build a hierarchy.  So Gentoo-infra at the top.  End
349 users may submit only certain types of data there, all other data we as
350 devs don't care about gets discarded, and if we allow users to register
351 there directly we limit the functionality thereof in order to maintain
352 the requirements of the developers here first and foremost.
353
354 As such, the submitted package should be based on "data sets" in my
355 opinion, where the most basic sets could be:
356
357 core:
358   a) package list including versions and use flags
359   b) world and world_sets
360   c) uuid
361   d) hash(hardware ident)
362
363 hardware:
364   a) RAM
365   b) ...
366
367 network:
368   a) ...
369
370 At the Gentoo-infra layer we can then have a policy that we ONLY accept
371 "core" sets.  It should be easy to define your own sets at the
372 proxy/aggregator level, and to provide mechanisms to obtain the data
373 (or as plugins on the hosts themselves, eg, USE="hardware network"
374 gentoo-stats-plugins style), with the main package only containing what
375 the devs need.  Just ideas.
376
377 Further down the hierarchy additional sets could be defined, and
378 proxy/aggregator hosts could define what information they allow higher
379 up the hierarchy.
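
A sketch of that per-level filtering, assuming a submission is simply
a dict keyed by set name.  The set names follow the list above; the
policy constants and the function are hypothetical.

```python
# What each level passes upwards; everything else stays local.
GENTOO_INFRA_ACCEPTS = {"core"}
COMPANY_PROXY_FORWARDS = {"core"}


def filter_sets(submission, allowed):
    """Keep only the data sets that the next level up accepts."""
    return {name: data for name, data in submission.items() if name in allowed}


submission = {
    "core": {"uuid": "u1", "hw_hash": "h1", "packages": {}},
    "hardware": {"ram_mb": 16384},
    "network": {"hostname": "web01"},
}

# The proxy keeps the full submission for local reports and forwards
# only the permitted sets upstream.
upstream = filter_sets(submission, COMPANY_PROXY_FORWARDS)
print(sorted(upstream))
```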
380
381 If we receive information for a Gentoo derivative we redirect it to that
382 distribution.  Although for such a case we really should provide a way
383 for derivatives to specify their own "default" infra.
384
385 Other projects can then build on top of, or as plug-ins of the core
386 stats project to then provide the more enterprise-like features.  One
387 could potentially even go as far as automated updating driven from a
388 central control server in a networked environment where the
389 proxy/aggregator is able to connect back to the individual hosts to
390 execute commands on them.
391
392 I sincerely hope my ramblings haven't been completely off point.  I
393 believe the above shows that this can be of benefit to users and
394 developers alike, and hopefully in a way that does not infringe on
395 users' rights or privacy.
396
397 One thing could be for aggregators to submit aggregated stats instead of
398 individual systems, again, same X and Y stuff would apply, however, I
399 think for aggregated submissions the data skew risk becomes even
400 larger.  So perhaps we should provide two sets of stats "excluding
401 aggregated stats" and including, or possibly we can mark some
402 aggregators as trusted.  I dunno.
403
404 Kind Regards,
405 Jaco