Gentoo Archives: gentoo-soc

From: Sebastian Pipping <webmaster@××××××××.org>
To: gentoo-soc@l.g.o
Subject: [gentoo-soc] Gentoo stats gathering vs. privacy protection (was Re: About "Create and release a Gentoo stats server/client")
Date: Tue, 07 Apr 2009 04:03:31
Message-Id: 49DACFE5.3060402@hartwork.org
In Reply to: Re: [gentoo-soc] About "Create and release a Gentoo stats server/client" by Fabian Groffen
1 Hello again!
2
3
4 A few fresh thoughts on data privacy protection and future Gentoo stats
5 gathering from talking to an expert earlier today. If it doesn't make
6 sense, it's my fault not his ;-)
7
8 I'm sure you will have additions and corrections to this.
9 Go ahead, I want you to.
10
11
12 == Simplified overview ==
13 The data we intend to collect is bound to machines. If every machine
14 submitting information sends some unique identifier, we know that a
15 machine with that identifier has the properties submitted. As we won't
16 store IPs, nobody with the collected data alone will be able to trace a
17 submitted machine config back to the location of the machine, the name
18 of its admin, or the like.
19
20
21 == Exceptions ==
22 That generality breaks at the point where the data you submit can be
23 linked with information from other sources. If any of the data you
24 submit occurs "rarely enough" in the wild, it can allow mapping a
25 machine config back to information about the machine's location or its
26 users.
27
28 Example:
29 If you're the only one who has the only ever produced 645-core CPU
30 running at home, the guy who sold it to you can map your data back
31 to your name and address, provided he's able to find the bill for it.
32
33
34 == Counter-measures ==
35 To reduce this rare-configuration issue, one could list only those
36 pieces of information that have reached a certain minimum, say 25
37 occurrences.
38
39 Example:
40 "Gentoo" would not be showing up in the Operating System
41 section unless at least 25 submissions with OS "Gentoo" have
42 been received.
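
A minimal sketch of that minimum-occurrence rule, in Python. The
function name and the sample data are made up for illustration; only
the threshold of 25 comes from the text above.

```python
from collections import Counter

MIN_OCCURRENCES = 25  # the example minimum named in the text


def publishable_counts(values, minimum=MIN_OCCURRENCES):
    """Return only the values that were submitted at least `minimum` times."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n >= minimum}


# Made-up sample data: "Gentoo" is one submission short of the minimum.
submissions = ["Fedora"] * 40 + ["Gentoo"] * 24 + ["Debian"] * 25
print(publishable_counts(submissions))
```

With this sample, "Gentoo" stays hidden until a 25th submission arrives.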
43
44 Smolt is effectively doing something like that: my Gentoo submission
45 is not shown, though my data shows up in other places on their page.
46 The minimum occurrence of an OS on that page is 27 because only the top 30
47 entries are listed. If I submit 28 fake data sets, Gentoo should show up
48 in the list. Anybody else could do that too and thereby find my
49 Gentoo-OS entry, as he can easily identify his own fake entries.
50 All he'll get, though, is my machine ID, if it's exposed, but that's it.
51 By writing about it here, I add extra information that allows you to resolve
52 that entry back to my person. So that's another example of linking with
53 other information.
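
The padding trick can be sketched like this. It is a toy model, not
Smolt's actual code: the top-N listing, the counts and the names are
all assumptions, scaled down to a top 3 to keep it short.

```python
from collections import Counter


def top_entries(values, n):
    """Return the n most frequent values together with their counts."""
    return Counter(values).most_common(n)


# Made-up data: three common systems and a single rare "Gentoo" entry.
real = ["A"] * 100 + ["B"] * 50 + ["C"] * 27 + ["Gentoo"] * 1
listed = dict(top_entries(real, 3))
print("Gentoo" in listed)  # the rare entry is hidden by the cut-off

# Somebody submits 28 fake entries, one more than the smallest listed count.
fake = ["Gentoo"] * 28
listed_after = dict(top_entries(real + fake, 3))

# He knows how many entries were his own, so the rest must be real ones.
real_count = listed_after["Gentoo"] - len(fake)
print(real_count)
```

So a cut-off alone does not hide a rare entry from somebody willing to
submit fake data.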
54
55
56 == Conclusions ==
57 People who have super-secret custom setups that nobody must know about
58 will not want to give us their data. That's okay. All other people
59 with average desktop machines won't have any rare data to submit.
60
61
62 == Proposal ==
63 - Use machine IDs on the server side to estimate the set of hosts
64 involved and to refresh the data for a machine later
65 - State what data we gather and what we do with it.
66 - Make stat submitters ask themselves whether their setup has
67 anything top-secret to it. An informative and not-over-frightening
68 message is very important at that point.
69 - Allow them to configure that data away from what they submit
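
A rough sketch of how the machine ID and the configure-away part could
fit together. All names here (build_submission, StatsStore, the keys
in the sample data) are hypothetical, and the random UUID is just one
assumed way to get a machine ID.

```python
import uuid


def build_submission(machine_id, collected, excluded_keys):
    """Client side: drop every key the admin configured away, then add the ID."""
    data = {k: v for k, v in collected.items() if k not in excluded_keys}
    return {"machine_id": machine_id, "data": data}


class StatsStore:
    """Server side: one record per machine ID; a resubmission replaces it."""

    def __init__(self):
        self._records = {}

    def submit(self, submission):
        # Only the ID and the data are kept; no IP address is stored.
        self._records[submission["machine_id"]] = submission["data"]

    def host_count(self):
        return len(self._records)


machine_id = str(uuid.uuid4())  # assumed to be generated once per machine
collected = {"arch": "amd64", "cpu_cores": 645, "profile": "desktop"}
excluded = {"cpu_cores"}  # the admin considers this too rare to share

store = StatsStore()
store.submit(build_submission(machine_id, collected, excluded))
# A later run with changed data refreshes the same record.
store.submit(build_submission(machine_id, {**collected, "profile": "server"}, excluded))
print(store.host_count())
```

Submitting twice with the same ID still counts as one host, and the
excluded key never leaves the machine.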
70
71 They keep what they want to keep, we get what we want to get.
72
73
74 What do you think?
75
76
77
78 Sebastian