Hello again!


A few fresh thoughts on data privacy protection and future Gentoo stats
gathering from talking to an expert earlier today. If it doesn't make
sense, it's my fault, not his ;-)

I'm sure you will have additions and corrections to this.
Go ahead, I want you to.


== Simplified overview ==
The data we intend to collect is bound to machines. If every machine
submitting information sends some unique identifier, we know that a
machine with that identifier has the properties submitted. As we won't
store IPs, nobody with the collected data will be able to trace a
submitted machine config back to the location of the machine, the name
of the admin, or the like.
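
To make that concrete, here is a minimal sketch of what a submission
could look like. The function names and the use of a random UUID are my
assumptions for illustration only; nothing above fixes how the
identifier is generated.

```python
import uuid


def new_machine_id() -> str:
    """Return a random identifier that says nothing about the host itself."""
    # uuid4 is built from random bits, not from the hostname or a MAC address.
    return str(uuid.uuid4())


def build_submission(machine_id: str, properties: dict) -> dict:
    """Bundle the machine ID with the submitted properties; no IP is included."""
    return {"machine_id": machine_id, "properties": dict(properties)}


# Hypothetical example properties.
submission = build_submission(new_machine_id(), {"os": "Gentoo", "arch": "amd64"})
```

The server would then only ever see the ID and the properties, which is
the whole point of not storing IPs.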


== Exceptions ==
The point where that generality breaks is where the data you submit can
be linked with information from other sources. If any of the data you
submit occurs "rarely enough" in the wild, it can allow mapping a
machine config back to information about the machine's location or its
users.

Example:
If you're the only one running the only 645-core CPU ever produced
at home, the guy who sold it to you can map your data back
to your name and address, provided he's able to find the bill for it.
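
The linking step in that example can be sketched in a few lines. All
records, field names and the `link` helper below are made up for
illustration; the only thing that matters is that a value occurring
exactly once acts as a join key between two otherwise unrelated data
sets.

```python
# Hypothetical data: what the stats server holds (no names, no IPs) ...
submissions = [
    {"machine_id": "a1", "cpu_cores": 4},
    {"machine_id": "b2", "cpu_cores": 645},
    {"machine_id": "c3", "cpu_cores": 4},
]
# ... and what an outside party (the seller) holds.
sales_records = [{"customer": "J. Doe", "cpu_cores": 645}]


def link(submissions, outside_records, key):
    """Pair outside records with submissions whose value for `key` is unique."""
    matches = []
    for record in outside_records:
        hits = [s for s in submissions if s[key] == record[key]]
        if len(hits) == 1:  # a unique value pins down exactly one machine
            matches.append((record["customer"], hits[0]["machine_id"]))
    return matches


linked = link(submissions, sales_records, "cpu_cores")
```

A common value such as 4 cores matches several machines and therefore
links to nobody in particular.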


== Counter-measures ==
To reduce this rare-configuration issue, one could list only
pieces of information that have reached a certain minimum, say 25
occurrences.

Example:
"Gentoo" would not show up in the Operating System
section unless at least 25 submissions with OS "Gentoo" have
been received.
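
A minimal sketch of such a threshold filter, assuming the values of one
section (here: the OS names) are available as a plain list. The name
`publishable_counts` and the sample numbers are hypothetical; 25 is just
the example minimum from above.

```python
from collections import Counter

MIN_OCCURRENCES = 25  # example threshold from the text, not a fixed decision


def publishable_counts(values, minimum=MIN_OCCURRENCES):
    """Count each value and keep only those seen at least `minimum` times."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n >= minimum}


# Hypothetical submissions: 40 x Fedora, 24 x Gentoo.
reported = ["Fedora"] * 40 + ["Gentoo"] * 24
visible = publishable_counts(reported)
```

With these numbers "Gentoo" stays hidden; one more Gentoo submission
would make it appear.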

Smolt is effectively doing something like that, as my Gentoo submission
is not shown though my data shows up in other places on their page.
The smallest count of any OS listed on that page is 27, because only the
top 30 entries are listed. If I submit 28 fake data sets, it should show
up in the list. Anybody else could do that too and thereby find my
Gentoo-OS entry, as he can identify his own fake entries easily.
All he'll get, though, is my machine ID, if it's exposed, but that's it.
As I am writing about it, I add extra information allowing you to resolve
that entry back to my person. So that's another example of linking with
other information.
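
To illustrate why a top-N list is a weaker protection than it looks,
here is a small simulation. The counts and the list size of 2 are
invented to keep it short (the real page lists 30 entries), and `top_n`
is a hypothetical helper, not Smolt code.

```python
from collections import Counter


def top_n(values, n):
    """Return the n most common values together with their counts."""
    return Counter(values).most_common(n)


# Hypothetical numbers: two popular OSes and one lone real Gentoo submission.
real = ["Fedora"] * 50 + ["Ubuntu"] * 27 + ["Gentoo"] * 1
before = top_n(real, 2)

# 28 fake submissions, as in the text, push the rare entry into the list.
padded = real + ["Gentoo"] * 28
after = top_n(padded, 2)
```

Whoever submitted the 28 fakes knows that the remaining one out of 29
belongs to somebody else.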


== Conclusions ==
People who have super-secret custom setups that nobody must know about
will not want to give us their data. That's okay. All other people
with average desktop machines won't have any rare data to submit.


== Proposal ==
- Use machine IDs on the server side to estimate the set of hosts
  involved and to refresh data for that machine later.
- State what data we gather and what we do with it.
- Make stat submitters ask themselves whether their setup has
  anything top-secret to it. An informative and not-over-frightening
  explanation is very important at that point.
- Allow them to configure away that data from what they submit.
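
A rough sketch of how the first and last points could fit together,
assuming submissions are plain dictionaries. `strip_excluded`,
`StatsStore` and the field names are hypothetical and only show the
idea: the client drops what the user excluded, and the server keys
everything by machine ID so that a resubmission refreshes the entry
instead of counting as a new host.

```python
def strip_excluded(properties: dict, excluded: set) -> dict:
    """Client side: drop every field the user configured away."""
    return {k: v for k, v in properties.items() if k not in excluded}


class StatsStore:
    """Server side: one entry per machine ID, stored in memory for this sketch."""

    def __init__(self):
        self._by_machine = {}

    def submit(self, machine_id: str, properties: dict) -> None:
        # A later submission with the same ID replaces the earlier one.
        self._by_machine[machine_id] = dict(properties)

    def get(self, machine_id: str) -> dict:
        return self._by_machine[machine_id]

    def host_count(self) -> int:
        return len(self._by_machine)


store = StatsStore()
local = {"os": "Gentoo", "cpu_cores": 645, "arch": "amd64"}
store.submit("id-1", strip_excluded(local, {"cpu_cores"}))
store.submit("id-1", strip_excluded(local, {"cpu_cores"}))  # refresh, not a new host
store.submit("id-2", {"os": "Gentoo"})
```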

They keep what they want to keep, we get what we want to get.


What do you think?



Sebastian