Hi everyone,

gentoostats is new to me and I'm not aware of previous discussions or
implementations. But from what I could gather from the comments and
Michał Górny's explanation, I would like to draw your attention to the
Octoverse[1] initiative.
Maybe the collected statistics could come through a platform that
gathers additional metadata from user contributions. What I mean is a
broker that collects all statistics within an organization internally
and then publishes them in aggregate. Such a solution would add value
for enterprise statistics while also contributing the data back to
Gentoo.
Each broker could then use git authentication to publish its results as
a merge request, which would run the necessary hooks on the Gentoo
side. All we would need is a specification document for parsing the
data.
Sorry if my comment is completely out of context, but such an Octoverse
for Gentoo would be very interesting from my perspective.

Best,

Samuel

[1] https://octoverse.github.com/
On 4/26/20 9:08 AM, Michał Górny wrote:
> Hi,
>
> The topic of rebooting gentoostats comes up here from time to time.
> Unless I'm mistaken, all the efforts so far were superficial, lacking
> a clear plan and unwilling to research the problems. I'd like to start
> a serious discussion focused on the issues we need to solve, and
> propose some ideas for how we could solve them.
>
> I can't promise I'll find time to implement it. However, I'd like to
> get a clear plan on how it should be done if someone actually does it.
>
>
> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on the popularity of packages, in order to help us
> prioritize our attention and make decisions on what to keep and what
> to remove. Unlike Debian's popcon, I don't think we really want to
> investigate which files are actually used; we should focus on what's
> installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
> a. list of installed packages?
> b. versions (or slots?) of installed packages?
> c. USE flags on installed packages?
> d. world and world_sets files?
> e. system profile?
> f. enabled repositories? (possibly filtered to the official list)
> g. distribution?
>
> I think d. is the most important, as it gives us information on what
> users really want. a. alone is somewhat redundant if we have d.
> c. might have some value when deciding whether to mask a particular
> flag (and implies a.).
>
> e. would be valuable if we wanted to determine the future of
> particular profiles, as well as e.g. estimate the transition to new
> versions.
>
> f. would be valuable to determine which repositories are used, but we
> would need to filter private repos from the output for privacy
> reasons.
>
> g. could be valuable in correlation with other data, but I'm not sure
> there's much direct value in it alone.
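To make the data points above concrete: a submission could be a small
JSON document along these lines. This is only a sketch; every field
name here is an assumption on my part, not a proposed spec.

```python
import json

# Illustrative submission payload covering points a./c./d./e./f./g.
# All field names are hypothetical, not an agreed format.
submission = {
    "world": ["app-editors/vim", "www-client/firefox"],      # d.
    "world_sets": ["@kde"],                                  # d.
    "packages": {                                            # a. + c.
        "app-editors/vim": {"use": ["acl", "-gpm"]},
        "www-client/firefox": {"use": ["pulseaudio"]},
    },
    "profile": "default/linux/amd64/17.1",                   # e.
    "repositories": ["gentoo", "guru"],                      # f.
    "distribution": "gentoo",                                # g.
}

payload = json.dumps(submission, sort_keys=True)
print(payload)
```

A document specification like Samuel mentions would mostly consist of
pinning down such a schema and its allowed values.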
>
>
> 2. How to handle Gentoo derivatives? Some of them could provide
> meaningful data, but some could provide false data (e.g. when
> derivatives override Gentoo packages). One possible option would be
> to filter a.-e. down to stuff coming from ::gentoo.
>
>
> 3. How to keep the data up-to-date? After all, if we just stack up a
> lot of old data, we will soon stop getting meaningful results. I
> suppose we'll need to timestamp all data and remove old entries.
>
>
> 4. How to avoid duplication? If some users submit their results more
> often than others, they would bias the results. 3. might be related.
>
>
> 5. How to handle clusters? Things are simple if we can assume that
> people will submit data for a few distinct systems. But what about
> companies that run 50 Gentoo machines with the same or similar setup?
> What about clusters of 1000 almost identical containers? Big entities
> could easily bias the results, but we should also make it possible
> for them to participate somehow.
>
>
> 6. Security. We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.
>
>
> 7. Privacy. Besides the above, our sysadmins would appreciate it if
> the data they submitted couldn't be easily correlated to them. If we
> don't respect the privacy of our users, we won't get them to submit
> data.
>
>
> 8. Spam protection. Finally, the service needs to be resilient to
> being spammed with fake data, both by users who want to make their
> packages look more important and by script kiddies who want to prove
> a point.
>
>
> My (partial) implementation idea
> ================================
> I think our approach should be privacy- and security-oriented first,
> and attempt to make the best of the data we can get while respecting
> this principle. This means no correlation and no tracking.
>
> Once the tool is installed, the user needs to opt in to using it.
> This involves accepting a privacy policy and setting up a cronjob.
> The tool would suggest a (random?) time for the submission to take
> place periodically (say, every week).
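The "suggest a random time" step could be as simple as picking a random
weekly slot and formatting a crontab line. A sketch; the command name
`gentoostats-submit` is made up here, not an existing tool:

```python
import random

def suggest_cron_entry(command="gentoostats-submit"):
    """Pick a random weekly slot and format it as a crontab line.

    The command name is a placeholder. Spreading submissions over
    random times also avoids a weekly thundering herd on the server.
    """
    minute = random.randrange(60)
    hour = random.randrange(24)
    weekday = random.randrange(7)  # 0 = Sunday in most cron dialects
    return f"{minute} {hour} * * {weekday} {command}"

print(suggest_cron_entry())
```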
>
> The submission would contain only raw data, without any identifying
> information. It would be encrypted using our public key. Once
> uploaded, it would be put into our input queue as-is.
>
> Periodically, the input queue would be processed in bulk. The
> individual statistics would be updated and the input would be
> discarded. This should prevent people from trying to correlate
> changes in statistics with individual uploads.
>
> Each counted item would have a timestamp associated with it, and we'd
> discard old items once per resubmission period. This should ensure
> that we keep fresh data and that people can update their earlier
> submissions without us storing identification data.
>
> For example, N users submit their data containing a list of packages
> every week. This data is used in bulk to update the counts of
> individual packages (technically, to append timestamps to the lists
> corresponding to these packages). Data older than one week is
> discarded, so we have rough counts of package use during the last
> week.
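The per-package timestamp bookkeeping described above could look
roughly like this. A sketch only: it uses in-memory dicts where a real
service would use a database, and the class name is invented.

```python
import time
from collections import defaultdict

WEEK = 7 * 24 * 3600

class PackageStats:
    """Store one timestamp per (submission, package); a package's count
    is the number of timestamps younger than the resubmission period."""

    def __init__(self):
        self._hits = defaultdict(list)  # package -> [timestamps]

    def process_batch(self, submissions, now=None):
        # submissions: iterable of package lists, processed in bulk so
        # individual uploads can't be told apart afterwards.
        now = time.time() if now is None else now
        for packages in submissions:
            for pkg in packages:
                self._hits[pkg].append(now)
        self._expire(now)

    def _expire(self, now):
        # Drop data older than one resubmission period.
        for pkg in list(self._hits):
            fresh = [t for t in self._hits[pkg] if now - t < WEEK]
            if fresh:
                self._hits[pkg] = fresh
            else:
                del self._hits[pkg]

    def count(self, pkg):
        return len(self._hits.get(pkg, ()))

stats = PackageStats()
stats.process_batch([["app-editors/vim"],
                     ["app-editors/vim", "sys-apps/ripgrep"]], now=0)
stats.process_batch([["app-editors/vim"]], now=WEEK + 1)
print(stats.count("app-editors/vim"))  # → 1 (week-old entries expired)
```

Resubmitting then simply refreshes a system's contribution without the
server ever knowing which rows belonged to which system.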
>
> I think this addresses problems 3./6./7.
>
>
> The other major problem is spam protection. The best semi-anonymous
> way I can see is to use the submitters' IPv4 addresses (can we
> support IPv6 then?). We could set a limit of, say, 10 submissions per
> IPv4 address per week. If some address exceeded that limit, we could
> require CAPTCHA authorization.
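A sliding-window limit like that is a few lines of bookkeeping. A
sketch; a real deployment would persist the state and sit behind the
upload endpoint rather than in memory:

```python
import time
from collections import defaultdict, deque

WEEK = 7 * 24 * 3600
LIMIT = 10  # submissions per address per week, as proposed above

class RateLimiter:
    def __init__(self, limit=LIMIT, window=WEEK):
        self.limit = limit
        self.window = window
        self._seen = defaultdict(deque)  # address -> recent timestamps

    def allow(self, address, now=None):
        """Return True if this address may submit now; False means
        'fall back to a CAPTCHA' rather than a hard reject."""
        now = time.time() if now is None else now
        q = self._seen[address]
        while q and now - q[0] >= self.window:
            q.popleft()  # forget submissions older than the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

limiter = RateLimiter()
results = [limiter.allow("192.0.2.7", now=i) for i in range(12)]
print(results.count(True))  # → 10: the 11th and 12th hit the limit
```

Note the only state kept per address is a handful of timestamps, which
fits the no-tracking principle better than storing whole submissions.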
>
> I think this would make spamming a bit harder while keeping
> submissions easy for most people, and a little harder but still
> possible for those of us behind ISP NATs.
>
> This should address problems 4./8. and maybe 5. to some degree.
>
>
> A proper solution to the cluster problem would probably involve some
> way to internally collect and combine data before submission. If you
> have large clusters of similar systems, I think you'd want to have
> all packages used on the different systems reported as one entry.
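Such an internal collector could deduplicate package lists across hosts
before anything leaves the organization, e.g. as below. A sketch;
whether the combined entry should carry a host count at all is exactly
the kind of thing the spec would have to decide:

```python
def combine_cluster(host_reports):
    """Merge per-host package lists into a single submission entry.

    host_reports: {hostname: [package, ...]}. Taking the union means
    1000 near-identical containers count once per package rather than
    1000 times, and hostnames never leave the organization.
    """
    combined = sorted({pkg for pkgs in host_reports.values()
                       for pkg in pkgs})
    return {"packages": combined, "hosts": len(host_reports)}

reports = {
    "node01": ["app-editors/vim", "net-misc/curl"],
    "node02": ["app-editors/vim", "net-misc/curl"],
    "node03": ["app-editors/vim", "dev-db/postgresql"],
}
print(combine_cluster(reports))
```

This is also where Samuel's broker idea would slot in naturally.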
>
>
> I think we should collect data from users running all Gentoo
> derivatives, as long as they are using Gentoo packages. The simplest
> solution I can think of would be to filter the results to packages
> (or profiles) installed from ::gentoo. This will only work for
> distros that expose ::gentoo explicitly (vs. copying our ebuilds to
> their repositories), though.
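Since the installed-package database records the originating repository
per package, the client-side filter reduces to something like the
following. A sketch over an in-memory mapping; the real tool would read
the data via Portage's API rather than take a dict:

```python
def filter_to_repo(installed, repo="gentoo"):
    """Keep only packages whose recorded origin is the given repo.

    installed: {package: repository-name}, as the client would obtain
    it from the installed-package database. Packages from private or
    unknown repos are dropped before anything is submitted, which also
    helps with the privacy concern in point f.
    """
    return sorted(pkg for pkg, origin in installed.items()
                  if origin == repo)

installed = {
    "app-editors/vim": "gentoo",
    "dev-util/mytool": "my-private-overlay",  # hypothetical private repo
    "net-misc/curl": "gentoo",
}
print(filter_to_repo(installed))  # → ['app-editors/vim', 'net-misc/curl']
```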
>
>
> What do you think? Do you foresee other problems? Do you have other
> needs? Can you think of better solutions?
>