Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation - gentoo-dev

From:	Jaco Kroon <jaco@××××××.za>
To:	gentoo-dev@l.g.o, "Michał Górny" <mgorny@g.o>
Subject:	Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Date:	Tue, 05 May 2020 18:04:36
Message-Id:	`93fdada0-857e-67f4-553b-e4aa81cf3373@uls.co.za`
In Reply to:	[gentoo-dev] [RFC] Ideas for gentoostats implementation by "Michał Górny"

Hi Michał, and the rest of the Gentoo devs,

I've been patiently sitting and watching this discussion.

I raised some ideas with another developer (not Michał) just days before
he raised this thread to the ML.

I believe all points raised so far are valid.  I'll try to summarise:

1.  This must be completely *opt in*.
2.  Anonymity was discussed by various parties (privacy).
3.  "spam" protection (ie, preventing bogus data from entering).
4.  Trustworthiness of data.
5.  Acceptance of some form of privacy policy.

In my opinion, points 2 and 3 work against each other: if registration
is compulsory for anyone who would like to submit stats, then we can
control the spam more easily (not foolproof), but requiring
registration also raises the entry barrier.  I'd be completely willing
to provide at least an email address as part of a submission.

All of the replies seem to have focused purely on yes/no, do it or
don't.  Not many have addressed the benefits to end users/system
administrators.  The focus seems to be on what we as developers can get
out of this.

Regarding the above points:

1.  I fully agree.  This should not be forced on anyone.
2.  Happy to concede that some people may wish to submit anonymously.
Let them.
3.  I'll address this below.
4.  A lot of the discussion has been around the usefulness of the data,
and I concede to Thomas that this may (or may not) generate "decision
blind spots" or "artificially increase decision certainty".  I
don't see how this is worse than what we've got now.
5.  We have the infrastructure for this already by way of licenses.  So
we ship with "GPLv2/3/whatever + GentooPrivacy", and users have to first
take explicit action to accept GentooPrivacy.

I have some other ideas around this, which will tread even further on
privacy, but again, all of this should be opt-in.  Building on the
ideas by Kent, where he suggested a form of submission proxy
(STATS_SERVER), we could potentially give the full benefit of the code
to such entities, but then still allow them to submit "upstream" in a
more filtered manner.

Bottom line, in my opinion:  Any data is better than no data!

Whilst we can't say "no one is using xyz", we will at least be able to
say "hey, some people are using xyz", and whilst this may generate some
blind spots, it at least enables us to test known use cases during
test-builds, eg, we know for a fact a thousand users are using package X
with USE flags "-* a b c", so we should definitely run that as a compile
test.  Your build breaks frequently?  Would you mind submitting stats?
Great, thank you.  If you're not willing to do that, then my stance
becomes one of "ok, I'll help where I can, but really, please consider
letting us help you; if you submit stats we can pre-emptively at least
include build tests for your specific USE flags."  And again, this means
we can actually have our tooling use these stats to generate build tests
for the "known popular" configs.
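
As a rough sketch of how tooling could pick the "known popular" configs:
this assumes each submission is a dict mapping a package name to its
enabled USE flags.  The data shape and the name popular_use_configs are
hypothetical, purely for illustration.

```python
from collections import Counter


def popular_use_configs(reports, package, top=3):
    """Return the most common USE flag combinations seen for one package.

    `reports` is assumed to be a list of dicts, one per submitting system,
    mapping package name -> iterable of enabled USE flags (hypothetical shape).
    """
    counts = Counter(
        frozenset(report[package])  # flag order must not matter
        for report in reports
        if package in report
    )
    # Each entry is (sorted flag list, number of systems using that combo).
    return [(sorted(flags), seen) for flags, seen in counts.most_common(top)]
```

The combinations returned here would be the candidates to feed into
compile tests, most widely used first.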

I point you to RHEL - why are people willing to pay for RHEL?  What
do they get for that buck?  Because I promise you, the support I get
from fellow Gentoo'ers FAR outweighs the support I have ever gotten from
(paid for) RHEL.  Most of the time.

I myself used to run 500+ Gentoo hosts more than 15 years back.  It was
fun.  I was also a student back then so had much more time on my hands
than I do now.  It was challenging, and fun to try and get things to
work exactly the way we envisioned they should.  I promise you, if what
Michał proposes had been available to me back then, firstly to keep
track of my own internal assets, and also to submit stats upstream to
help improve Gentoo, I would not have hesitated for 10 seconds.

And there I touch on a point I'm trying to make - this should be
something that not only helps devs, but brings benefit to users.  I'll
say more on this at the end of the email (possibly requiring users to
run some of their own infra for this, but these stats could potentially
form the framework for a multi-system management system too).  First
I'd like to pay more attention to the individual points raised by Michał.

On 2020/04/26 10:08, Michał Górny wrote:

> Hi,
>
> The topic of rebooting gentoostats comes here from time to time.  Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems.  I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it.  However, I'd like to
> get a clear plan on how it should be done if someone actually does it.

My time is also limited, but I would love to be involved in some way or
another.

> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove.  Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
>    a. list of installed packages?
>    b. versions (or slots?) of installed packages?
>    c. USE flags on installed packages?
>    d. world and world_sets files
>    e. system profile?
>    f. enabled repositories? (possibly filtered to official list)

All of the above, including exact versions and USE flags for each
package.  Also, I'm sure there are others, but I sometimes have systems
that fall behind on certain packages, either because they are no longer
pulled in from world or for other reasons (eg, a specific SLOT that no
longer updates for some reason, although this situation has improved).

>    g. distribution?

/etc/gentoo-release?

Yes, I think so; that partially deals with your "derivative distributions".

h.  date+time of last successful emerge --sync (probably individually
for each repository).
i.  /var/log/emerge.log
j.  hardware data, eg, amount of RAM, CPU clock speed/cores, disks.
k.  hostname + other network info (IP address).

i - build failures might be helpful.  Might be useful to get exact
merge times assuming that users want some extra features for user
benefit, not Gentoo dev benefit.
j,k - definitely not of use to devs, but possibly to users as a form of
"hardware inventory".

Much of this is definitely not data that we want/need, but if the data
gets proxied, then we and our users can use this as a form of inventory
management system too.

> I think d. is most important as it gives us information on what users
> really want.  a. alone is kinda redundant is we have d.  c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
>
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
>
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.

I agree with all of this.

> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.

Don't think so, but see your own point 2.

>
>
> 2. How to handle Gentoo derivatives?  Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages).  One possible option would be to filter a.-e.
> to stuff coming from ::gentoo.

It may be of benefit to know which ::gentoo packages they are using, and
if we make the code available to those distributions as a form of
proxy/peer, then any of their hosts that submit directly to Gentoo we
could dispatch to that distribution's infra, or if we're really nice,
just keep the data and strip out the packages we don't maintain (ie, not
::gentoo or official repositories).
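
A minimal sketch of the "strip out the packages we don't maintain" step,
assuming each package entry arrives as a string ending in "::<repo>".
The entry format, the default list of official repositories and the
name strip_unofficial are assumptions made for illustration only.

```python
def strip_unofficial(packages, official_repos=("gentoo",)):
    """Keep only package entries that come from an official repository.

    Entries are assumed to look like 'category/name-version::repo'
    (hypothetical submission format); entries without a '::repo' suffix
    are dropped because their origin is unknown.
    """
    kept = []
    for entry in packages:
        _atom, sep, repo = entry.rpartition("::")
        if sep and repo in official_repos:
            kept.append(entry)
    return kept
```

The same function could run either on Gentoo infra or on a
derivative's own proxy before anything is sent upstream.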

>
>
> 3. How to keep the data up-to-date?  After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results.  I suppose
> we'll need to timestamp all data and remove old entries.

My opinion on this: an automated cron job that dispatches daily, or at
least weekly.  Daily provides better granularity for some other ideas
aimed at system administrators, eg, when did what change?  I shove /etc
into git for this reason alone, with a nightly cron job to commit
everything and push it to a remote server, which also serves as a form
of configuration backup.

>
>
> 4. How to avoid duplication?  If some users submit their results more
> often than others, they would bias the results.  3. might be related.

I think this directly relates to spam.  So I fully agree with the UUID
per installation concept.  But then systems get cloned (our labs used to
be updated on a single machine, then we utilized udpcast to update the
rest of the systems, so they would all end up with the same UUID).  So
the primary purpose of this is to find the origin of the installation,
but it can be trivially manipulated, either by force-generating a new
UUID or by copying one from another machine.

I think we need to add a secondary, hardware-based identifier.

Digium (now Sangoma) checks the MAC addresses of all ethX interfaces,
starting from 0 until the ioctl gets a failure; if eth0 fails, it
basically does "ip ad sh" and ends up including the same MAC multiple
times, and in arbitrary order, since the NICs aren't guaranteed to be
detected in the same order on every boot.  This (or a related) method
could work: generate some unique hardware-based identifier, then hash it
using, say, SHA-256 or BLAKE2 to generate something which can't be
trivially reversed back to the original identifier.  Why?  Well,
anonymity :).  We could even include the configured or DHCP-obtained
hostname in this.
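
To make the idea concrete, here is a small sketch of such a hash.  It is
not Digium's method, just an illustration: the function name
hardware_ident and the choice to de-duplicate and sort the MACs first
(to avoid the repeated-MAC and ordering problems mentioned above) are my
own assumptions.

```python
import hashlib


def hardware_ident(macs, hostname=""):
    """Return a hex digest identifying a machine from its MAC addresses.

    Hypothetical helper: MACs are normalised, de-duplicated and sorted so
    that NIC detection order and repeated entries do not change the result.
    The hostname is optional extra input, as suggested in the text.
    """
    unique = sorted({mac.strip().lower() for mac in macs if mac.strip()})
    material = "\n".join(unique + [hostname.strip().lower()])
    # BLAKE2b with a 32-byte digest; hashlib.sha256 would work the same way.
    return hashlib.blake2b(material.encode("utf-8"), digest_size=32).hexdigest()
```

MAC addresses have a fairly small search space, so a plain hash only
protects against casual reversal, which is about what "can't be
trivially reversed" asks for.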

> 5. How to handle clusters?  Things are simple if we can assume that
> people will submit data for a few distinct systems.  But what about
> companies that run 50 Gentoo machines with the same or similar setup?
> What about clusters of 1000 almost identical containers?  Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.

Assuming they do what we did ... they'd probably (hopefully) all end up
with the same (installation time?) UUID but different hardware
identifiers.  So we'd be able to identify them ... and, as an enterprise
idea, report back to those admins (assuming they registered these
systems to their profile) that their clusters have discrepancies.
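
A sketch of what such a discrepancy report could look like, assuming
every submission carries a "uuid", a "hw_hash" and a flat "packages"
list.  These field names and the function cluster_discrepancies are
hypothetical.

```python
from collections import defaultdict


def cluster_discrepancies(submissions):
    """Group hosts sharing a UUID and list packages not common to all of them.

    Returns {uuid: {hw_hash: [packages this host has beyond the common set]}}
    for every UUID reported by more than one hardware hash.
    """
    clusters = defaultdict(dict)
    for sub in submissions:
        clusters[sub["uuid"]][sub["hw_hash"]] = set(sub["packages"])

    report = {}
    for uuid, hosts in clusters.items():
        if len(hosts) < 2:
            continue  # a single machine is not a cluster
        common = set.intersection(*hosts.values())
        report[uuid] = {
            hw: sorted(pkgs - common)
            for hw, pkgs in hosts.items()
            if pkgs - common
        }
    return report
```

An empty inner dict means all hosts in that cluster reported identical
package lists.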

>
>
> 6. Security.  We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.

Agreed.  But some of this may have particular benefit for system
administrators, so perhaps we need a secondary level of opt-in for
providing data that is "potentially sensitive" if the Gentoo infra gets
compromised.  We could perhaps store a raw blob for these users that can
only be decrypted with some key that only they should have/possess.

Or, we could proxy the data, let the sensitive stuff travel to the
proxy/aggregator, and strip that from going higher up.  They then simply
generate those reports locally on their proxy/aggregator.

>
>
> 7. Privacy.  Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them.  If we
> don't respect privacy of our users, we won't get them to submit data.

I'm happy with blind UUID + HW-related-hash submission, without
any further data, but would really appreciate it if users are willing to
register.  This would have the following benefits IMHO:

They could subscribe to news items that affect them.
They could subscribe to receive GLSAs for packages that affect their
systems.
They could get a view of all their systems from a central "management"
interface.

I have a need to be able to ask the asterisk users on Gentoo what they
need/want.  As it stands, I'm suffering from "user blindness".  Again, I
have my own needs, and I scratch those, but helping others to get their
needs scratched is a good thing.  If you don't want to participate,
that's fine, but if you do, you get to reap the benefit.  Towards this
end, a further future step may be to enable users to anonymously submit
requests via the system.  Or we could get anonymous feedback from users
from whom we'd normally not get any.  So if the core infra on this has
email addresses for all users, it could send out the email on behalf of
the package maintainer, and feedback could then be submitted via some
anonymous mechanism (eg, a link in the email that takes the user to a
submissions page, where we explicitly don't encode per-recipient
cookie-style data into the link).  An idea.

>
>
> 8. Spam protection.  Finally, the service needs to be resilient to being
> spammed with fake data.  Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.

Data only gets included after being kept up to date for a period of at
least X days.  This is based on the generated UUID + HW-Hash.  The UUID
is (optionally but ideally) linked to a user profile.  The HW-Hash is
just to identify unique systems.

Data that doesn't get kept up to date could be filtered out after Y
days, where Y <= X.  That way a spammer would at least need to take the
effort of keeping his spamming going for X days with some number of
unique (trivially spoofable) identifiers.  So we don't deny that it can
be done, I'm just not sure we care?
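
The X/Y rule could be sketched roughly as below.  This assumes an
in-memory dict keyed on (UUID, HW-Hash) that stores the start of the
current submission streak and the last submission time; the names
record_submission and is_counted, and the decision to restart the
streak after a gap longer than Y days, are my own assumptions.

```python
from datetime import datetime, timedelta


def record_submission(state, key, now, y_days):
    """Note a submission for key = (uuid, hw_hash) at time `now`."""
    streak_start, last_seen = state.get(key, (now, now))
    if now - last_seen > timedelta(days=y_days):
        streak_start = now  # went stale, so the X-day clock starts over
    state[key] = (streak_start, now)


def is_counted(state, key, now, x_days, y_days):
    """True if the system has been kept up to date for X days and is fresh."""
    if key not in state:
        return False
    streak_start, last_seen = state[key]
    fresh = now - last_seen <= timedelta(days=y_days)
    proven = last_seen - streak_start >= timedelta(days=x_days)
    return fresh and proven
```

With X=7 and Y=3, a system submitting daily only starts counting after
a week, and drops out again if it stays silent for more than three days.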

Other than me, who would benefit from spoofing stats for asterisk, for
example?  Perhaps someone with a grudge?  But they have my email address
anyway ... so can do far worse than generate a few spoofed submissions.

> My (partial) implementation idea
> ================================
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle.  This means no correlation and no tracking.

I both agree and disagree.  The most basic premise should be no
tracking/correlation unless the user specifically requests it for
specific functionality (eg, emailing of GLSAs/news items that affect
them, or a single platform for viewing my hosts and what their status
is).

> Once the tool is installed, the user needs to opt-in to using it.  This
> involves accepting a privacy policy and setting up a cronjob.  The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).

As above, I'd do this as part of accepting a license which states that,
by accepting it, you accept the most basic submission of stats in an
anonymous manner, including only the most basic identifier information
needed to identify unique systems.

> The submission would contain only raw data, without any identification
> information.  It would be encrypted using our public key.  Once
> uploaded, it would be put into our input queue as-is.

Correct.  Explicit action should be required to register a UUID to a
user profile, if that is an option.

Eg, gentoo-stat --link-to jaco@×××××××.za

Then prompt for my password, which I then need to enter in order to link
the UUID of the current system to my registered profile.

So completely anonymous, with minimum data, unless specifically
configured otherwise.

>
> Periodically the input queue would be processed in bulk.  The individual
> statistics would be updated and the input would be discarded.  This
> should prevent people trying to correlate changes in statistics with
> individual uploads.

Ok.  This makes sense.  As a sysadmin I'd like that data to be
available for say 30 to 60 or even 90 days, or at least "what changed
from submission X to X+1 spanning the period", because then if something
breaks, I can ask "when did it break?" and then I can ask the stats
system "what changed on the related systems around that time?".  At
Gentoo core infra level, we can potentially discard as soon as
processed, but depending on the algorithm we may need to keep at least
the latest submitted copy for Y number of days (as defined above).
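
The "what changed from submission X to X+1" question boils down to a
diff of two package snapshots.  A minimal sketch, assuming each snapshot
is a dict mapping package name to installed version (a simplified,
hypothetical shape that ignores slots and USE flags):

```python
def diff_submissions(old, new):
    """Compare two {package: version} snapshots of the same system.

    Returns (added, removed, changed), where `changed` maps a package to
    its (old_version, new_version) pair.
    """
    added = {pkg: ver for pkg, ver in new.items() if pkg not in old}
    removed = {pkg: ver for pkg, ver in old.items() if pkg not in new}
    changed = {
        pkg: (old[pkg], new[pkg])
        for pkg in old.keys() & new.keys()
        if old[pkg] != new[pkg]
    }
    return added, removed, changed
```

Storing only these diffs rather than full submissions would be one way
to keep 30 to 90 days of history cheaply on a proxy/aggregator.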

Ok, yes, I can do that by working through /var/log/emerge.log as well,
or genlop -l, but I need to do that system by system.  If I have an
environment of 500 hosts this gets tedious.  Or what if I'd like to find
what differs between a set of hosts where a feature X works, and others
where it doesn't?

>
> What do you think?  Do you foresee other problems?  Do you have other
> needs?  Can you think of better solutions?
>

I think we should build a hierarchy, with Gentoo-infra at the top.  End
users may submit only certain types of data there; all other data, which
we as devs don't care about, gets discarded, and if we allow users to
register there directly we limit the functionality thereof in order to
serve the requirements of the developers first and foremost.

As such, the submission should be based on "data sets" in my
opinion, where the most basic sets could be:

core:
  a) package list including versions and use flags
  b) world and world_sets
  c) uuid
  d) hash(hardware ident)

hardware:
  a) RAM
  b) ...

network:
  a) ...

At the Gentoo-infra layer we can then have a policy that we ONLY accept
"core" sets.  It should be easy to define your own sets at the
proxy/aggregator level and to provide mechanisms to obtain the data (or
to do so as plugins on the hosts themselves, eg, USE="hardware network"
gentoo-stats-plugins style), with the main package only containing what
the devs need.  Just ideas.

Further down the hierarchy additional sets could be defined, and
proxy/aggregator hosts could define what information they allow higher
up the hierarchy.
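
A sketch of how a proxy/aggregator could enforce such a policy, assuming
a submission is a dict keyed by data set name ("core", "hardware",
"network", ...).  The function name filter_upstream and the payload
layout are hypothetical.

```python
def filter_upstream(submission, allowed_sets=("core",)):
    """Return only the data sets this level allows to travel further up.

    `submission` is assumed to map data set name -> payload; anything not
    listed in `allowed_sets` stays on the proxy/aggregator.
    """
    return {
        name: payload
        for name, payload in submission.items()
        if name in allowed_sets
    }
```

Gentoo infra would apply the same filter with allowed_sets=("core",) on
receipt, so a misconfigured proxy could not push extra sets upstream.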

If we receive information for a Gentoo derivative we redirect it to that
distribution.  Although for such a case we really should provide a way
for derivatives to specify their own "default" infra.

Other projects can then build on top of, or as plug-ins of, the core
stats project to provide the more enterprise-like features.  One
could potentially even go as far as automated updating driven from a
central control server in a networked environment where the
proxy/aggregator is able to connect back to the individual hosts to
execute commands on them.

I sincerely hope my ramblings haven't been completely off point.  I
believe the above shows that this can be of benefit to users and
developers alike, and hopefully in a way that does not infringe on
users' rights or privacy.

One thing could be for aggregators to submit aggregated stats instead of
individual systems.  Again, the same X and Y stuff would apply; however,
I think for aggregated submissions the data skew risk becomes even
larger.  So perhaps we should provide two sets of stats, "excluding
aggregated stats" and "including aggregated stats", or possibly we can
mark some aggregators as trusted.  I dunno.

Kind Regards,
Jaco

1	Hi Michał, and the rest of the Gentoo devs,
2
3	I've been patiently sitting and watching this discussion.
4
5	I raised some ideas with another developer (Not Michał) just days before
6	he raised this thread to the ML.
7
8	I believe all points raised to this point is valid, I'll try to summarise:
9
10	1. This must be completely opt in.
11	2. Anonymity was discussed by various parties (privacy).
12	3. "spam" protection (ie, preventing bogus data from entering).
13	4. Trustworthiness of data.
14	5. Acceptance of some form of privacy policy.
15
16	In my opinion, points 2 and 3 works against each other, in that if
17	registration is compulsory if you would like to submit stats, then we
18	can control the spam more easily (not foolproof), but requiring
19	registration also raises the entry barrier. I'd be completely willing
20	to provide at least an email address as part of a submission.
21
22	All of the replies seems to have focused purely on yes/no, do it or
23	don't. Not many have addressed the benefits to end users/system
24	administrators. It seems to focus is on what we as developers can get
25	out of this.
26
27	Regarding the above points:
28
29	1. I fully agree. This should not be forced on anyone.
30	2. Happy to concede that some people may wish to submit anonymously.
31	Let them.
32	3. I'll address this below.
33	4. A lot of the discussion has been around the usefulness of the data,
34	and I concede to Thomas that this may (or may not) generate "decision
35	blind spots" or as per "artificially increase decision certainty". I
36	don't see how this is worse than what we've got now.
37	5. We have the infrastructure for this already by way of licenses. So
38	we ship with "GPLv2/3/whatever + GentooPrivacy", and users have to first
39	take explicit action to accept GentooPrivacy.
40
41	I have some other ideas around this, which will tread even further on
42	privacy, but again, all of this should be a kind of opt-in, and building
43	on the ideas by Kent where he suggested a form of submission proxy
44	(STATS_SERVER), we could potentially give the full benefit of the code
45	to such entities, but then still allow them to submit "upstream" in a
46	more filtered manner.
47
48	Bottom line, in my opinion: Any data is better than no data!
49
50	Whilst we can't say "no one is using xyz", we will at least be able to
51	say "hey, some people are using xyz", and whilst this may generate some
52	blinds it at least enables us to test known use cases during
53	test-builds, eg, we know for a fact a thousand users are using package X
54	with USE flags "-* a b c", so we should definitely run that as a compile
55	test. Your build breaks frequently? Would you mind submitting stats?
56	Great thank you. You not willing to do that, then my stance becomes one
57	of "ok, I'll help where I can, but really, please consider us to help
58	you, if you submit stats we can pre-emptively at least include build
59	tests for your specific USE flags." - and again, this means we can
60	actually have our tooling use these stats to generate build tests for
61	the "known popular" configs.
62
63	I point you to RHEL - why are people willing to pay for for RHEL? What
64	do they get for that buck? Because I promise you, the support I get
65	from fellow Gentoo'ers FAR outweigh the support I have ever gotten from
66	(paid for) RHEL. Most of the time.
67
68	I myself used to run 500+ Gentoo hosts more than 15 years back. It was
69	fun. I was also a student back then so had much more time on my hands
70	than I do now. It was challenging, and fun to try and get things to
71	work exactly the way we envisioned it should. I promise you, if what
72	Michał proposes was available for me back then to firstly keep track of
73	my own internal assets, and to submit stats upstream to help improve
74	Gentoo I would not have hesitated for 10 seconds.
75
76	And there I touch on a point I'm trying to make - this should be
77	something that not only helps devs, but brings benefit to users. I'll
78	say more on this at the end of the email (possibly force users to run
79	some of their own infra for this at least, but these stats form the
80	framework for a multi-system management system too, potentially). First
81	I'd like to pay more attention to the individual points raised by Michał.
82
83	On 2020/04/26 10:08, Michał Górny wrote:
84
85	> Hi,
86	>
87	> The topic of rebooting gentoostats comes here from time to time. Unless
88	> I'm mistaken, all the efforts so far were superficial, lacking a clear
89	> plan and unwilling to research the problems. I'd like to start
90	> a serious discussion focused on the issues we need to solve, and propose
91	> some ideas how we could solve them.
92	>
93	> I can't promise I'll find time to implement it. However, I'd like to
94	> get a clear plan on how it should be done if someone actually does it.
95
96	My time is also limited, but I would love to be involved in some way or
97	another.
98
99	> The big questions
100	> =================
101	> The way I see it, the primary goal of the project would be to gather
102	> statistics on popularity of packages, in order to help us prioritize our
103	> attention and make decisions on what to keep and what to remove. Unlike
104	> Debian's popcon, I don't think we really want to try to investigate
105	> which files are actually used but focus on what's installed.
106	>
107	> There are a few important questions that need to be answered first:
108	>
109	> 1. Which data do we need to collect?
110	>
111	> a. list of installed packages?
112	> b. versions (or slots?) of installed packages?
113	> c. USE flags on installed packages?
114	> d. world and world_sets files
115	> e. system profile?
116	> f. enabled repositories? (possibly filtered to official list)
117	All of the above. Including exact versions and USE flags for each
118	package. Also, I'm sure there are others, but I sometimes have systems
119	that fall behind on certain packages, either by no longer being included
120	from world or for other reasons (eg, a specific SLOT that no longer
121	updates for some reason, although this situation has improved).
122	> g. distribution?
123	/etc/gentoo-release?
124
125	Yes, I think so, that partially deals with your "derivative distributions".
126
127	h. date+time of last successful emerge --sync (probably individually
128	for each repository).
129	i. /var/log/emerge.log
130	j. hardware data, eg, amount of RAM, CPU clock speed/cores, disks.
131	k. hostname + other network info (IP address).
132
133	i - build failures might be helpful. Might be useful to get exact
134	merge times assuming that users want some extra features for user
135	benefit, not gentoo dev benefit.
136	j,k - definitely not of use to devs, but possibly to users as a form of
137	"hardware inventory".
138
139	Much of this is definitely not data that we want/need, but if the data
140	gets proxied, then we and our users can use this as a form of inventory
141	management system too.
142
143	> I think d. is most important as it gives us information on what users
144	> really want. a. alone is kinda redundant is we have d. c. might have
145	> some value when deciding whether to mask a particular flag (and implies
146	> a.).
147	>
148	> e. would be valuable if we wanted to determine the future of particular
149	> profiles, as well as e.g. estimate the transition to new versions.
150	>
151	> f. would be valuable to determine which repositories are used but we
152	> need to filter private repos from the output for privacy reasons.
153	I agree with all of this.
154	> g. could be valuable in correlation with other data but not sure if
155	> there's much direct value alone.
156	Don't think so, but see your own point 2.
157	>
158	>
159	> 2. How to handle Gentoo derivatives? Some of them could provide
160	> meaningful data but some could provide false data (e.g. when derivatives
161	> override Gentoo packages). One possible option would be to filter a.-e.
162	> to stuff coming from ::gentoo.
163
164	It may be of benefit to know which ::gentoo packages they are using, and
165	if we make the code available to those distributions as a form of
166	proxy/peer, then any hosts that submit directly to Gentoo we could
167	dispatch to that distributions' infra, or if we're really nice, just
168	keep it and strip out the packages we don't maintain (ie, not ::gentoo
169	or official repositories).
170
171	>
172	>
173	> 3. How to keep the data up-to-date? After all, if we just stack a lot
174	> of old data, we will soon stop getting meaningful results. I suppose
175	> we'll need to timestamp all data and remove old entries.
176
177	My opinion on this, automated cron, that dispatches daily. At least
178	weekly. Daily provides better granularity for some other ideas aimed at
179	system administrators. Eg, when did what change? I shove /etc into git
180	for this reason alone with a nightly cron to commit everything and push
181	it to a remote server, also serves as a form of configuration backup.
182
183	>
184	>
185	> 4. How to avoid duplication? If some users submit their results more
186	> often than others, they would bias the results. 3. might be related.
187
188	I think this directly relate to SPAM. So I fully agree with the UUID
189	per installation concept. But then systems get cloned (our labs used to
190	be updated on a single machine, then we utilized udpcast to update the
191	rest of the systems, so they would all end up with the same UUID). So
192	the primary purpose of this is to find the origin of the installation,
193	but can be trivially bypassed either by force generating a new UUID, or
194	copying from other machines, so this can be trivially manipulated.
195
196	I think we need to add a secondary, hardware based identifier.
197
198	Digium (now Sangoma) checks for all MAC addresses for ethX, starting
199	from 0 until the ioctl gets a failure, if eth0 fails, it basically does
200	"ip ad sh" and end up including the same MAC multiple times, and in
201	arbitrary order since the NICs aren't guaranteed to be detected in the
202	same order on every boot. This (or a related) method could work, so
203	generate some unique hardware-based identifier, then hash it using say
204	SHA-256 or BLAKE2 to generate something which can't be trivially
205	reversed back to the original identifier? Why ... well, anonymity :).
206	We could even include the configured or dhcp obtained hostname into this.
207
208	> 5. How to handle clusters? Things are simple if we can assume that
209	> people will submit data for a few distinct systems. But what about
210	> companies that run 50 Gentoo machines with the same or similar setup?
211	> What about clusters of 1000 almost identical containers? Big entities
212	> could easily bias the results but we should also make it possible for
213	> them to participate somehow.
214	Assuming they do what we did ... they'd probably (hopefully) all end up
215	with the same (installation time?) UUID but different hardware
216	identifiers. So we'd be able to identify them ... and, as an enterprise idea,
217	report back to those admins (assuming they registered these systems to
218	their profile) that their clusters have discrepancies.
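
As a rough sketch of how such clusters could be spotted on the receiving
side: group submissions by UUID and look for UUIDs that show up with more
than one hardware hash. The function name and the (uuid, hw_hash) tuple
layout are assumptions made for this example only.

```python
from collections import defaultdict


def find_clusters(submissions, min_size=2):
    """Return {uuid: [hw_hash, ...]} for UUIDs seen on several machines.

    `submissions` is an iterable of (uuid, hw_hash) pairs. A UUID seen
    with at least `min_size` distinct hardware hashes is treated as a
    cloned installation, i.e. a cluster.
    """
    by_uuid = defaultdict(set)
    for uuid, hw_hash in submissions:
        by_uuid[uuid].add(hw_hash)
    return {u: sorted(h) for u, h in by_uuid.items() if len(h) >= min_size}
```

For example, find_clusters([("u1", "h1"), ("u1", "h2"), ("u2", "h3")])
reports only "u1", since "u2" was seen on a single machine.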
219	>
220	>
221	> 6. Security. We don't want to expose information that could be
222	> correlated to specific systems, as it could disclose their
223	> vulnerabilities.
224
225	Agreed. But some of this may have particular benefit for system
226	administrators, so perhaps a secondary level of opt-in for providing
227	"potentially sensitive data" if the Gentoo infra gets compromised. We
228	could perhaps store a raw blob for these users that only gets decrypted
229	by some key that only they should have/possess.
230
231	Or, we could proxy the data, let the sensitive stuff travel to the
232	proxy/aggregator, and strip that from going higher up. And they simply
233	generate those reports locally on their proxy/aggregator.
234
235	>
236	>
237	> 7. Privacy. Besides the above, our sysadmins would appreciate if
238	> the data they submitted couldn't be easily correlated to them. If we
239	> don't respect privacy of our users, we won't get them to submit data.
240
241	I'm happy with either blind UUID + HW-related-hash submission, without
242	any further data, but would really appreciate if users are willing to
243	register. This would have the following benefits IMHO:
244
245	They could subscribe for news items that affect them.
246	They could subscribe for receiving GLSAs for packages that affect their
247	systems.
248	They could get a view of all their systems from a central "management"
249	interface.
250
251	I have a need to be able to ask the asterisk users on Gentoo what they
252	need/want. As it stands, I'm suffering from "user blindness". Again, I
253	have my own needs, and I scratch those, but helping others to get their
254	needs scratched is a good thing. If you don't want to participate,
255	that's fine, but if you do, you get to reap the benefit. Towards this
256	end, and perhaps enabling some users to provide some feedback a further
257	future step may be to enable users to anonymously submit requests via
258	the system. Or we could get anonymous feedback from users from whom
259	we'd normally not get any. So if the core infra on this has email
260	addresses for all users, it could send out the email on-behalf-of the
261	package maintainer, and feedback could then be submitted via some
262	anonymous mechanism (eg, link in email that takes the user to a
263	submissions page, and we explicitly don't encode per-recipient
264	cookie-style data into the link). An idea.
265
266	>
267	>
268	> 8. Spam protection. Finally, the service needs to be resilient to being
269	> spammed with fake data. Both to users who want to make their packages
270	> look more important, and to script kiddies that want to prove a point.
271
272	Data only gets included after being kept up to date for a period of at
273	least X days. Based on generated UUID + HW-Hash. UUID is (optionally
274	but ideally) linked to a user profile. HW-Hash is just to identify
275	unique systems.
276
277	Data that doesn't get kept up to date could be filtered out after Y
278	days, where Y <= X. That way a spammer would at least need to take the
279	effort of keeping his spamming effort going for X number of days with X
280	number of unique (trivially spoofable) identifiers. So we don't deny
281	that it can be done, I'm just not sure we care?
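
The X/Y rule could look roughly like the sketch below. It is a
simplification under my own assumptions: "kept up to date for X days" is
approximated as the span between the first and the latest submission for a
(UUID, HW-Hash) pair, and the function and parameter names are made up for
illustration.

```python
from datetime import date, timedelta


def is_counted(first_seen, last_seen, today, x_days, y_days):
    """Return True if a (UUID, HW-Hash) pair should count in the stats.

    first_seen / last_seen are the dates of the first and most recent
    submission for that pair. The pair counts once it has been reporting
    for at least X days and its latest report is no older than Y days.
    """
    if y_days > x_days:
        raise ValueError("expected Y <= X")
    kept_up_long_enough = (last_seen - first_seen) >= timedelta(days=x_days)
    still_fresh = (today - last_seen) <= timedelta(days=y_days)
    return kept_up_long_enough and still_fresh
```

With X = 30 and Y = 7, a system that started reporting three days ago is
not counted yet, and one whose last report is five weeks old is dropped.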
282
283	Other than me, who would benefit from spoofing stats for asterisk for
284	example? Perhaps someone with a grudge? But they have my email address
285	anyway ... so can do far worse than generate a few spoofed submissions.
286
287	> My (partial) implementation idea
288	> ================================
289	> I think our approach should be oriented on privacy/security first,
290	> and attempt to make the best of the data we can get while respecting
291	> this principle. This means no correlation and no tracking.
292
293	I both agree and disagree. The most basic premise should be no
294	tracking/correlation unless the user specifically requests it towards
295	specific functionality (eg, emailing of applicable GLSAs/news items, a
296	single platform for viewing my hosts and what their status is).
297
298	> Once the tool is installed, the user needs to opt-in to using it. This
299	> involves accepting a privacy policy and setting up a cronjob. The tool
300	> would suggest a (random?) time for submission to take place periodically
301	> (say, every week).
302
303	As above, I'd do this as part of accepting a license that states by
304	accepting this license you accept the most basic submission of stats in
305	an anonymous manner including only the most basic of identifier
306	information to identify unique systems.
307
308	> The submission would contain only raw data, without any identification
309	> information. It would be encrypted using our public key. Once
310	> uploaded, it would be put into our input queue as-is.
311
312	Correct. Explicit action required to register UUID to user profile. If
313	that is an option.
314
315	Eg, gentoo-stat --link-to jaco@×××××××.za
316
317	Then prompt for my password, which I then need to enter in order to link
318	the UUID of the current system to my registered profile.
319
320	So completely anonymous, with minimum data, unless specifically
321	configured otherwise.
322
323	>
324	> Periodically the input queue would be processed in bulk. The individual
325	> statistics would be updated and the input would be discarded. This
326	> should prevent people trying to correlate changes in statistics with
327	> individual uploads.
328
329	Ok. This makes sense. As a sysadmin I'd like that data to be
330	available for say 30 to 60 or even 90 days, or at least "what changed
331	from submission X to X+1 spanning the period", because then if something
332	breaks, I can ask "when did it break?" and then I can ask the stats
333	system "what changed on the related systems around that time?". At
334	Gentoo core infra level, we can potentially discard as soon as
335	processed, but depending on the algorithm we may need to keep at least
336	the latest submitted copy for Y number of days (as defined above).
337
338	Ok, yes, I can do that by working through /var/log/emerge.log as well,
339	or genlop -l, but I need to do that system by system. If I have an
340	environment of 500 hosts this gets tedious. Or what if I'd like to find
341	what differs between a set of hosts where a feature X works, and others
342	that don't?
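
To make the "what changed?" and "what differs?" questions concrete, here is
a small sketch that compares two package snapshots. The same function works
for two submissions from one host or for one submission each from a working
and a broken host. The {package: version} layout and the version numbers in
the example are illustrative assumptions.

```python
def diff_packages(old, new):
    """Compare two {package: version} snapshots.

    Returns the packages that were added, removed, or changed version
    between `old` and `new`.
    """
    added = {p: new[p] for p in new.keys() - old.keys()}
    removed = {p: old[p] for p in old.keys() - new.keys()}
    changed = {
        p: (old[p], new[p])
        for p in old.keys() & new.keys()
        if old[p] != new[p]
    }
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    before = {"net-misc/asterisk": "16.9.0", "app-misc/screen": "4.8.0"}
    after = {"net-misc/asterisk": "16.10.0", "app-misc/tmux": "3.1"}
    print(diff_packages(before, after))
```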
343
344	>
345	> What do you think? Do you foresee other problems? Do you have other
346	> needs? Can you think of better solutions?
347	>
348	I think we should build a hierarchy. So Gentoo-infra at the top. End
349	users may submit only certain types of data there, all other data we as
350	devs don't care about gets discarded, and if we allow users to register
351	there directly we limit the functionality thereof in order to maintain
352	the requirements of the developers here first and foremost.
353
354	As such, the submitted package should be based on "data sets" in my
355	opinion, where the most basic sets could be:
356
357	core:
358	a) package list including versions and use flags
359	b) world and world_sets
360	c) uuid
361	d) hash(hardware ident)
362
363	hardware:
364	a) RAM
365	b) ...
366
367	network:
368	a) ...
369
370	At the Gentoo-infra layer we can then have a policy that we ONLY accept
371	"core" sets. It should be easy to define your own sets at the
372	proxy/aggregator level and provide mechanisms to obtain the data (or to
373	ship them as plugins on the hosts themselves, eg, USE="hardware network"
374	gentoo-stats-plugins style), with the main package only containing what
375	the devs need. Just ideas.
376
377	Further down the hierarchy additional sets could be defined, and
378	proxy/aggregator hosts could define what information they allow higher
379	up the hierarchy.
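
A sketch of the per-layer policy, assuming a submission is simply a mapping
from set name to data. The function name, the field names inside the sets
and the example values are hypothetical; only the set names "core",
"hardware" and "network" come from the list above.

```python
def filter_sets(submission, allowed_sets):
    """Keep only the data sets this layer is allowed to pass upwards."""
    return {
        name: data
        for name, data in submission.items()
        if name in allowed_sets
    }


if __name__ == "__main__":
    submission = {
        "core": {"uuid": "example-uuid", "hw_hash": "example-hash"},
        "hardware": {"ram_mb": 16384},
        "network": {},
    }
    # A site-local aggregator might keep everything for its own reports
    # and forward only "core" to Gentoo infra.
    local_view = filter_sets(submission, {"core", "hardware", "network"})
    upstream_view = filter_sets(local_view, {"core"})
    print(sorted(upstream_view))
```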
380
381	If we receive information for a Gentoo derivative we redirect it to that
382	distribution. Although for such a case we really should provide a way
383	for derivatives to specify their own "default" infra.
384
385	Other projects can then build on top of, or as plug-ins of the core
386	stats project to then provide the more enterprise-like features. One
387	could potentially even go as far as automated updating driven from a
388	central control server in a networked environment where the
389	proxy/aggregator is able to connect back to the individual hosts to
390	execute commands on them.
391
392	I sincerely hope my ramblings haven't been completely off point. I
393	believe the above shows that this can be of benefit to users and
394	developers alike, and hopefully in a way that does not infringe on
395	users' rights or privacy.
396
397	One thing could be for aggregators to submit aggregated stats instead of
398	individual systems, again, same X and Y stuff would apply, however, I
399	think for aggregated submissions the data skew risk becomes even
400	larger. So perhaps we should provide two sets of stats "excluding
401	aggregated stats" and including, or possibly we can mark some
402	aggregators as trusted. I dunno.
403
404	Kind Regards,
405	Jaco
