On Mon, 8 Aug 2016 19:07:04 -0700
Jack Morgan <jmorgan@g.o> wrote:

> On 08/08/16 05:35, Marek Szuba wrote:
> >
> > Bottom line: I would say we do need some way of streamlining ebuild
> > stabilisation.
>
> I vote we fix this problem. I'm tired of having this same discussion
> every 6 or 12 months. I'd like to see less policy discussion and more
> technical solutions to the problems we face.
>
> I propose calling for volunteers to create a new project that works on
> solving our stabilization problem. I see that looking like the
> following:
>
> 1) project identifies the problem(s) with real data from Bugzilla and
> the portage tree.
>
> 2) new project defines a technical proposal for fixing this issue, then
> presents it to the developer community for feedback. This would
> include defining the tools needed or used.
>
> 3) start working on a solution + define a future roadmap
>
> All processes and policies should be on the table for negotiation in
> the potential solution. If we need to reinvent the wheel, then let's
> do it.
>
> To be honest, adding more policy just ends up making everyone unhappy
> one way or the other.

There's a potential way to arrive at a technical solution that somewhat
alleviates the need for such rigorous arch testers, without degrading
the stabilisation mechanic to a "blind monkey system that stabilises
based on conjecture".

I've mentioned it before, ages ago, somewhere on the Gentoo Dev list.

The idea is basically to instrument portage with an (optional) feature
that, when turned on, records and submits certain facts about every
failed or successful install, the objective being to spread the load of
what `tatt` does organically over the participant base.

1. Firstly, make no demands of homogeneity or even sanity for a user's
system to participate. Everything they throw at the system I'm about
to propose should be considered "valid".

2. Every time a package is installed, or an install is attempted, the
exit of that installation is qualified in one of a number of ways:

- installed OK without tests
- installed OK with tests
- failed tests
- failed install
- failed compile
- failed configure

Each of these is a single state in a single field.

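These outcome states are the sort of thing that could be modelled as a
single enumerated field; a minimal sketch (names are hypothetical, not
an existing portage API):

```python
from enum import Enum

class BuildResult(Enum):
    """One terminal state per install attempt, stored in a single field."""
    OK_NO_TESTS = "installed-ok-without-tests"
    OK_WITH_TESTS = "installed-ok-with-tests"
    FAIL_TESTS = "failed-tests"
    FAIL_INSTALL = "failed-install"
    FAIL_COMPILE = "failed-compile"
    FAIL_CONFIGURE = "failed-configure"

# Example: each report carries exactly one of these values.
result = BuildResult.OK_WITH_TESTS
```
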
3. The Name, Version, and SHA1 of the ebuild that generated the report.

4. The USE flags and any other pertinent (and carefully selected by
Gentoo) flags are included, each as single fields in a property set,
and decomposed into structured property lists where possible.

5. <arch> satisfaction data for the target package at the time of
installation is recorded.

e.g.:

KEYWORDS="arch"  + ACCEPT_KEYWORDS="~arch" -> [ "arch(~)" ]
KEYWORDS="~arch" + ACCEPT_KEYWORDS="~arch" -> [ "~arch(~)" ]
KEYWORDS="arch"  + ACCEPT_KEYWORDS="arch"  -> [ "arch" ]
KEYWORDS=""      + ACCEPT_KEYWORDS="**"    -> [ "(**)" ]

This seems redundant, but it's basically saying "hey, if you're insane
and setting lots of different arches in your accept keywords, that
would be relevant data to use to ignore your report". This data can
also be used with other data I'll mention later to isolate users with
"mixed keywording" setups.

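A sketch of how the annotations above could be derived; `arch_spec` is
a hypothetical helper, not real portage code:

```python
def arch_spec(keywords, accept_keywords):
    """Return satisfied-arch annotations mirroring the examples above:
    the keyword that satisfied the install, annotated with '(~)' when
    the testing variant was accepted, or '(**)' when anything was
    accepted.  (Hypothetical helper, not an actual portage function.)
    """
    specs = []
    for acc in accept_keywords:
        if acc == "**":
            # ACCEPT_KEYWORDS="**" matches even an empty KEYWORDS
            specs.append("(**)")
        elif acc.startswith("~"):
            base = acc[1:]
            if base in keywords:        # stable keyword, testing accepted
                specs.append(base + "(~)")
            elif acc in keywords:       # testing keyword, testing accepted
                specs.append(acc + "(~)")
        elif acc in keywords:
            specs.append(acc)           # stable keyword, stable accepted
    return specs
```
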
6. For every dependency listed in *DEPEND, a dictionary/hash of

"specified atom" -> {
    name    -> resolved dependency name
    version -> version of resolved dependency
    arch    -> [ satisfied arch spec as in #5 ]
    sha1    -> some kind of SHA1 that hopefully turns up in gentoo.git
}

is recorded in the report at the time of the result.

The "satisfied arch spec" field is used to isolate anomalies in
keywording and user keyword mixing, and to filter out non-target
reports from stabilization data.

7. A submitter unique identifier.

8. Possibly a submitter-machine unique identifier.

9. The whole build log, included compressed, verbatim.

This latter part will be an independent option to the "reporting"
feature, because it's a slightly more invasive privacy concern than the
others, in that arbitrary code execution can leak private data.

Hence, people who turn this feature on have to know what they're
signing up for.

10. All of the above data is pooled, shipped as a single report,
submitted to a "report server", and aggregated.

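Pulling items 2 through 9 together, a single submitted report could
look something like this sketch (every field name and value here is an
assumption for illustration, not an existing schema):

```python
import base64
import json
import zlib

# Illustrative report payload; field names are hypothetical.
report = {
    "result": "installed-ok-with-tests",   # item 2: one state, one field
    "ebuild": {                            # item 3
        "name": "dev-lang/perl",
        "version": "5.24.0",
        "sha1": "0123456789abcdef0123456789abcdef01234567",
    },
    "use": ["berkdb", "-debug", "gdbm"],   # item 4
    "arch": ["amd64(~)"],                  # item 5
    "depends": {                           # item 6
        ">=sys-libs/zlib-1.2": {
            "name": "sys-libs/zlib",
            "version": "1.2.8",
            "arch": ["amd64"],
            "sha1": "89abcdef0123456789abcdef0123456789abcdef",
        },
    },
    "submitter": "submitter-uuid",         # item 7
    "machine": "machine-uuid",             # item 8
    # item 9: compressed verbatim build log, only if the user opted in
    "build_log": base64.b64encode(zlib.compress(b"build log text")).decode(),
}

# item 10: pooled into a single report and shipped to the report server
payload = json.dumps(report)
```
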
With all of the above, in the most naive of situations, we can use that
data at the very least to give us a lot more assurance than "well, 30
days passed, and nobody complained", because we'll have a paper trail
of a known, countable number of successful installs, which, while not
representative, are likely to still be more diverse and more reassuring
than the deafening silence of no feedback.

And in non-naive situations, the results for given versions can be
aggregated and compared, and the factors that are present can be
statistically correlated with failures.

And this would give us a status board of "here's a bunch of
configurations that seem to be statistically more problematic than
others; might be worth investigating".

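As a toy illustration of that kind of factor correlation (hypothetical
data and a deliberately naive method; a real analysis would want proper
statistics):

```python
from collections import defaultdict

# Each report: (set of observed factors, install succeeded?)
reports = [
    ({"USE:debug", "zlib-1.2.8"}, False),
    ({"USE:debug", "zlib-1.2.7"}, False),
    ({"USE:-debug", "zlib-1.2.8"}, True),
    ({"USE:-debug", "zlib-1.2.7"}, True),
]

counts = defaultdict(lambda: [0, 0])   # factor -> [failures, total]
for factors, ok in reports:
    for factor in factors:
        counts[factor][1] += 1
        if not ok:
            counts[factor][0] += 1

overall = sum(1 for _, ok in reports if not ok) / len(reports)

# Rank factors by how far their failure rate exceeds the overall rate:
# the top entries are "likely part of the problem".
ranked = sorted(counts,
                key=lambda f: counts[f][0] / counts[f][1] - overall,
                reverse=True)
```
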
But there would be no burden to actually dive into the logs unless you
found clusters of failures from different sources failing under the
same scenarios. (And this is why not everyone *has* to send build logs
for this to be effective: just enough people have to report "x
configuration bad", and some subset of them have to provide elucidating
logs.)

None of what I mention here is conceptually "new"; I've just
re-explained the entire CPAN Testers model in terms relevant to Gentoo,
using Gentoo parts instead of CPAN parts.

And CPAN authors find the CPAN Testers model *very effective* for being
assured they didn't break anything: they ship a TRIAL release (akin to
our ~arch), and then wait a week or so while people download and test
it.

And pretty much anyone can become "a tester": there's no barrier to
entry, and no requirements for membership. Just install the tools, get
yourself an ID, and start installing stuff with tests (the default);
the tools you have will automatically fire off those reports to the
hive, you get a big pretty matrix of "we're good here", and then, after
no red results in some period, they go "hey, yep, we're good" and ship
a stable release.

Or maybe you get occasional pockets of "you dun goofed" where there
will be a problem you might have to look into (sometimes those problems
are entirely invalid problems ... this is somehow typically not an
issue):

http://matrix.cpantesters.org/?dist=App-perlbrew+0.76

And if you throw variant analysis into the mix, you get those other
facts compared and ranked by "likelihood to be part of the problem":

http://analysis.cpantesters.org/solved?distv=App-perlbrew-0.76

^ You can see here that variant analysis found 3 common strings in the
logs that indicated a failure, and as a result it pointed the finger
directly at the failing test. And then at rank #3, you can see it
pointing a finger at CPAN::Perl::Releases as "a possible problem highly
correlated with failures", with the -0.5 theta on version 2.88.

Lo and behold, automated differential analysis has found the bug:

https://rt.cpan.org/Ticket/Display.html?id=116517

180 |
It still takes a human to |
181 |
|
182 |
a) decide to look |
183 |
b) decide the differential factors are useful enough to pursue |
184 |
c) verify the problem manually by using the guidance given |
185 |
d) manually file the bug |
186 |
|
187 |
But the point here is we can actually build some infrastructure that |
188 |
will give automated tooling some degree of assurance that "this can |
189 |
probably be safely stabilized now, the testers aren't seeing any issues" |
190 |
|
191 |
Its just also the sort of data collection that can lend itself to much |
192 |
more powerful benefits as well. |
The only hard parts are:

1. Making a good server to handle these reports that scales well
2. Making a good client for report generation, collection from portage,
and submission
3. Getting people to turn on the feature
4. Getting enough people using the feature that the majority of the
"easy" stabilizations can happen hands-free

And we don't even have to do the "fancy" parts of it now.

Just pools of "package: archa = 100pass/0fail, archb = 10pass/0fail"

would be a great start.

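That minimal version is just a pass/fail counter keyed on package and
arch; a toy server-side tally, purely illustrative:

```python
from collections import defaultdict

# Toy aggregation: (package, arch) -> [pass count, fail count].
tally = defaultdict(lambda: [0, 0])

def record(package, arch, ok):
    """Fold one incoming report into the pool."""
    tally[(package, arch)][0 if ok else 1] += 1

# Feed in some reports...
record("dev-lang/perl-5.24.0", "amd64", True)
record("dev-lang/perl-5.24.0", "amd64", True)
record("dev-lang/perl-5.24.0", "arm", False)

# ...and render the status board.
for (pkg, arch), (passed, failed) in sorted(tally.items()):
    print(f"{pkg}: {arch} = {passed}pass/{failed}fail")
```
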
Because otherwise we're relying 100% on negative feedback, and assuming
that the absence of negative feedback is positive, when the reality
might be closer to this: the problems were too confusing to report as
an explicit bug; or the problems faced were deemed unimportant by the
person in question and they gave up before reporting them; or the user
encountered some other entry barrier to reporting; ... or maybe nobody
is actually using the package at all, so it could be completely broken
and nobody would notice.

And it seems entirely haphazard to encourage tooling that *builds* upon
that assumption.

At least with the manual stabilization process, you can be assured that
at least one human will personally install, test, and verify that a
package works in at least one situation.

With completely automated stabilization that relies on the absence of
negative feedback to stabilize, you're *not even getting that*.

Why bother with stabilization at all if the entire thing is merely
*conjecture*?

Even a broken, flawed stabilization workflow done by teams of people
who are bad at testing is better than a stabilization workflow
implemented on conjecture of stability :P