On Mon, 8 Aug 2016 19:07:04 -0700
Jack Morgan <jmorgan@g.o> wrote:

> On 08/08/16 05:35, Marek Szuba wrote:
> >
> > Bottom line: I would say we do need some way of streamlining ebuild
> > stabilisation.
>
> I vote we fix this problem. I'm tired of having this same discussion
> every 6 or 12 months. I'd like to see less policy discussion and more
> technical solutions to the problems we face.
>
> I propose calling for volunteers to create a new project that works on
> solving our stabilization problem. I see that looking like the
> following:
>
> 1) project identifies the problem(s) with real data from Bugzilla and
> the portage tree.
>
> 2) new project defines a technical proposal for fixing this issue, then
> presents it to the developer community for feedback. This would
> include defining the tools needed or used.
>
> 3) start working on a solution + define a future roadmap
>
> All processes and policies should be on the table for negotiation in
> the potential solution. If we need to reinvent the wheel, then let's
> do it.
>
> To be honest, adding more policy just ends up making everyone unhappy
> one way or the other.

There's a potential way to arrive at a technical solution that somewhat
alleviates the need for such rigorous arch testers, without degrading
the stabilisation mechanic to a "blind monkey system that stabilises
based on conjecture".

I've mentioned it before, ages ago, somewhere on the Gentoo Dev list.

The idea is basically to instrument portage with an (optional) feature
that, when turned on, records and submits certain facts about every
failed or successful install, the objective being to spread the load of
what `tatt` does organically over the participant base.

1. Firstly, make no demands of homogeneity or even sanity for a user's
system to participate. Everything they throw at the system I'm about
to propose should be considered "valid".

2. Every time a package is installed, or an install is attempted, the
exit of that installation is qualified in one of a number of ways:

- installed OK without tests
- installed OK with tests
- failed tests
- failed install
- failed compile
- failed configure

Each of these is a single state in a single field.

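These outcome states are the sort of thing that could be modelled as a
single enumerated field; a minimal sketch (names are hypothetical, not
an existing portage API):

```python
from enum import Enum

class BuildResult(Enum):
    """One terminal state per install attempt, stored in a single field."""
    OK_NO_TESTS = "installed-ok-without-tests"
    OK_WITH_TESTS = "installed-ok-with-tests"
    FAIL_TESTS = "failed-tests"
    FAIL_INSTALL = "failed-install"
    FAIL_COMPILE = "failed-compile"
    FAIL_CONFIGURE = "failed-configure"

# Example: each report carries exactly one of these values.
result = BuildResult.OK_WITH_TESTS
```
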
3. The Name, Version, and SHA1 of the ebuild that generated the report.

4. The USE flags and any other pertinent (and carefully selected by
Gentoo) flags are included, each as single fields in a property set,
and decomposed into structured property lists where possible.

5. <arch> satisfaction data for the target package at the time of
installation is recorded.

e.g.:

KEYWORDS="arch"  + ACCEPT_KEYWORDS="~arch" -> [ "arch(~)" ]
KEYWORDS="~arch" + ACCEPT_KEYWORDS="~arch" -> [ "~arch(~)" ]
KEYWORDS="arch"  + ACCEPT_KEYWORDS="arch"  -> [ "arch" ]
KEYWORDS=""      + ACCEPT_KEYWORDS="**"    -> [ "(**)" ]

This seems redundant, but it's basically saying "hey, if you're insane
and setting lots of different arches in your accept keywords, that
would be relevant data to use to ignore your report". This data can
also be used with other data I'll mention later to isolate users with
"mixed keywording" setups.

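A sketch of how the annotations above could be derived; `arch_spec` is
a hypothetical helper, not real portage code:

```python
def arch_spec(keywords, accept_keywords):
    """Return satisfied-arch annotations mirroring the examples above:
    the keyword that satisfied the install, annotated with '(~)' when
    the testing variant was accepted, or '(**)' when anything was
    accepted.  (Hypothetical helper, not an actual portage function.)
    """
    specs = []
    for acc in accept_keywords:
        if acc == "**":
            # ACCEPT_KEYWORDS="**" matches even an empty KEYWORDS
            specs.append("(**)")
        elif acc.startswith("~"):
            base = acc[1:]
            if base in keywords:        # stable keyword, testing accepted
                specs.append(base + "(~)")
            elif acc in keywords:       # testing keyword, testing accepted
                specs.append(acc + "(~)")
        elif acc in keywords:
            specs.append(acc)           # stable keyword, stable accepted
    return specs
```
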
6. For every dependency listed in *DEPEND, a dictionary/hash of

"specified atom" -> {
    name    -> resolved dependency name
    version -> version of resolved dependency
    arch    -> [ satisfied arch spec as in #5 ]
    sha1    -> some kind of SHA1 that hopefully turns up in gentoo.git
}

is recorded in the report at the time of the result.

The "satisfied arch spec" field is used to isolate anomalies in
keywording and user keyword mixing, and to filter out non-target
reports from stabilization data.

7. A submitter unique identifier.

8. Possibly a submitter-machine unique identifier.

9. The whole build log, included compressed, verbatim.

This latter part will be an independent option to the "reporting"
feature, because it's a slightly more invasive privacy concern than the
others, in that arbitrary code execution can leak private data.

Hence, people who turn this feature on have to know what they're
signing up for.

10. All of the above data is pooled, shipped as a single report,
submitted to a "report server", and aggregated.

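Pulling items 2 through 9 together, a single submitted report could
look something like this sketch (every field name and value here is an
assumption for illustration, not an existing schema):

```python
import base64
import json
import zlib

# Illustrative report payload; field names are hypothetical.
report = {
    "result": "installed-ok-with-tests",   # item 2: one state, one field
    "ebuild": {                            # item 3
        "name": "dev-lang/perl",
        "version": "5.24.0",
        "sha1": "0123456789abcdef0123456789abcdef01234567",
    },
    "use": ["berkdb", "-debug", "gdbm"],   # item 4
    "arch": ["amd64(~)"],                  # item 5
    "depends": {                           # item 6
        ">=sys-libs/zlib-1.2": {
            "name": "sys-libs/zlib",
            "version": "1.2.8",
            "arch": ["amd64"],
            "sha1": "89abcdef0123456789abcdef0123456789abcdef",
        },
    },
    "submitter": "submitter-uuid",         # item 7
    "machine": "machine-uuid",             # item 8
    # item 9: compressed verbatim build log, only if the user opted in
    "build_log": base64.b64encode(zlib.compress(b"build log text")).decode(),
}

# item 10: pooled into a single report and shipped to the report server
payload = json.dumps(report)
```
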
With all of the above, in the most naive of situations, we can use that
data at the very least to give us a lot more assurance than "well, 30
days passed, and nobody complained", because we'll have a paper trail
of a known, countable number of successful installs, which, while not
representative, are likely to still be more diverse and more reassuring
than the deafening silence of no feedback.

And in non-naive situations, the results for given versions can be
aggregated and compared, and the factors that are present can be
statistically correlated with failures.

And this would give us a status board of "here's a bunch of
configurations that seem to be statistically more problematic than
others; might be worth investigating".

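As a toy illustration of that kind of factor correlation (hypothetical
data and a deliberately naive method; a real analysis would want proper
statistics):

```python
from collections import defaultdict

# Each report: (set of observed factors, install succeeded?)
reports = [
    ({"USE:debug", "zlib-1.2.8"}, False),
    ({"USE:debug", "zlib-1.2.7"}, False),
    ({"USE:-debug", "zlib-1.2.8"}, True),
    ({"USE:-debug", "zlib-1.2.7"}, True),
]

counts = defaultdict(lambda: [0, 0])   # factor -> [failures, total]
for factors, ok in reports:
    for factor in factors:
        counts[factor][1] += 1
        if not ok:
            counts[factor][0] += 1

overall = sum(1 for _, ok in reports if not ok) / len(reports)

# Rank factors by how far their failure rate exceeds the overall rate:
# the top entries are "likely part of the problem".
ranked = sorted(counts,
                key=lambda f: counts[f][0] / counts[f][1] - overall,
                reverse=True)
```
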
But there would be no burden to actually dive into the logs unless you
found clusters of failures from different sources failing under the
same scenarios. (And this is why not everyone *has* to send build logs
for this to be effective: just enough people have to report "x
configuration bad", and some subset of them have to provide elucidating
logs.)

None of what I mention here is conceptually "new"; I've just
re-explained the entire CPAN Testers model in terms relevant to Gentoo,
using Gentoo parts instead of CPAN parts.

And CPAN authors find the CPAN Testers model *very effective* for being
assured they didn't break anything: they ship a TRIAL release (akin to
our ~arch), and then wait a week or so while people download and test
it.

And pretty much anyone can become "a tester": there's no barrier to
entry, and no requirements for membership. Just install the tools, get
yourself an ID, and start installing stuff with tests (the default);
the tools you have will automatically fire off those reports to the
hive, you get a big pretty matrix of "we're good here", and then, after
no red results in some period, they go "hey, yep, we're good" and ship
a stable release.

Or maybe you get occasional pockets of "you dun goofed" where there
will be a problem you might have to look into (sometimes those problems
are entirely invalid problems ... this is somehow typically not an
issue):

http://matrix.cpantesters.org/?dist=App-perlbrew+0.76

And if you throw variant analysis into the mix, you get those other
facts compared and ranked by "likelihood to be part of the problem":

http://analysis.cpantesters.org/solved?distv=App-perlbrew-0.76

^ You can see here that variant analysis found 3 common strings in the
logs that indicated a failure, and as a result it pointed the finger
directly at the failing test. And then at rank #3, you can see it
pointing a finger at CPAN::Perl::Releases as "a possible problem highly
correlated with failures", with the -0.5 theta on version 2.88.

Lo and behold, automated differential analysis has found the bug:

https://rt.cpan.org/Ticket/Display.html?id=116517

180 |
It still takes a human to |
181 |
|
182 |
a) decide to look |
183 |
b) decide the differential factors are useful enough to pursue |
184 |
c) verify the problem manually by using the guidance given |
185 |
d) manually file the bug |
186 |
|
187 |
But the point here is we can actually build some infrastructure that |
188 |
will give automated tooling some degree of assurance that "this can |
189 |
probably be safely stabilized now, the testers aren't seeing any issues" |
190 |
|
191 |
Its just also the sort of data collection that can lend itself to much |
192 |
more powerful benefits as well. |
The only hard parts are:

1. Making a good server to handle these reports that scales well
2. Making a good client for report generation, collection from portage,
and submission
3. Getting people to turn on the feature
4. Getting enough people using the feature that the majority of the
"easy" stabilizations can happen hands-free

And we don't even have to do the "fancy" parts of it now.

Just pools of "package: archa = 100pass/0fail, archb = 10pass/0fail"

would be a great start.

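That minimal version is just a pass/fail counter keyed on package and
arch; a toy server-side tally, purely illustrative:

```python
from collections import defaultdict

# Toy aggregation: (package, arch) -> [pass count, fail count].
tally = defaultdict(lambda: [0, 0])

def record(package, arch, ok):
    """Fold one incoming report into the pool."""
    tally[(package, arch)][0 if ok else 1] += 1

# Feed in some reports...
record("dev-lang/perl-5.24.0", "amd64", True)
record("dev-lang/perl-5.24.0", "amd64", True)
record("dev-lang/perl-5.24.0", "arm", False)

# ...and render the status board.
for (pkg, arch), (passed, failed) in sorted(tally.items()):
    print(f"{pkg}: {arch} = {passed}pass/{failed}fail")
```
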
Because otherwise we're relying 100% on negative feedback, and assuming
that the absence of negative feedback is positive, when the reality
might be closer to this: the problems were too confusing to report as
an explicit bug; or the problems faced were deemed unimportant by the
person in question and they gave up before reporting them; or the user
encountered some other entry barrier to reporting; ... or maybe nobody
is actually using the package at all, so it could be completely broken
and nobody would notice.

And it seems entirely haphazard to encourage tooling that *builds* upon
that assumption.

At least with the manual stabilization process, you can be assured that
at least one human will personally install, test, and verify that a
package works in at least one situation.

With completely automated stabilization that relies on the absence of
negative feedback to stabilize, you're *not even getting that*.

Why bother with stabilization at all if the entire thing is merely
*conjecture*?

Even a broken, flawed stabilization workflow done by teams of people
who are bad at testing is better than a stabilization workflow
implemented on conjecture of stability :P