Gentoo Archives: gentoo-dev

From: Alec Warner <antarus@g.o>
To: Gentoo Dev <gentoo-dev@l.g.o>
Subject: Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Date: Tue, 05 May 2020 06:19:35
Message-Id: CAAr7Pr-3GMOrLjp1wNO5T4VYyUMjvQBYEW2D_4ULDN4__DcFCg@mail.gmail.com
In Reply to: Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation by Matt Turner
On Mon, May 4, 2020 at 10:14 PM Matt Turner <mattst88@g.o> wrote:

> On Mon, May 4, 2020 at 5:48 PM Thomas Deutschmann <whissi@g.o> wrote:
> >
> > On 2020-04-26 15:46, Kent Fredric wrote:
> > > On Sun, 26 Apr 2020 14:38:54 +0200
> > > Thomas Deutschmann <whissi@g.o> wrote:
> > >
> > >> Let's assume we will get reports that app-misc/foo is only
> > >> installed 20 times. If you are going to judge based on this data,
> > >> "Obviously, nobody is using that package, it's stuck on
> > >> <whatever>... safe to remove", your view is biased:
> > >
> > > I see this as more like what bloom filters get you, but in reverse:
> > >
> > > [...]
> > >
> > > - But now, instead of having "we don't know if anybody uses this",
> > >   you *can* have a "we know for sure somebody uses this".
> >
> > But how does that information really help us to decide anything in
> > the end?
> >
> > Case A, stats are showing 0 users:
> >
> > As said, we can't know if this is true or if this package is only
> > used in setups where people don't report stats.
> >
> > Case B, stats are showing x users:
> >
> > Now what? The package from case A could have a similar number of
> > users -- we just don't know. Assume firefox has 1,000 users, chromium
> > has 500 users and vivaldi doesn't show up in stats. How does that
> > help us? Would this allow us to skip publishing GLSAs for vivaldi
> > because we assume nobody in Gentoo is using vivaldi? Does it allow
> > the Python project to go forward with a mask for removal in case
> > vivaldi depends on a Python version that the Python project wants to
> > get rid of? Would this allow Gentoo PR to make a public statement
> > like "Firefox is the most popular browser in Gentoo, with twice as
> > many users as chromium"?
>
> I hate the saying "the perfect is the enemy of the good" but I think
> it applies here.
>
> You're of course correct that we would not have perfect information.
> But the thing about statistics is that you can still know some things
> based on a sampling of that perfect information.
>
> I would personally like to have data on whether users of my packages
> have certain USE flags enabled. Knowing that would allow me to decide
> whether it's worth the maintenance burden of supporting features that I
> *think* are very rarely used. If instead the data showed me that 50%
> of users had IUSE=xyz enabled, I probably wouldn't consider removing
> it.
>
> I think your example of potential misuse of data is a bit overdramatic.
>

Let me present the same point another way.

Today we have no data, so we make an arbitrary decision. It might be right
or wrong, and we may not know until after we decide. Traditionally this is
a "break them and they will come" type of process: "Mask it; if they
complain, I'll unmask it."

In the future, we could have this package data. It may influence decision
making. However, I'm not sure that, from a decision-making standpoint, it
is strictly worse than no data. The danger (which is what I think Whissi's
concern is) is that it could artificially increase decision certainty.

For example, say I have to decide whether to keep a package, or a flag, or
whatever. I might make an arbitrary decision. I'm aware it's arbitrary and
that it might be wrong, so I'm not super attached to such a decision. I'm
not *certain* about it, but I have to decide one way or the other[0]. Then
I move to a world with package data. Now I'm no longer making an arbitrary
decision; I'm making a decision based on *data*. The *data* tells me my
decision is correct, resulting in a more *certain* decision outcome. I
think this is the fallacy we want to avoid. The data can be informative,
but there are significant biases in it that should result in very *little*
certainty added to decision making.
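
A minimal sketch of how opt-in reporting can mislead, assuming made-up
install counts and opt-in rates. Apart from app-misc/foo and its 20
reported installs, which echo the example quoted above, every name and
number here (app-misc/bar, the true install counts, the opt-in rates, the
helper functions) is hypothetical:

```python
# Hypothetical data for illustration only; none of it comes from real
# Gentoo statistics.
# name -> (true installs, fraction of those users who opt in to reporting)
packages = {
    "app-misc/foo": (2000, 0.01),  # e.g. mostly servers that never report
    "app-misc/bar": (400, 0.25),   # e.g. mostly desktops that often report
}


def reported_installs(true_installs, opt_in_rate):
    """Installs visible in the stats when only opt-in users report."""
    return round(true_installs * opt_in_rate)


def rank(counts):
    """Package names ordered from most to least installs."""
    return sorted(counts, key=counts.get, reverse=True)


true_counts = {name: true for name, (true, _rate) in packages.items()}
reported = {
    name: reported_installs(true, rate)
    for name, (true, rate) in packages.items()
}

print("reported:", reported)
print("ranking by reported installs:", rank(reported))
print("ranking by true installs:    ", rank(true_counts))
```

With these assumed numbers the stats show foo at 20 installs and bar at
100, so the reported ranking is the reverse of the true one. The reported
counts are still informative (somebody does use foo), but they say little
about relative popularity unless the opt-in rates are known.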

Making decisions based on incomplete data is just life though, so I'm
fairly skeptical of a "we shouldn't collect any data" type of mindset. I'd
be curious to see if we can instill a *culture* component around the use of
data in our development workflows.

-A

[0] There are a bunch of other cultural components here, like different
decision types (1 vs 2) and the ability to make a mistake in public and not
feel bad about it; so I'm aware reality does not reflect this trivial
example. But those are hallmarks of cultural markets I'd like to aim for in
Gentoo, so I would prefer to discuss a world where they exist ;)