On Mon, May 4, 2020 at 10:14 PM Matt Turner <mattst88@g.o> wrote:

> On Mon, May 4, 2020 at 5:48 PM Thomas Deutschmann <whissi@g.o> wrote:
> >
> > On 2020-04-26 15:46, Kent Fredric wrote:
> > > On Sun, 26 Apr 2020 14:38:54 +0200
> > > Thomas Deutschmann <whissi@g.o> wrote:
> > >
> > >> Let's assume we will get reports that app-misc/foo is only
> > >> installed 20 times. If you are going to judge based on this data,
> > >> "Obviously, nobody is using that package, it's stuck on
> > >> <whatever>... safe to remove" your view is biased:
> > >
> > > I see this as more like what bloom filters get you, but in reverse:
> > >
> > > [...]
> > >
> > > - But now, instead of having "we don't know if anybody uses this", you
> > > *can* have a "we know for sure somebody uses this".
> >
> > But how does that information really help us to decide anything in the
> > end?
> >
> > Case A, stats are showing 0 users:
> >
> > Like said, we can't know if this is true or if this package is only used
> > in setups where people don't report stats.
> >
> >
> > Case B, stats are showing x users:
> >
> > Now what? Package from case A could have similar users -- we just don't
> > know. Assume firefox has 1.000 users, chromium has 500 users and vivaldi
> > doesn't show up in stats. How does that help us? Would this allow us to
> > skip publishing GLSAs for vivaldi because we assume nobody in Gentoo is
> > using vivaldi? Does it allow Python project to go forward pushing a mask
> > for removal in case vivaldi would depend on Python version, Python
> > project want to get rid of? Would this allow Gentoo PR to make a public
> > statement like "Firefox is the most popular browser in Gentoo, twice as
> > many users as chromium"?
>
> I hate the saying "the perfect is the enemy of the good" but I think
> it applies here.
>
> You're of course correct that we would not have perfect information.
> But the thing about statistics is that you can still know some things
> based on a sampling of that perfect information.
>
> I would personally like to have data on whether users of my packages
> have certain USE flags enabled. Knowing that would allow me to decide
> whether it's worth the maintenance burden of supporting features that I
> *think* are very rarely used. If instead the data showed me that 50%
> of users had IUSE=xyz enabled, I probably wouldn't consider removing
> it.
>
> I think your example of potential misuse of data is a bit overdramatic.
>

Let me present the same point another way.

Today we have no data, so we make an arbitrary decision. It might be right
or wrong, and we may not know until after we decide. Traditionally this is
a "break them and they will come" type of process: "Mask it; if they
complain, I'll unmask it."

In the future, we could have this package data. It may influence decision
making. However, I'm not sure from a decision-making standpoint that it is
strictly worse than no data. The danger (which is what I think Whissi's
concern is) is that it could artificially increase decision certainty.

For example, say I have to decide whether to keep a package, or a flag, or
whatever. I might make an arbitrary decision. I'm aware it's arbitrary and
might be wrong, so I'm not super attached to it. I'm not *certain* about
it, but I have to decide one way or the other[0]. Then I move to a world
with package data. Now I'm no longer making an arbitrary decision; I'm
making a decision based on *data*. The *data* tells me my decision is
correct, resulting in a more *certain* decision outcome. I think this is
the fallacy we want to avoid. The data can be informative, but there are
significant biases in it that should result in very *little* certainty
added to decision making.
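
To make the bias concrete, here is a toy sketch in Python. Everything in it
is an assumption for illustration: the package names, install counts, and
opt-in percentages are invented, and `reported_installs` is a hypothetical
helper, not part of any Gentoo tooling.

```python
# Toy model of opt-in statistics. All names and numbers are invented.

def reported_installs(true_installs: int, opt_in_percent: int) -> int:
    """Installs visible in the stats if opt_in_percent of users report."""
    return true_installs * opt_in_percent // 100

# Hypothetical packages: (true installs, percent of those users who opt in).
packages = {
    "desktop-browser": (1000, 10),  # desktop users, some of whom opt in
    "server-daemon": (1000, 0),     # locked-down servers, nobody reports
}

reported = {
    name: reported_installs(installs, pct)
    for name, (installs, pct) in packages.items()
}

for name, count in reported.items():
    true_installs = packages[name][0]
    print(f"{name}: reported={count}, true={true_installs}")
```

Both packages have the same number of real users in this sketch, yet the
stats would show 100 for one and 0 for the other. A nonzero count tells us
"somebody uses this"; a zero only tells us that nobody who reports uses it.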

Making decisions based on incomplete data is just life though, so I'm
fairly skeptical of a "we shouldn't collect any data" type of mindset. I'd
be curious to see if we can instill a *culture* component around the use of
data in our development workflows.

-A

[0] There are a bunch of other cultural components here, like different
decision types (1 vs 2) and the ability to make a mistake in public and not
feel bad about it; so I'm aware reality does not reflect this trivial
example. But those are hallmarks of cultural markets I'd like to aim for in
Gentoo, so I would prefer to discuss a world where they exist ;)