Hi,

On 21-05-2020 10:47:07 +0200, Michał Górny wrote:
> Hi,
>
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
>
>
> Problem
> =======
> Goose currently lacks proper limiting of submitted data. The only
> limiter currently in place is based on a unique submitter id that is
> randomly generated at setup time and fully controlled by the submitter.
> This only protects against accidental duplicates; it can't protect
> against deliberate action.
>
> An attacker could easily submit thousands (millions?) of fake entries
> by issuing a lot of requests with different ids. Creating them is as
> trivial as using successive numbers. The potential damage includes:

Perhaps you could consider something like a reputation system. I'm
thinking of things like only publishing results after X hours when an
id is new (graylisting?), and gradually building up "trust" there.
During those X hours you could then flag a submission as potentially
fraudulent if you see a new user id, loads of submissions from the same
IP, etc. -- what you describe below, I think.

The reputation logic could further build on whether a submission
appears to follow a norm, e.g. compilation times that fall within the
average for the given cpu/arch configuration.
Another way would be to check for submissions for packages that were
actually bumped/stabilised in the tree, to score an id as more likely
to be genuine.
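
Roughly what I have in mind, as a toy sketch in Python (the class,
method names, and the X-hours threshold are all made up, not anything
goose has today):

```python
import time

# Graylisting sketch: submissions from ids younger than GRAYLIST_SECONDS
# are held back and only published once the id has aged past the
# threshold. All names and the threshold are hypothetical.

GRAYLIST_SECONDS = 24 * 3600  # "X hours" of graylisting


class Graylist:
    def __init__(self):
        self.first_seen = {}   # submitter id -> time of first submission
        self.pending = []      # (id, submission) pairs held back

    def submit(self, submitter_id, submission, now=None):
        now = time.time() if now is None else now
        self.first_seen.setdefault(submitter_id, now)
        age = now - self.first_seen[submitter_id]
        if age < GRAYLIST_SECONDS:
            # new id: hold back, do not publish yet
            self.pending.append((submitter_id, submission))
            return "graylisted"
        return "published"

    def release(self, now=None):
        # publish held-back submissions whose ids have aged past X hours
        now = time.time() if now is None else now
        ready = [(i, s) for i, s in self.pending
                 if now - self.first_seen[i] >= GRAYLIST_SECONDS]
        self.pending = [(i, s) for i, s in self.pending
                        if now - self.first_seen[i] < GRAYLIST_SECONDS]
        return ready
```

The held-back window is also where the fraud checks below would run,
before anything in `pending` is released.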

I think it will be a tad complicated, but static limiting might be as
easy to circumvent as it is to put in place, as has been pointed out
already.

Perhaps it is fruitful to think of the reverse: when is something
obviously bad? When a single (obscure?) package is suddenly reported
many times by new ids? When a single id generates hundreds or thousands
of package submissions (is it a misconfigured cluster, many identical
packages, or what looks like an a-to-z scan)?
The thing is, would a single "fake" submission (which IMO will likely
never be noticed) screw up the overall state of things? I think the
fuzziness of the system as a whole should cover for these. It is pure
poisoning that should be mitigated, and I agree with you that
preferably most of it should be blocked by default. The fact probably
is that it will happen nevertheless.
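
The "obviously bad" checks above are easy to prototype; here is a toy
Python version (the thresholds and flag format are made up for
illustration):

```python
from collections import Counter

# Toy anomaly flags: a package suddenly reported by many brand-new ids,
# or a single id producing an implausible number of submissions.
# Thresholds are invented, not tuned.
NEW_ID_REPORTS_MAX = 50
PER_ID_SUBMISSIONS_MAX = 1000


def suspicious(submissions, known_ids):
    """submissions: list of (submitter_id, package) tuples."""
    flags = set()
    # reports of each package coming from ids we have never seen before
    new_id_reports = Counter(pkg for sid, pkg in submissions
                             if sid not in known_ids)
    # total submissions per id
    per_id = Counter(sid for sid, _ in submissions)
    flags.update(f"package:{p}" for p, n in new_id_reports.items()
                 if n > NEW_ID_REPORTS_MAX)
    flags.update(f"id:{i}" for i, n in per_id.items()
                 if n > PER_ID_SUBMISSIONS_MAX)
    return flags
```

Anything flagged would then be held for review (or dropped) rather than
published, in line with the graylisting idea.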

That brings me to the thought: are there things that can be done to
make sure a fraudulent action can be easily undone or negated somehow?
E.g. should a log be kept, or some mechanism to roll back and replay?
Sorry to have no concrete examples here.

Fabian

>
> - distorting the metrics to the point of them being useless (even
> though some people consider them useless by design).
>
> - submitting lots of arbitrary data to cause a DoS by growing
> the database until no disk space is left.
>
> - blocking a large range of valid user ids, making collisions with
> legitimate users more likely.
>
> I don't think it worthwhile to discuss the motivation for doing so:
> whether it would be someone wishing harm to Gentoo, disagreeing with
> the project or merely wanting to try and see if it would work. The
> case of SKS keyservers teaches us the lesson that you can't leave
> holes like this open for a long time, because someone will eventually
> abuse them.
>
>
> Option 1: IP-based limiting
> ===========================
> The original idea was to set a hard limit on submissions per week
> based on the IP address of the submitter. This has (at least as far
> as IPv4 is concerned) the advantages that:
>
> - the submitter has limited control over his IP address (i.e. he
> can't just submit stuff using arbitrary addresses)
>
> - the IP address range is naturally limited
>
> - IP addresses have non-zero cost
>
> This method could strongly reduce the number of fake submissions one
> attacker could devise. However, it has a few problems too:
>
> - a low limit would harm legitimate submitters sharing an IP address
> (i.e. behind NAT)
>
> - it actively favors people with access to a large number of IP
> addresses
>
> - it doesn't map cleanly to IPv6 (where some people may have just one
> IP address, and others may have whole /64 or /48 ranges)
>
> - it may cause problems for users of anonymizing networks (and we
> want to encourage Tor usage for privacy)
>
> All this considered, IP address limiting can't be used as the primary
> method of preventing fake submissions. However, I suppose it could
> work as additional DoS prevention, limiting the number of submissions
> from a single address over short periods of time.
>
> Example: if we limit to 10 requests an hour, then a single IP can be
> used to manufacture at most 240 submissions a day. This might still
> be enough to render the metrics unusable, but it should keep the
> database reasonably safe.
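
Such a short-period limiter is simple enough to implement. A sketch of
the sliding-window variant in Python (constants chosen only to match
the 10-per-hour example above):

```python
import time
from collections import defaultdict, deque

# Sliding-window per-IP limiter: at most RATE_LIMIT submissions per
# WINDOW seconds. 10/hour caps one IP at 240 submissions a day.
RATE_LIMIT = 10
WINDOW = 3600  # one hour, in seconds


class IpRateLimiter:
    def __init__(self):
        # ip -> timestamps of recently accepted requests
        self.hits = defaultdict(deque)

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.hits[ip]
        # drop timestamps that have fallen out of the window
        while q and now - q[0] >= WINDOW:
            q.popleft()
        if len(q) >= RATE_LIMIT:
            return False  # over the limit, reject
        q.append(now)
        return True
```

Whether this is keyed on the exact address or on a whole prefix (to
handle the IPv6 /64 case) is a policy decision on top of it.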
>
>
> Option 2: proof-of-work
> =======================
> An alternative, using a proof-of-work algorithm, was suggested to me
> yesterday. The idea is that every submission has to be accompanied by
> the result of some cumbersome calculation that can't be trivially run
> in parallel or offloaded to dedicated hardware.
>
> On the plus side, it would rely more on actual physical hardware than
> on IP addresses provided by ISPs. While it would be a waste of CPU
> time and memory, doing it just once a week wouldn't do that much
> harm.
>
> On the minus side, it would penalize people with weak hardware.
>
> For example, 'time hashcash -m -b 28 -r test' gives:
>
> - 34 s (-s estimated 38 s) on a Ryzen 5 3600
>
> - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
>
> At the same time, it would still permit a lot of fake submissions.
> For example, randomx [1] claims to require 2G of memory in fast mode.
> This would still allow me to use 7 threads. If we adjusted the
> algorithm to take ~30 seconds, that means 7 submissions every 30 s,
> i.e. 20k submissions a day.
>
> So in the end, while this is interesting, it doesn't seem like
> a workable anti-spam measure.
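
For reference, the hashcash scheme boils down to a partial hash
preimage search. A toy Python version (using SHA-256 rather than
hashcash's SHA-1 stamp format, so not interoperable with the real
tool):

```python
import hashlib
from itertools import count

# Hashcash-style proof of work: find a counter such that
# SHA-256("resource:counter") has at least `bits` leading zero bits.
# Expected cost of mint() doubles with every extra bit; verify() is
# a single hash, so checking submissions stays cheap.


def mint(resource, bits):
    target = 1 << (256 - bits)  # digests below this start with `bits` zeros
    for counter in count():
        digest = hashlib.sha256(f"{resource}:{counter}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return counter


def verify(resource, counter, bits):
    digest = hashlib.sha256(f"{resource}:{counter}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))
```

The asymmetry (expensive mint, one-hash verify) is the whole point,
but as the 7-threads arithmetic above shows, it only raises the cost
per fake submission; it doesn't bound the total.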
>
>
> Option 3: explicit CAPTCHA
> ==========================
> A traditional way of dealing with spam -- require every new system
> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> per CAPTCHA).
>
> The advantage of this method is that it requires real human work to
> be performed, effectively limiting the ability to submit spam.
> The disadvantage is that it is cumbersome to users, so many of them
> will just resign from participating.
154 |
> Other ideas |
155 |
> =========== |
156 |
> Do you have any other ideas on how we could resolve this? |
157 |
> |
158 |
> |
159 |
> [1] https://github.com/tevador/RandomX |
160 |
> |
161 |
> |
162 |
> -- |
163 |
> Best regards, |
164 |
> Michał Górny |
165 |
> |
166 |
|
167 |
|
168 |
|
169 |
-- |
170 |
Fabian Groffen |
171 |
Gentoo on a different level |