Re: [gentoo-dev] [RFC] Anti-spam for goose - gentoo-dev

From:	Tomas Mozes <hydrapolic@×××××.com>
To:	gentoo-dev@l.g.o
Subject:	Re: [gentoo-dev] [RFC] Anti-spam for goose
Date:	Thu, 21 May 2020 10:37:36
Message-Id:	`CAG6MAzSxJyLhBdQmjemoOC1dE4LPdKFp+gz+Y=99eiB1Jj0oAg@mail.gmail.com`
In Reply to:	Re: [gentoo-dev] [RFC] Anti-spam for goose by "Michał Górny"

On Thu, May 21, 2020 at 12:10 PM Michał Górny <mgorny@g.o> wrote:

> On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
> > On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgorny@g.o> wrote:
> >
> > > Hi,
> > >
> > > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > > i.e. mass fake submissions.
> > >
> > >
> > > Problem
> > > =======
> > > Goose currently lacks proper limiting of submitted data.  The only
> > > limiter currently in place is based on a unique submitter id that is
> > > randomly generated at setup time and in full control of the submitter.
> > > This only protects against accidental duplicates but it can't protect
> > > against deliberate action.
> > >
> > > An attacker could easily submit thousands (millions?) of fake entries
> > > by issuing a lot of requests with different ids.  Creating them is
> > > as trivial as using successive numbers.  The potential damage includes:
> > >
> > > - distorting the metrics to the point of them being useless (even
> > > though some people consider them useless by design).
> > >
> > > - submitting lots of arbitrary data to cause DoS via growing
> > > the database until no disk space is left.
> > >
> > > - blocking a large range of valid user ids, making collisions with
> > > legitimate users more likely.
> > >
> > > I don't think it's worthwhile to discuss the motivation for doing so:
> > > whether it would be someone wishing harm to Gentoo, disagreeing with
> > > the project or merely wanting to try and see if it would work.  The
> > > case of SKS keyservers teaches us a lesson that you can't leave holes
> > > like this open for a long time because someone eventually will abuse
> > > them.
> > >
> > >

> > > Option 1: IP-based limiting
> > > ===========================
> > > The original idea was to set a hard limit of submissions per week based
> > > on the IP address of the submitter.  This has (at least as far as IPv4
> > > is concerned) the advantages that:
> > >
> > > - the submitter has limited control of his IP address (i.e. he can't
> > > just submit stuff using arbitrary data)
> > >
> > > - IP address range is naturally limited
> > >
> > > - IP addresses have non-zero cost
> > >
> > > This method could strongly reduce the number of fake submissions one
> > > attacker could devise.  However, it has a few problems too:
> > >
> > > - a low limit would harm legitimate submitters sharing an IP address
> > > (e.g. behind NAT)
> > >
> > > - it actively favors people with access to a large number of IP
> > > addresses
> > >
> > > - it doesn't map cleanly to IPv6 (where some people may have just one
> > > IP address, and others may have whole /64 or /48 ranges)
> > >
> > > - it may cause problems for anonymizing network users (and we want to
> > > encourage Tor usage for privacy)
> > >
> > > All this considered, IP address limiting can't be used as the primary
> > > method of preventing fake submissions.  However, I suppose it could
> > > work as an additional DoS prevention, limiting the number of
> > > submissions from a single address over short periods of time.
> > >
> > > Example: if we limit to 10 requests an hour, then a single IP can be
> > > used to manufacture at most 240 submissions a day.  This might be
> > > sufficient to render the metrics unusable but should keep the database
> > > reasonably safe.
> > >
> > >
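
The short-window limit from the example above could be sketched roughly
as follows. This is only an illustration, not goose code: the class and
function names are made up, the limiter uses a simple fixed one-hour
window, and it keeps its counters in memory without ever pruning them.

```python
from collections import defaultdict

LIMIT_PER_HOUR = 10    # figure taken from the example above
WINDOW_SECONDS = 3600  # one hour


class HourlyIpLimiter:
    """Hypothetical fixed-window counter keyed by (address, hour index)."""

    def __init__(self, limit=LIMIT_PER_HOUR, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # never pruned in this sketch

    def allow(self, address, now):
        """Return True if a submission from `address` at time `now` (seconds) is accepted."""
        key = (address, int(now // self.window))
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


def accepted_in_a_day(limiter, address, attempts_per_hour):
    """Simulate one day of evenly spaced attempts and count the accepted ones."""
    accepted = 0
    for hour in range(24):
        for i in range(attempts_per_hour):
            now = hour * 3600 + i * (3600 / attempts_per_hour)
            if limiter.allow(address, now):
                accepted += 1
    return accepted
```

However many requests a single address sends, accepted_in_a_day() tops
out at 10 * 24 = 240, which is the bound quoted above. It does nothing
about an attacker who spreads the requests over many addresses.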

> > > Option 2: proof-of-work
> > > =======================
> > > An alternative of using a proof-of-work algorithm was suggested to me
> > > yesterday.  The idea is that every submission has to be accompanied
> > > with the result of some cumbersome calculation that can't be trivially
> > > run in parallel or optimized out to dedicated hardware.
> > >
> > > On the plus side, it would rely more on actual physical hardware than
> > > IP addresses provided by ISPs.  While it would be a waste of CPU time
> > > and memory, doing it just once a week wouldn't be that much harm.
> > >
> > > On the minus side, it would penalize people with weak hardware.
> > >
> > > For example, 'time hashcash -m -b 28 -r test' gives:
> > >
> > > - 34 s (-s estimated 38 s) on Ryzen 5 3600
> > >
> > > - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
> > >
> > > At the same time, it would still permit a lot of fake submissions.  For
> > > example, randomx [1] claims to require 2G of memory in fast mode.  This
> > > would still allow me to use 7 threads.  If we adjusted the algorithm to
> > > take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
> > > submissions a day.
> > >
> > > So in the end, while this is interesting, it doesn't seem like
> > > a workable anti-spam measure.
> > >
> > >
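
For readers unfamiliar with the idea, a hashcash-style proof of work can
be sketched in a few lines. The code below is an assumption-laden
illustration: it uses SHA-256 over a made-up "resource:counter" string,
which is not the stamp format the hashcash tool produces, and all
function names are hypothetical. It also reproduces the 7 threads at
~30 s estimate from the quoted text.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count the leading zero bits of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource, bits):
    """Search for a counter so that SHA-256(resource:counter) has `bits` leading zero bits."""
    for counter in count():
        digest = hashlib.sha256(f"{resource}:{counter}".encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return counter


def verify(resource, counter, bits):
    """Checking a stamp costs a single hash, however long minting took."""
    digest = hashlib.sha256(f"{resource}:{counter}".encode()).digest()
    return leading_zero_bits(digest) >= bits


def submissions_per_day(threads, seconds_per_stamp):
    """Upper bound on stamps one machine can mint in a day."""
    return threads * (24 * 60 * 60 // seconds_per_stamp)
```

Minting needs about 2**bits hash attempts on average, so every extra bit
doubles the expected cost for fast and slow hardware alike, while
submissions_per_day(7, 30) gives 20160, i.e. the "20k submissions a day"
figure above.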

> > > Option 3: explicit CAPTCHA
> > > ==========================
> > > A traditional way of dealing with spam -- require every new system
> > > identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> > > for one CAPTCHA).
> > >
> > > The advantage of this method is that it requires real human work to
> > > be performed, effectively limiting the ability to submit spam.
> > > The disadvantage is that it is cumbersome to users, so many of them
> > > will just give up on participating.
> > >
> > >
> > > Other ideas
> > > ===========
> > > Do you have any other ideas on how we could resolve this?
> > >
> > >
> > > [1] https://github.com/tevador/RandomX
> > >
> > >
> > > --
> > > Best regards,
> > > Michał Górny
> > >
> >
> >
> > Sadly, the problem with IP addresses is (in this case) that they are
> > anonymous. One can easily start an attack with thousands of IPs (all
> > around the world).
> >
> > One solution would be to introduce user accounts:
> > - one needs to register with an email
>
> Problem 1: you can trivially mass-create email addresses.
>
>
IP verification:
- get enough IPs (botnet) and send your payload

User verification:
- get an email, verify account, solve a captcha

I know if someone wants to, he'll try to bypass the user verification, but
it's just more work to do. We can also enforce IP restrictions and use a
combination of both.

> > - you can rate limit based on the client (not the IP)
> >
> > For example, I have 200 servers: I'd create one account, verify my email
> > (maybe captcha too) and deploy a config with my token on all servers.
> > Then I'd set up a cron job on every server to submit stats. A token can
> > have some lifetime and you could create a new one when the old one is
> > about to expire.
> >
> > If you discover I'm doing false reports, you'd block all my submissions.
> > I can still do fake submissions, but you'd need a per-host verification
> > to avoid that.
> >
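
A token with a lifetime, as described above, could look something like
this. It is a sketch under stated assumptions rather than a proposal for
the actual goose protocol: the "account:expiry:signature" layout, the
HMAC-SHA256 signing, the SERVER_SECRET value and both function names are
invented for illustration.

```python
import hashlib
import hmac
import time

SERVER_SECRET = b"replace-with-a-real-secret"  # hypothetical server-side key


def issue_token(account_id, lifetime_seconds, now=None):
    """Return 'account_id:expiry:signature' for an already verified account."""
    now = time.time() if now is None else now
    expiry = int(now + lifetime_seconds)
    payload = f"{account_id}:{expiry}"
    signature = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{signature}"


def check_token(token, blocked_accounts, now=None):
    """Return the account id if the token is authentic, unexpired and not blocked, else None."""
    now = time.time() if now is None else now
    try:
        account_id, expiry, signature = token.rsplit(":", 2)
        expiry = int(expiry)
    except ValueError:
        return None  # malformed token
    payload = f"{account_id}:{expiry}"
    expected = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None  # signature does not match
    if now >= expiry or account_id in blocked_accounts:
        return None  # expired, or the whole account has been blocked
    return account_id
```

The same token can be deployed to all 200 servers, the server can rate
limit per account id instead of per IP, and adding the id to
blocked_accounts rejects every submission from that account at once.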

>
> Problem 2: we can't really discover this because the goal is to protect
> users' privacy.  The best we can do is to discover that someone is
> submitting a lot from a single account (but are they legitimate?).
> But then, we can just block them.
>
> But in the end, this has the same problem as CAPTCHA -- or maybe it's
> even worse.  It requires additional effort from the users, effectively
> making it less likely for them to participate.  Furthermore, it requires
> them to submit e-mail addresses which they may consider PII.  Even if we
> don't store them permanently but just use them for initial verification,
> they still could choose not to participate.
>

I think if someone wants to participate and believes in the cause, he will.
Many of the users are on Bugzilla anyway, so the email is already on the
Gentoo side. Contributors have their emails in each Gentoo commit.

If spamming is a serious problem you can turn it into an invite-only system
creating a chain of trust.


> --
> Best regards,
> Michał Górny
>
>

Gentoo Archives: gentoo-dev