On Thu, May 21, 2020 at 12:10 PM Michał Górny <mgorny@g.o> wrote:

> On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
> > On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgorny@g.o> wrote:
> >
> > > Hi,
> > >
> > > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > > i.e. mass fake submissions.
> > >
> > >
> > > Problem
> > > =======
> > > Goose currently lacks proper limiting of submitted data. The only
> > > limiter currently in place is based on a unique submitter id that
> > > is randomly generated at setup time and is in full control of
> > > the submitter. This only protects against accidental duplicates;
> > > it can't protect against deliberate action.
> > >
> > > An attacker could easily submit thousands (millions?) of fake
> > > entries by issuing a lot of requests with different ids. Creating
> > > them is as trivial as using successive numbers. The potential
> > > damage includes:
> > >
> > > - distorting the metrics to the point of being useless (even though
> > > some people consider them useless by design)
> > >
> > > - submitting lots of arbitrary data to cause a DoS by growing
> > > the database until no disk space is left
> > >
> > > - blocking a large range of valid user ids, making collisions with
> > > legitimate users more likely
> > >
> > > I don't think it's worthwhile to discuss the motivation for doing
> > > so: whether it would be someone wishing harm to Gentoo, someone
> > > disagreeing with the project or someone merely wanting to try and
> > > see if it would work. The case of the SKS keyservers teaches us
> > > that you can't leave holes like this open for long, because
> > > someone will eventually abuse them.
> > >
> > >
> > > Option 1: IP-based limiting
> > > ===========================
> > > The original idea was to set a hard limit on submissions per week
> > > based on the IP address of the submitter. This has (at least as
> > > far as IPv4 is concerned) the following advantages:
> > >
> > > - the submitter has limited control over his IP address (i.e. he
> > > can't just submit stuff using arbitrary addresses)
> > >
> > > - the IP address range is naturally limited
> > >
> > > - IP addresses have a non-zero cost
> > >
> > > This method could strongly reduce the number of fake submissions
> > > one attacker could devise. However, it has a few problems too:
> > >
> > > - a low limit would harm legitimate submitters sharing an IP
> > > address (e.g. behind NAT)
> > >
> > > - it actively favors people with access to a large number of IP
> > > addresses
> > >
> > > - it doesn't map cleanly to IPv6 (where some people may have just
> > > one IP address, and others may have whole /64 or /48 ranges)
> > >
> > > - it may cause problems for users of anonymizing networks (and we
> > > want to encourage Tor usage for privacy)
> > >
> > > All this considered, IP address limiting can't be used as the
> > > primary method of preventing fake submissions. However, I suppose
> > > it could work as an additional DoS prevention, limiting the number
> > > of submissions from a single address over short periods of time.
> > >
> > > Example: if we limit to 10 requests an hour, then a single IP can
> > > be used to manufacture at most 240 submissions a day. This might
> > > still be enough to render the metrics unusable, but it should keep
> > > the database reasonably safe.
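
For reference, the fixed per-IP limit described above is easy to
sketch. This is just an illustration with made-up names and in-memory
storage, not anything goose actually has:

```python
# Illustrative sketch of the per-IP limit described above (10 requests
# an hour => at most 240 submissions a day per IP).  Names and storage
# are made up; a real service would keep its counters somewhere
# persistent such as Redis.
import time
from collections import defaultdict

WINDOW = 3600  # one hour, in seconds
LIMIT = 10     # max submissions per IP per window

_counts = defaultdict(int)  # (ip, window index) -> submissions seen


def allow_submission(ip, now=None):
    """Return True if `ip` is still under its limit for this window."""
    now = time.time() if now is None else now
    key = (ip, int(now // WINDOW))
    if _counts[key] >= LIMIT:
        return False
    _counts[key] += 1
    return True
```

With these numbers a single IP maxes out at 10 * 24 = 240 accepted
submissions a day, matching the estimate above.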
> > >
> > >
> > > Option 2: proof-of-work
> > > =======================
> > > The alternative of using a proof-of-work algorithm was suggested
> > > to me yesterday. The idea is that every submission has to be
> > > accompanied by the result of some cumbersome calculation that
> > > can't be trivially run in parallel or offloaded to dedicated
> > > hardware.
> > >
> > > On the plus side, it would rely more on actual physical hardware
> > > than on IP addresses provided by ISPs. While it would be a waste
> > > of CPU time and memory, doing it just once a week wouldn't do
> > > that much harm.
> > >
> > > On the minus side, it would penalize people with weak hardware.
> > >
> > > For example, 'time hashcash -m -b 28 -r test' gives:
> > >
> > > - 34 s (estimated 38 s) on a Ryzen 5 3600
> > >
> > > - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
> > >
> > > At the same time, it would still permit a lot of fake submissions.
> > > For example, randomx [1] claims to require 2G of memory in fast
> > > mode. This would still allow me to use 7 threads. If we adjusted
> > > the algorithm to take ~30 seconds, that means 7 submissions every
> > > 30 s, i.e. 20k submissions a day.
> > >
> > > So in the end, while this is interesting, it doesn't seem like
> > > a workable anti-spam measure.
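
Just to make the scheme concrete: a hashcash-style proof of work boils
down to finding a nonce whose hash starts with N zero bits, while
verification costs a single hash. A made-up sketch of the general
idea, not the actual hashcash or RandomX algorithms:

```python
# Sketch of a hashcash-style proof of work: the client must find
# a nonce such that SHA-256(submission_id + nonce) starts with `bits`
# zero bits.  Solving costs ~2**bits hashes; verifying costs one.
# This is an illustration only, not the real hashcash or RandomX.
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(submission_id: str, bits: int) -> int:
    """Client side: brute-force a nonce (cost grows as 2**bits)."""
    for nonce in count():
        digest = hashlib.sha256(f"{submission_id}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce


def verify(submission_id: str, nonce: int, bits: int) -> bool:
    """Server side: a single hash suffices to check the work."""
    digest = hashlib.sha256(f"{submission_id}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits
```

The asymmetry (expensive solve, cheap verify) is the whole point; the
weakness discussed above is that the attacker's solve cost scales with
how much hardware he can throw at it.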
> > >
> > >
> > > Option 3: explicit CAPTCHA
> > > ==========================
> > > A traditional way of dealing with spam -- require every new system
> > > identifier to be confirmed by solving a CAPTCHA (or a few
> > > identifiers for one CAPTCHA).
> > >
> > > The advantage of this method is that it requires real human work
> > > to be performed, effectively limiting the ability to submit spam.
> > > The disadvantage is that it is cumbersome for users, so many of
> > > them will simply opt out of participating.
> > >
> > >
> > > Other ideas
> > > ===========
> > > Do you have any other ideas on how we could resolve this?
> > >
> > >
> > > [1] https://github.com/tevador/RandomX
> > >
> > >
> > > --
> > > Best regards,
> > > Michał Górny
> > >
> >
> >
> > Sadly, the problem with IP addresses (in this case) is that they
> > are anonymous. One can easily start an attack with thousands of IPs
> > (from all around the world).
> >
> > One solution would be to introduce user accounts:
> > - one needs to register with an email
>
> Problem 1: you can trivially mass-create email addresses.
>
>
IP verification:
- get enough IPs (botnet) and send your payload

User verification:
- get an email, verify the account, solve a captcha

I know that if someone really wants to, he'll try to bypass the user
verification, but it's just more work to do. We can also enforce IP
restrictions and use a combination of both.

> > - you can rate limit based on the client (not the IP)
> >
> > For example, I've got 200 servers; I'd create one account, verify
> > my email (maybe solve a captcha too) and deploy a config with my
> > token on all servers. Then I'd set up a cron job on every server to
> > submit stats. A token can have some lifetime and you could create a
> > new one when the old one is about to expire.
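
Such an expiring token could be as cheap as an HMAC over the account
id and an expiry timestamp, so the server doesn't even need to store
issued tokens. A made-up sketch, not an existing goose interface:

```python
# Hypothetical sketch of an expiring, server-signed submission token:
# token = account_id : expiry : HMAC(secret, account_id + expiry).
# The server can verify any token statelessly.  Names are made up.
import hashlib
import hmac
import time

SECRET = b"server-side secret key"  # placeholder; keep out of the repo


def issue_token(account_id: str, lifetime: int = 30 * 24 * 3600) -> str:
    """Issue a token valid for `lifetime` seconds (default 30 days)."""
    expiry = int(time.time()) + lifetime
    msg = f"{account_id}:{expiry}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{account_id}:{expiry}:{sig}"


def check_token(token: str, now=None) -> bool:
    """Accept only unexpired tokens with a valid signature."""
    now = int(time.time()) if now is None else now
    try:
        account_id, expiry, sig = token.rsplit(":", 2)
    except ValueError:
        return False
    msg = f"{account_id}:{expiry}".encode()
    good = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, good) and now < int(expiry)
```

Renewal is then just calling issue_token again before the old token
expires, exactly as in the cron-job setup described above.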
> >
> > If you discover I'm doing false reports, you'd block all my
> > submissions. I can still do fake submissions, but you'd need
> > per-host verification to avoid that.
> >
>
> Problem 2: we can't really discover this, because the goal is to
> protect users' privacy. The best we can do is to discover that
> someone is submitting a lot from a single account (but are they
> legitimate?). But then, we can just block them.
>
> But in the end, this has the same problem as CAPTCHA -- or maybe it's
> even worse. It requires additional effort from the users, effectively
> making it less likely for them to participate. Furthermore, it
> requires them to submit e-mail addresses, which they may consider
> PII. Even if we don't store them permanently but just use them for
> initial verification, some users could still choose not to
> participate.
>

I think that if someone wants to participate and believes in the
cause, he will. Many of the users are on Bugzilla anyway, so their
email is already on the Gentoo side. Contributors have their emails in
each Gentoo commit.

If spamming becomes a serious problem, you can turn it into an
invite-only system, creating a chain of trust.
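
A chain of trust like that could be as simple as recording who invited
whom, so a wave of spam accounts can be traced back to its inviter and
revoked as a whole. Again just a hypothetical sketch:

```python
# Hypothetical sketch of an invite-only chain of trust: every account
# records its inviter, so an abusive account and everyone it invited
# (transitively) can be revoked together.
class InviteChain:
    def __init__(self, root: str):
        self.inviter = {root: None}  # account -> who invited it
        self.revoked = set()

    def invite(self, inviter: str, new_account: str) -> bool:
        """Only existing, non-revoked accounts may invite."""
        if inviter not in self.inviter or inviter in self.revoked:
            return False
        self.inviter[new_account] = inviter
        return True

    def revoke_subtree(self, account: str):
        """Revoke an account and, transitively, everyone it invited."""
        self.revoked.add(account)
        for child, parent in self.inviter.items():
            if parent == account and child not in self.revoked:
                self.revoke_subtree(child)

    def is_trusted(self, account: str) -> bool:
        return account in self.inviter and account not in self.revoked
```

The trade-off is the same as with any invite system: it caps spam at
the cost of slowing down legitimate growth.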


> --
> Best regards,
> Michał Górny
>
>