Hi,

TL;DR: I'm looking for opinions on how to protect goose from spam,
i.e. mass fake submissions.


Problem
=======
Goose currently lacks proper limiting of submitted data. The only
limiter currently in place is based on a unique submitter id that is
randomly generated at setup time and fully under the submitter's
control. This only protects against accidental duplicates; it can't
protect against deliberate action.

An attacker could easily submit thousands (millions?) of fake entries by
issuing a lot of requests with different ids. Creating them is
as trivial as using successive numbers. The potential damage includes:

- distorting the metrics to the point of them being useless (even though
  some people consider them useless by design).

- submitting lots of arbitrary data to cause DoS by growing
  the database until no disk space is left.

- blocking a large range of valid user ids, making collisions with
  legitimate users more likely.

I don't think it's worthwhile to discuss the motivation for doing so:
whether it would be someone wishing harm to Gentoo, disagreeing with
the project or merely wanting to try and see if it would work. The case
of SKS keyservers teaches us that you can't leave holes like this open
for a long time because someone will eventually abuse them.


Option 1: IP-based limiting
===========================
The original idea was to set a hard limit of submissions per week based
on the IP address of the submitter. This has (at least as far as IPv4 is
concerned) the advantages that:

- the submitter has limited control over their IP address (i.e. they
  can't just submit stuff using arbitrary data)

- the IP address range is naturally limited

- IP addresses have non-zero cost

This method could strongly reduce the number of fake submissions one
attacker could devise. However, it has a few problems too:

- a low limit would harm legitimate submitters sharing an IP address
  (e.g. behind NAT)

- it actively favors people with access to a large number of IP
  addresses

- it doesn't map cleanly to IPv6 (where some people may have just one IP
  address, and others may have whole /64 or /48 ranges)

- it may cause problems for anonymizing network users (and we want to
  encourage Tor usage for privacy)

All this considered, IP address limiting can't be used as the primary
method of preventing fake submissions. However, I suppose it could work
as an additional DoS prevention, limiting the number of submissions from
a single address over short periods of time.

Example: if we limit to 10 requests an hour, then a single IP can be
used to manufacture at most 240 submissions a day. This might be
sufficient to render the metrics unusable but should keep the database
reasonably safe.

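To make the hourly limit concrete, here is a minimal Python sketch of
a per-address sliding-window limiter. It is not goose code: the names
client_key and SlidingWindowLimiter are made up for illustration, and
collapsing IPv6 addresses into a /64 bucket is only one assumed way of
dealing with the IPv6 problem listed above.

```python
import ipaddress
import time
from collections import defaultdict, deque
from typing import Optional


def client_key(addr: str, v6_prefix: int = 64) -> str:
    """Map an address to a rate-limit bucket.

    IPv4 addresses are used as-is. IPv6 addresses are collapsed to their
    enclosing prefix (/64 by default; an assumption, not a goose setting).
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)
    return str(ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False))


class SlidingWindowLimiter:
    """Accept at most `limit` submissions per `window` seconds per bucket."""

    def __init__(self, limit: int = 10, window: float = 3600.0):
        self.limit = limit
        self.window = window
        # bucket -> timestamps of the submissions accepted so far
        self._hits = defaultdict(deque)

    def allow(self, addr: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_key(addr)]
        # Forget accepted submissions that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

With limit=10 and window=3600, a single address that keeps retrying
gets 10 submissions accepted per hour, i.e. 240 a day, which matches
the estimate above. Rejected attempts are not recorded, so hammering
the server does not extend the lockout.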

Option 2: proof-of-work
=======================
An alternative of using a proof-of-work algorithm was suggested to me
yesterday. The idea is that every submission has to be accompanied by
the result of some cumbersome calculation that can't be trivially run
in parallel or offloaded to dedicated hardware.

On the plus side, it would rely more on actual physical hardware than on
IP addresses provided by ISPs. While it would be a waste of CPU time
and memory, doing it just once a week wouldn't do that much harm.

On the minus side, it would penalize people with weak hardware.

For example, 'time hashcash -m -b 28 -r test' gives:

- 34 s (-s estimated 38 s) on Ryzen 5 3600

- 3 minutes (estimated 92 s) on some old 32-bit Celeron M

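For reference, the kind of computation being timed above can be
sketched in a few lines of Python. This is a simplified hashcash-like
scheme that hashes 'resource:nonce' with SHA-256; it is not the real
hashcash stamp format, and unlike RandomX it is not memory-hard. The
names mint, verify and leading_zero_bits are hypothetical.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of leading zero bits in a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(resource: str, nonce: int, bits: int) -> bool:
    """Server side: a single hash, so checking a stamp is cheap."""
    digest = hashlib.sha256(f"{resource}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits


def mint(resource: str, bits: int) -> int:
    """Client side: brute-force a nonce (about 2**bits hashes on average)."""
    for nonce in count():
        if verify(resource, nonce, bits):
            return nonce


if __name__ == "__main__":
    nonce = mint("test", 16)
    print(nonce, verify("test", nonce, 16))
```

Each extra bit doubles the expected minting cost while verification
stays at one hash. Pure Python will be far slower than the hashcash
binary, so try small values of bits rather than 28.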
At the same time, it would still permit a lot of fake submissions. For
example, RandomX [1] claims to require 2G of memory in fast mode. This
would still allow me to use 7 threads. If we adjusted the algorithm to
take ~30 seconds, that means 7 submissions every 30 s, i.e. about 20k
submissions a day.

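The estimate works out as follows. The helper name is hypothetical and
the figures are the ones from the paragraph above (7 threads, ~30 s per
proof).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def submissions_per_day(threads: int, seconds_per_proof: float) -> int:
    """Upper bound on proofs one machine can produce in a day."""
    return int(threads * SECONDS_PER_DAY / seconds_per_proof)


print(submissions_per_day(7, 30))  # 20160
```

That is 84 times the 240 submissions a day that a single IP address
could produce under the 10 requests an hour example.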
So in the end, while this is interesting, it doesn't seem like
a workable anti-spam measure.


Option 3: explicit CAPTCHA
==========================
A traditional way of dealing with spam -- require every new system
identifier to be confirmed by solving a CAPTCHA (or a few identifiers
for one CAPTCHA).

The advantage of this method is that it requires real human work to be
performed, effectively limiting the ability to submit spam.
The disadvantage is that it is cumbersome for users, so many of them
will just give up on participating.


Other ideas
===========
Do you have any other ideas on how we could resolve this?


[1] https://github.com/tevador/RandomX


-- 
Best regards,
Michał Górny