Require browser-based interaction to use the service. Do something funky
with AJAX so the page can't be properly used with curl or anything so that
manual effort is required to get the UUID to submit as. Only allow
registered UUIDs, and only allow one submission per day per UUID.
Sure, somebody can go to Mechanical Turk and pay a few cents to generate
fake submission IDs, but at least you have that tiny deterrent of "I've got
to pay 3 cents per spam account :(".

Maybe also add some minor tracking to the database if it isn't already
there to count submissions over time per UUID, and make the default cron
script weekly. If you see some UUID that is submitting at the maximum rate
of daily, you may lean towards accusations of spam.
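
A minimal sketch of that per-UUID rule, assuming a simple SQLite store
(the schema and names here are hypothetical, not goose's actual layout):

```python
import datetime
import sqlite3

# Hypothetical schema -- goose's real database layout may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS registered (uuid TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS submissions (
    uuid TEXT NOT NULL,
    day  TEXT NOT NULL,          -- ISO date of the submission
    PRIMARY KEY (uuid, day)      -- enforces one submission per day per UUID
);
"""

def try_submit(db, uuid):
    """Return True if the submission is accepted, False otherwise."""
    today = datetime.date.today().isoformat()
    if db.execute("SELECT 1 FROM registered WHERE uuid = ?",
                  (uuid,)).fetchone() is None:
        return False                       # unknown UUID
    try:
        db.execute("INSERT INTO submissions (uuid, day) VALUES (?, ?)",
                   (uuid, today))
    except sqlite3.IntegrityError:
        return False                       # already submitted today
    return True

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.execute("INSERT INTO registered (uuid) VALUES ('abc')")
print(try_submit(db, "abc"))   # True
print(try_submit(db, "abc"))   # False: second submission the same day
print(try_submit(db, "xyz"))   # False: not registered
```

The UNIQUE (uuid, day) primary key does the daily limiting for free; the
per-UUID counts for the weekly cron report fall out of a GROUP BY on the
same table.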

On Thu, May 21, 2020 at 3:47 AM Michał Górny <mgorny@g.o> wrote:

> Hi,
>
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
>
>
> Problem
> =======
> Goose currently lacks proper limiting of submitted data. The only
> limiter currently in place is based on a unique submitter id that is
> randomly generated at setup time and in full control of the submitter.
> This only protects against accidental duplicates, but it can't protect
> against deliberate action.
>
> An attacker could easily submit thousands (millions?) of fake entries by
> issuing a lot of requests with different ids. Creating them is
> as trivial as using successive numbers. The potential damage includes:
>
> - distorting the metrics to the point of them being useless (even though
>   some people consider them useless by design).
>
> - submitting lots of arbitrary data to cause DoS by growing
>   the database until no disk space is left.
>
> - blocking a large range of valid user ids, making collisions with
>   legitimate users more likely.
>
> I don't think it's worthwhile to discuss the motivation for doing so:
> whether it would be someone wishing harm to Gentoo, disagreeing with
> the project or merely wanting to try and see if it would work. The case
> of SKS keyservers teaches us the lesson that you can't leave holes like
> this open for a long time, because someone eventually will abuse them.
>
>
> Option 1: IP-based limiting
> ===========================
> The original idea was to set a hard limit on submissions per week based
> on the IP address of the submitter. This has (at least as far as IPv4 is
> concerned) the advantages that:
>
> - the submitter has limited control over his IP address (i.e. he can't
>   just submit stuff using arbitrary addresses)
>
> - the IP address range is naturally limited
>
> - IP addresses have non-zero cost
>
> This method could strongly reduce the number of fake submissions one
> attacker could devise. However, it has a few problems too:
>
> - a low limit would harm legitimate submitters sharing an IP address
>   (i.e. behind NAT)
>
> - it actively favors people with access to a large number of IP
>   addresses
>
> - it doesn't map cleanly to IPv6 (where some people may have just one IP
>   address, and others may have whole /64 or /48 ranges)
>
> - it may cause problems for anonymizing network users (and we want to
>   encourage Tor usage for privacy)
>
> All this considered, IP address limiting can't be used as the primary
> method of preventing fake submissions. However, I suppose it could work
> as additional DoS prevention, limiting the number of submissions from
> a single address over short periods of time.
>
> Example: if we limit to 10 requests an hour, then a single IP can be
> used to manufacture at most 240 submissions a day. This might be
> sufficient to render the metrics unusable but should keep the database
> reasonably safe.
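
That hourly cap could be sketched as a sliding-window counter (an
illustrative in-memory version; a real deployment would keep the
timestamps in the database or a shared cache):

```python
import time
from collections import defaultdict, deque

RATE_LIMIT = 10        # max submissions per IP...
WINDOW = 3600.0        # ...per hour (sliding window, in seconds)

_recent = defaultdict(deque)   # ip -> timestamps of recent submissions

def allow_submission(ip, now=None):
    """Return True if this IP is still under RATE_LIMIT per WINDOW."""
    now = time.time() if now is None else now
    window = _recent[ip]
    while window and now - window[0] >= WINDOW:
        window.popleft()           # drop entries older than one hour
    if len(window) >= RATE_LIMIT:
        return False
    window.append(now)
    return True

# 10 requests within the hour go through, the 11th is rejected:
results = [allow_submission("198.51.100.7", now=float(i)) for i in range(11)]
print(results.count(True))   # 10
```

With these constants a single address tops out at 10 * 24 = 240 accepted
submissions a day, matching the example above.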
>
>
> Option 2: proof-of-work
> =======================
> An alternative of using a proof-of-work algorithm was suggested to me
> yesterday. The idea is that every submission has to be accompanied by
> the result of some cumbersome calculation that can't be trivially run
> in parallel or offloaded to dedicated hardware.
>
> On the plus side, it would rely more on actual physical hardware than on
> IP addresses provided by ISPs. While it would be a waste of CPU time
> and memory, doing it just once a week wouldn't do that much harm.
>
> On the minus side, it would penalize people with weak hardware.
>
> For example, 'time hashcash -m -b 28 -r test' gives:
>
> - 34 s (-s estimated 38 s) on a Ryzen 5 3600
>
> - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
>
> At the same time, it would still permit a lot of fake submissions. For
> example, randomx [1] claims to require 2G of memory in fast mode. This
> would still allow me to use 7 threads. If we adjusted the algorithm to
> take ~30 seconds, that means 7 submissions every 30 s, i.e. ~20k
> submissions a day.
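
For the curious, the hashcash check above boils down to finding a nonce
whose hash starts with a given number of zero bits. A rough sketch (SHA-1
like hashcash, but with a much lower difficulty than -b 28 so the demo
finishes instantly; the token format is simplified):

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()   # zeros inside the first non-zero byte
        break
    return bits

def mint(resource: str, difficulty: int) -> str:
    """Brute-force a token whose SHA-1 starts with `difficulty` zero bits."""
    for nonce in count():
        token = f"{resource}:{nonce}"
        if leading_zero_bits(hashlib.sha1(token.encode()).digest()) >= difficulty:
            return token

def verify(token: str, resource: str, difficulty: int) -> bool:
    """Server side: one hash to check what took the client ~2**difficulty."""
    return (token.startswith(resource + ":")
            and leading_zero_bits(hashlib.sha1(token.encode()).digest())
                >= difficulty)

token = mint("test", 16)   # real hashcash uses -b 28; 16 is fast to demo
print(verify(token, "test", 16))   # True
```

Each extra difficulty bit doubles the expected minting time, which is why
28 bits lands in the tens-of-seconds range on current desktop CPUs.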
>
> So in the end, while this is interesting, it doesn't seem like
> a workable anti-spam measure.
>
>
> Option 3: explicit CAPTCHA
> ==========================
> A traditional way of dealing with spam -- require every new system
> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> for one CAPTCHA).
>
> The advantage of this method is that it requires real human work to be
> performed, effectively limiting the ability to submit spam.
> The disadvantage is that it is cumbersome for users, so many of them
> will just resign from participating.
>
>
> Other ideas
> ===========
> Do you have any other ideas on how we could resolve this?
>
>
> [1] https://github.com/tevador/RandomX
>
>
> --
> Best regards,
> Michał Górny
>
>