Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [RFC] Anti-spam for goose
Date: Thu, 21 May 2020 10:10:44
Message-Id: f1392b5e2c478d591c4536cdfecd9cb411e15f9e.camel@gentoo.org
In Reply to: Re: [gentoo-dev] [RFC] Anti-spam for goose by Tomas Mozes
1 On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
2 > On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgorny@g.o> wrote:
3 >
4 > > Hi,
5 > >
6 > > TL;DR: I'm looking for opinions on how to protect goose from spam,
7 > > i.e. mass fake submissions.
8 > >
9 > >
10 > > Problem
11 > > =======
12 > > Goose currently lacks proper limiting of submitted data. The only
13 > > limiter currently in place is based on unique submitter id that is
14 > > randomly generated at setup time and in full control of the submitter.
15 > > This only protects against accidental duplicates but it can't protect
16 > > against deliberate action.
17 > >
18 > > An attacker could easily submit thousands (millions?) of fake entries by
19 > > issuing a lot of requests with different ids. Creating them is
20 > > as trivial as using successive numbers. The potential damage includes:
21 > >
22 > > - distorting the metrics to the point of it being useless (even though
23 > > some people consider it useless by design).
24 > >
25 > > - submitting lots of arbitrary data to cause DoS via growing
26 > > the database until no disk space is left.
27 > >
28 > > - blocking a large range of valid user ids, making collisions with
29 > > legitimate users more likely.
30 > >
31 > > I don't think it worthwhile to discuss the motivation for doing so:
32 > > whether it would be someone wishing harm to Gentoo, disagreeing with
33 > > the project or merely wanting to try and see if it would work. The case
34 > > of SKS keyservers teaches us a lesson that you can't leave holes like
35 > > this open for a long time because someone eventually will abuse them.
36 > >
37 > >
38 > > Option 1: IP-based limiting
39 > > ===========================
40 > > The original idea was to set a hard limit of submissions per week based
41 > > on IP address of the submitter. This has (at least as far as IPv4 is
42 > > concerned) the advantages that:
43 > >
44 > > - the submitter has limited control of his IP address (i.e. he can't just
45 > > submit stuff using arbitrary data)
46 > >
47 > > - IP address range is naturally limited
48 > >
49 > > - IP addresses have non-zero cost
50 > >
51 > > This method could strongly reduce the number of fake submissions one
52 > > attacker could devise. However, it has a few problems too:
53 > >
54 > > - a low limit would harm legitimate submitters sharing IP address
55 > > (i.e. behind NAT)
56 > >
57 > > - it actively favors people with access to a large number of IP addresses
58 > >
59 > > - it doesn't map cleanly to IPv6 (where some people may have just one IP
60 > > address, and others may have whole /64 or /48 ranges)
61 > >
62 > > - it may cause problems for anonymizing network users (and we want to
63 > > encourage Tor usage for privacy)
64 > >
65 > > All this considered, IP address limiting can't be used as the primary
66 > > method of preventing fake submissions. However, I suppose it could work
67 > > as an additional DoS prevention, limiting the number of submissions from
68 > > a single address over short periods of time.
69 > >
70 > > Example: if we limit to 10 requests an hour, then a single IP can be
71 > > used to manufacture at most 240 submissions a day. This might be
72 > > sufficient to render them unusable but should keep the database
73 > > reasonably safe.
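
The arithmetic in the example above can be checked with a minimal sketch
of a fixed-window limiter. This is only an illustration, not goose's
actual code: the class name, the helper and the in-memory dictionary are
all assumptions made for the example.

```python
from collections import defaultdict


class HourlyLimiter:
    """Fixed-window limiter: at most `limit` submissions per key per window.

    Hypothetical helper for illustration; counters are kept in memory and
    never expired, which a real service would have to handle.
    """

    def __init__(self, limit=10, window=3600):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window index) -> accepted count

    def allow(self, key, now):
        """Return True and count the submission if `key` is under its limit."""
        bucket = (key, int(now // self.window))
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True


def accepted_in_a_day(limiter, key, attempts_per_hour):
    """Simulate 24 hours of attempts from one key; return how many pass.

    Attempts are spaced one second apart, so attempts_per_hour must not
    exceed 3600 for them to stay within their hour.
    """
    accepted = 0
    for hour in range(24):
        for i in range(attempts_per_hour):
            if limiter.allow(key, hour * 3600 + i):
                accepted += 1
    return accepted
```

With the defaults (10 per hour), a single address hammering the service
with 1000 attempts an hour still gets only 10 * 24 = 240 submissions
accepted in a day, which matches the figure above.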
74 > >
75 > >
76 > > Option 2: proof-of-work
77 > > =======================
78 > > An alternative of using a proof-of-work algorithm was suggested to me
79 > > yesterday. The idea is that every submission has to be accompanied by
80 > > the result of some cumbersome calculation that can't be trivially run
81 > > in parallel or optimized out to dedicated hardware.
82 > >
83 > > On the plus side, it would rely more on actual physical hardware than IP
84 > > addresses provided by ISPs. While it would be a waste of CPU time
85 > > and memory, doing it just once a week wouldn't be that much harm.
86 > >
87 > > On the minus side, it would penalize people with weak hardware.
88 > >
89 > > For example, 'time hashcash -m -b 28 -r test' gives:
90 > >
91 > > - 34 s (-s estimated 38 s) on Ryzen 5 3600
92 > >
93 > > - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
94 > >
95 > > At the same time, it would still permit a lot of fake submissions. For
96 > > example, randomx [1] claims to require 2G of memory in fast mode. This
97 > > would still allow me to use 7 threads. If we adjusted the algorithm to
98 > > take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
99 > > submissions a day.
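
The "20k" estimate above follows from simple arithmetic, sketched here
with a hypothetical helper (the function name and parameters are only
for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def submissions_per_day(parallel_workers, seconds_per_proof):
    """Upper bound on proofs one machine can produce in a day.

    Assumes every worker runs continuously and each proof takes exactly
    `seconds_per_proof` seconds.
    """
    proofs_per_worker = SECONDS_PER_DAY // seconds_per_proof
    return parallel_workers * proofs_per_worker
```

With 7 threads and ~30 s per proof, each thread produces 2880 proofs
a day, so submissions_per_day(7, 30) gives 20160.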
100 > >
101 > > So in the end, while this is interesting, it doesn't seem like
102 > > a workable anti-spam measure.
103 > >
104 > >
105 > > Option 3: explicit CAPTCHA
106 > > ==========================
107 > > A traditional way of dealing with spam -- require every new system
108 > > identifier to be confirmed by solving a CAPTCHA (or a few identifiers
109 > > for one CAPTCHA).
110 > >
111 > > The advantage of this method is that it requires real human work to be
112 > > performed, effectively limiting the ability to submit spam.
113 > > The disadvantage is that it is cumbersome to users, so many of them will
114 > > just give up on participating.
115 > >
116 > >
117 > > Other ideas
118 > > ===========
119 > > Do you have any other ideas on how we could resolve this?
120 > >
121 > >
122 > > [1] https://github.com/tevador/RandomX
123 > >
124 > >
125 > > --
126 > > Best regards,
127 > > Michał Górny
128 > >
129 >
130 >
131 > Sadly, the problem with IP addresses is (in this case), that they are
132 > anonymous. One can easily start an attack with thousands of IPs (all around
133 > the world).
134 >
135 > One solution would be to introduce user accounts:
136 > - one needs to register with an email
137
138 Problem 1: you can trivially mass-create email addresses.
139
140 > - you can rate limit based on the client (not the IP)
141 >
142 > For example I've 200 servers, I'd create one account, verify my email
143 > (maybe captcha too) and deploy a config with my token on all servers. Then
144 > I'd set up a cron job on every server to submit stats. A token can have some
145 > lifetime and you could create a new one when the old is about to expire.
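
One way such an expiring per-account token could look is sketched below.
This is purely an illustration of the proposal quoted above, not
something goose implements; the secret, the token layout and the
function names are all assumptions.

```python
import hashlib
import hmac

# Hypothetical server-side secret; a real deployment would keep it private.
SECRET = b"server-side secret"


def _sign(payload):
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(account, expires_at):
    """Return 'account:expiry:signature' for a verified account."""
    payload = f"{account}:{expires_at}"
    return f"{payload}:{_sign(payload)}"


def check_token(token, now, blocked=frozenset()):
    """Return the account name if the token is valid, else None."""
    try:
        account, expires_at, sig = token.rsplit(":", 2)
        expiry = int(expires_at)
    except ValueError:
        return None  # malformed token
    expected = _sign(f"{account}:{expires_at}")
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None  # forged or corrupted signature
    if now >= expiry or account in blocked:
        return None  # expired, or account blocked for abuse
    return account
```

All 200 servers would carry the same token, so the server can count and
limit submissions per account rather than per IP, and blocking the
account rejects every host using it.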
146 >
147 > If you discover I'm doing false reports, you'd block all my submissions. I
148 > can still do fake submissions, but you'd need a per-host verification to
149 > avoid that.
150 >
151
152 Problem 2: we can't really discover this because the goal is to protect
153 users' privacy. The best we can do is to discover that someone is
154 submitting a lot from a single account (but are they legitimate?).
155 But then, we can just block them.
156
157 But in the end, this has the same problem as CAPTCHA -- or maybe it's
158 even worse. It requires additional effort from the users, effectively
159 making it less likely for them to participate. Furthermore, it requires
160 them to submit e-mail addresses which they may consider PII. Even if we
161 don't store them permanently but just use them for initial verification, they
162 still could choose not to participate.
163
164 --
165 Best regards,
166 Michał Górny
