Hi,

On 21-05-2020 10:47:07 +0200, Michał Górny wrote:
> Hi,
>
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
>
>
> Problem
> =======
> Goose currently lacks proper limiting of submitted data. The only
> limiter currently in place is based on a unique submitter id that is
> randomly generated at setup time and fully controlled by the submitter.
> This only protects against accidental duplicates; it can't protect
> against deliberate action.
>
> An attacker could easily submit thousands (millions?) of fake entries
> by issuing a lot of requests with different ids. Creating them is as
> trivial as using successive numbers. The potential damage includes:

Perhaps you could consider something like a reputation system. I'm
thinking of things like only publishing results after X hours when an
id is new (graylisting?), and gradually building up "trust" there.
During those X hours you could then flag a submission as potentially
fraudulent if you see a new user id, loads of submissions from the same
IP, etc. -- what you describe below, I think.

The reputation logic could further build on whether a submission
appears to follow a norm, e.g. compilation times that fall within the
average for the given cpu/arch configuration.
Another way would be to check for submissions for packages that were
actually bumped/stabilised in the tree, to score an id as more likely
to be genuine.
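
Roughly what I have in mind, as a toy sketch in Python (the class,
method names, and the X-hours threshold are all made up, not anything
goose has today):

```python
import time

# Graylisting sketch: submissions from ids younger than GRAYLIST_SECONDS
# are held back and only published once the id has aged past the
# threshold. All names and the threshold are hypothetical.

GRAYLIST_SECONDS = 24 * 3600  # "X hours" of graylisting


class Graylist:
    def __init__(self):
        self.first_seen = {}   # submitter id -> time of first submission
        self.pending = []      # (id, submission) pairs held back

    def submit(self, submitter_id, submission, now=None):
        now = time.time() if now is None else now
        self.first_seen.setdefault(submitter_id, now)
        age = now - self.first_seen[submitter_id]
        if age < GRAYLIST_SECONDS:
            # new id: hold back, do not publish yet
            self.pending.append((submitter_id, submission))
            return "graylisted"
        return "published"

    def release(self, now=None):
        # publish held-back submissions whose ids have aged past X hours
        now = time.time() if now is None else now
        ready = [(i, s) for i, s in self.pending
                 if now - self.first_seen[i] >= GRAYLIST_SECONDS]
        self.pending = [(i, s) for i, s in self.pending
                        if now - self.first_seen[i] < GRAYLIST_SECONDS]
        return ready
```

The held-back window is also where the fraud checks below would run,
before anything in `pending` is released.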

I think it will be a tad complicated, but static limiting might be as
easy to circumvent as it is to put in place, as has been pointed out
already.

Perhaps it is fruitful to think of the reverse: when is something
obviously bad? When a single (obscure?) package is suddenly reported
many times by new ids? When a single id generates hundreds or thousands
of package submissions (is it a misconfigured cluster, many identical
packages, or what looks like an a-to-z scan)?
The thing is, would a single "fake" submission (which IMO will likely
never be noticed) screw up the overall state of things? I think the
fuzziness of the system as a whole should cover for these. It is pure
poisoning that should be mitigated, and I agree with you that
preferably most of it should be blocked by default. The fact probably
is that it will happen nevertheless.
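
The "obviously bad" checks above are easy to prototype; here is a toy
Python version (the thresholds and flag format are made up for
illustration):

```python
from collections import Counter

# Toy anomaly flags: a package suddenly reported by many brand-new ids,
# or a single id producing an implausible number of submissions.
# Thresholds are invented, not tuned.
NEW_ID_REPORTS_MAX = 50
PER_ID_SUBMISSIONS_MAX = 1000


def suspicious(submissions, known_ids):
    """submissions: list of (submitter_id, package) tuples."""
    flags = set()
    # reports of each package coming from ids we have never seen before
    new_id_reports = Counter(pkg for sid, pkg in submissions
                             if sid not in known_ids)
    # total submissions per id
    per_id = Counter(sid for sid, _ in submissions)
    flags.update(f"package:{p}" for p, n in new_id_reports.items()
                 if n > NEW_ID_REPORTS_MAX)
    flags.update(f"id:{i}" for i, n in per_id.items()
                 if n > PER_ID_SUBMISSIONS_MAX)
    return flags
```

Anything flagged would then be held for review (or dropped) rather than
published, in line with the graylisting idea.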

That brings me to the thought: are there things that can be done to
make sure a fraudulent action can be easily undone or negated somehow?
E.g. should a log be kept, or some mechanism to roll back and replay?
Sorry to have no concrete examples here.

Fabian

>
> - distorting the metrics to the point of them being useless (even
> though some people consider them useless by design).
>
> - submitting lots of arbitrary data to cause a DoS by growing
> the database until no disk space is left.
>
> - blocking a large range of valid user ids, making collisions with
> legitimate users more likely.
>
> I don't think it worthwhile to discuss the motivation for doing so:
> whether it would be someone wishing harm to Gentoo, disagreeing with
> the project or merely wanting to try and see if it would work. The
> case of SKS keyservers teaches us the lesson that you can't leave
> holes like this open for a long time, because someone will eventually
> abuse them.
>
>
> Option 1: IP-based limiting
> ===========================
> The original idea was to set a hard limit on submissions per week
> based on the IP address of the submitter. This has (at least as far
> as IPv4 is concerned) the advantages that:
>
> - the submitter has limited control over his IP address (i.e. he
> can't just submit stuff using arbitrary addresses)
>
> - the IP address range is naturally limited
>
> - IP addresses have non-zero cost
>
> This method could strongly reduce the number of fake submissions one
> attacker could devise. However, it has a few problems too:
>
> - a low limit would harm legitimate submitters sharing an IP address
> (i.e. behind NAT)
>
> - it actively favors people with access to a large number of IP
> addresses
>
> - it doesn't map cleanly to IPv6 (where some people may have just one
> IP address, and others may have whole /64 or /48 ranges)
>
> - it may cause problems for users of anonymizing networks (and we
> want to encourage Tor usage for privacy)
>
> All this considered, IP address limiting can't be used as the primary
> method of preventing fake submissions. However, I suppose it could
> work as additional DoS prevention, limiting the number of submissions
> from a single address over short periods of time.
>
> Example: if we limit to 10 requests an hour, then a single IP can be
> used to manufacture at most 240 submissions a day. This might still
> be enough to render the metrics unusable, but it should keep the
> database reasonably safe.
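
Such a short-period limiter is simple enough to implement. A sketch of
the sliding-window variant in Python (constants chosen only to match
the 10-per-hour example above):

```python
import time
from collections import defaultdict, deque

# Sliding-window per-IP limiter: at most RATE_LIMIT submissions per
# WINDOW seconds. 10/hour caps one IP at 240 submissions a day.
RATE_LIMIT = 10
WINDOW = 3600  # one hour, in seconds


class IpRateLimiter:
    def __init__(self):
        # ip -> timestamps of recently accepted requests
        self.hits = defaultdict(deque)

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.hits[ip]
        # drop timestamps that have fallen out of the window
        while q and now - q[0] >= WINDOW:
            q.popleft()
        if len(q) >= RATE_LIMIT:
            return False  # over the limit, reject
        q.append(now)
        return True
```

Whether this is keyed on the exact address or on a whole prefix (to
handle the IPv6 /64 case) is a policy decision on top of it.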
>
>
> Option 2: proof-of-work
> =======================
> An alternative, using a proof-of-work algorithm, was suggested to me
> yesterday. The idea is that every submission has to be accompanied by
> the result of some cumbersome calculation that can't be trivially run
> in parallel or offloaded to dedicated hardware.
>
> On the plus side, it would rely more on actual physical hardware than
> on IP addresses provided by ISPs. While it would be a waste of CPU
> time and memory, doing it just once a week wouldn't do that much
> harm.
>
> On the minus side, it would penalize people with weak hardware.
>
> For example, 'time hashcash -m -b 28 -r test' gives:
>
> - 34 s (-s estimated 38 s) on a Ryzen 5 3600
>
> - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
>
> At the same time, it would still permit a lot of fake submissions.
> For example, randomx [1] claims to require 2G of memory in fast mode.
> This would still allow me to use 7 threads. If we adjusted the
> algorithm to take ~30 seconds, that means 7 submissions every 30 s,
> i.e. 20k submissions a day.
>
> So in the end, while this is interesting, it doesn't seem like
> a workable anti-spam measure.
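
For reference, the hashcash scheme boils down to a partial hash
preimage search. A toy Python version (using SHA-256 rather than
hashcash's SHA-1 stamp format, so not interoperable with the real
tool):

```python
import hashlib
from itertools import count

# Hashcash-style proof of work: find a counter such that
# SHA-256("resource:counter") has at least `bits` leading zero bits.
# Expected cost of mint() doubles with every extra bit; verify() is
# a single hash, so checking submissions stays cheap.


def mint(resource, bits):
    target = 1 << (256 - bits)  # digests below this start with `bits` zeros
    for counter in count():
        digest = hashlib.sha256(f"{resource}:{counter}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return counter


def verify(resource, counter, bits):
    digest = hashlib.sha256(f"{resource}:{counter}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))
```

The asymmetry (expensive mint, one-hash verify) is the whole point,
but as the 7-threads arithmetic above shows, it only raises the cost
per fake submission; it doesn't bound the total.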
>
>
> Option 3: explicit CAPTCHA
> ==========================
> A traditional way of dealing with spam -- require every new system
> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> per CAPTCHA).
>
> The advantage of this method is that it requires real human work to
> be performed, effectively limiting the ability to submit spam.
> The disadvantage is that it is cumbersome to users, so many of them
> will just resign from participating.
154 |
> Other ideas |
155 |
> =========== |
156 |
> Do you have any other ideas on how we could resolve this? |
157 |
> |
158 |
> |
159 |
> [1] https://github.com/tevador/RandomX |
160 |
> |
161 |
> |
162 |
> -- |
163 |
> Best regards, |
164 |
> Michał Górny |
165 |
> |
166 |
|
167 |
|
168 |
|
169 |
-- |
170 |
Fabian Groffen |
171 |
Gentoo on a different level |