On Thu, Jan 28, 2016 at 08:42:33PM -0200, Carlos Carvalho wrote:
> The issue is not calculating checksums, it's I/O. The gentoo repository is now
> 335GB. It's out of question to read it all at every update.
> And you should know it!
The primary module I care about, gentoo-portage (historically
gentoo-x86), is under 400MiB raw (~207k inodes, however, so watch your
inode space waste). If your mirror DOESN'T have that entire rsync
module sitting in cache already, you probably have fairly low traffic.
If it's already sitting in cache, there should be no IO hit.
Hashing that much _was_ a CPU hit a decade ago, and most people did NOT
have the memory to fit it all into cache either.

The other issues with mtimes of Manifests, especially on ebuild removal,
have been resolved by the various series of patches, but some of those
patches work by artificially bumping the mtime by a single second in
certain cases.

Those cases are STILL going to have a bumpy time in some situations,
because Git's commit time resolution is only 1 second (ditto many
filesystems). Those situations need either sub-second resolution in the
entire ecosystem or checksums of some form.

We've had ~27k commits in the last 6 months, of which:
- 4811 colliding (commit timestamp only)
- 45 colliding (author timestamp only)
- 12 colliding (commit timestamp, author timestamp)
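
As a rough illustration of how counts like these could be gathered (a
sketch, not the script behind the numbers above): it assumes one line per
commit as printed by `git log --format='%at %ct'`, and the hypothetical
helper `count_collisions` treats a commit as colliding when some other
commit shares its timestamp. That is only one plausible reading of
"colliding".

```python
from collections import Counter


def count_collisions(log_lines):
    """Count commits that share a timestamp with at least one other commit.

    Each input line is "<author_ts> <commit_ts>" in Unix seconds, e.g. from
    `git log --format='%at %ct'`. The definition of "colliding" used here
    is an assumption made for illustration.
    """
    pairs = [tuple(int(f) for f in line.split()) for line in log_lines if line.strip()]
    author_seen = Counter(a for a, _ in pairs)
    commit_seen = Counter(c for _, c in pairs)

    result = {"commit_only": 0, "author_only": 0, "both": 0}
    for author_ts, commit_ts in pairs:
        author_hit = author_seen[author_ts] > 1
        commit_hit = commit_seen[commit_ts] > 1
        if author_hit and commit_hit:
            result["both"] += 1
        elif commit_hit:
            result["commit_only"] += 1
        elif author_hit:
            result["author_only"] += 1
    return result


if __name__ == "__main__":
    sample = [
        "100 200",  # shares commit timestamp 200 with the next commit
        "101 200",
        "300 400",  # shares author timestamp 300 with the next commit
        "300 401",
        "500 600",  # unique in both fields
    ]
    print(count_collisions(sample))
```

Whole repository history is compared here; restricting the comparison to
commits touching the same package would need the changed paths as well.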

We're using author timestamps on the outgoing rsync files, and
eventually we ARE going to have a real collision that hits users.

This didn't happen with CVS because even if you were really fast (read:
local to the CVS server, and not going via SSH), you could only get it
down to about 2 seconds per commit, and never in the same package. Two
different devs could never commit to the same package in the same
second.

> Also, we block --checksum from clients. Most big mirrors do it.
The rsync cached checksums patch needs to get popular again, because
then the mirrors won't have any huge burden at all:
- update the checksums when syncing from the parent repo
- compare against the checksums when queried by the client
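
The two steps above could look roughly like this. It is a minimal sketch
of the idea only, not the format or code of the actual rsync patch: the
`ChecksumCache` class, its in-memory dict, and the choice of SHA-256 are
all assumptions made for illustration.

```python
import hashlib
import os


class ChecksumCache:
    """Hypothetical checksum cache keyed by path, validated by size and mtime."""

    def __init__(self):
        self._entries = {}  # path -> (size, mtime_ns, hex digest)

    @staticmethod
    def _hash_file(path):
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def update(self, path):
        """Step 1: call after syncing a file from the parent repo."""
        st = os.stat(path)
        checksum = self._hash_file(path)
        self._entries[path] = (st.st_size, st.st_mtime_ns, checksum)
        return checksum

    def lookup(self, path):
        """Step 2: answer a client query, re-reading the file only if it changed."""
        st = os.stat(path)
        entry = self._entries.get(path)
        if entry and entry[:2] == (st.st_size, st.st_mtime_ns):
            return entry[2]  # cache hit: one stat, no file data read
        return self.update(path)  # stale or missing entry: hash once, then cache
```

In this sketch a cache entry is itself trusted on size and mtime, so it is
only as good as the sync job that calls `update`.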

> Concerning storing the checksums, you're asking mirrors to use a patched rsync
> version for gentoo? Forget it.
Actually, I want upstream rsync to accept the rsync cached checksums
patch; it was dropped a few years ago amongst other major changes and
never picked up again due to lack of a maintainer.

If the original stats still hold up, the patch will actually improve
performance even without --checksum, as it can save on stat calls on the
server side (it caches a lot more than just the checksum).

> It's the master job to update the repository as you need. Asking mirrors to
> bear an enormous load because you cannot do your job is silly, to put it
> mildly. You'll be ignored by big mirrors, as you've been since your first post.
I've got at least TWO cases of distfiles, upstream from Gentoo, being
screwed up by the relevant authors.

These happen infrequently enough that we'll continue to deal with them
on an exception basis.

There are also known cases where attackers have interfered with mirrors
(upstream & distributions), and you can expect that an attacker could
REASONABLY set the mtime of a file: if the size & mtime are the same,
and you aren't using --checksum in your cronjob to fetch from upstream,
the attack is going to persist. Catching this in all cases is
very difficult to do cheaply in the middle tiers. Cappos [1] included
this in his thesis on Attacks on Package Managers.
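
To make the size & mtime point concrete, here is a small sketch. The
helpers `quick_check_same` and `checksum_same` are hypothetical names: the
first only approximates a size-and-mtime comparison at one-second
resolution, and the second stands in for what --checksum buys you.

```python
import hashlib
import os
import tempfile


def quick_check_same(a, b):
    """Approximate a size + mtime comparison (whole seconds only)."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def checksum_same(a, b):
    """Compare file contents by SHA-256 digest."""
    def digest(path):
        with open(path, "rb") as fh:
            return hashlib.sha256(fh.read()).hexdigest()
    return digest(a) == digest(b)


def demo():
    """Return (quick check result, checksum result) for a tampered copy."""
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "upstream.tar")
        bad = os.path.join(tmp, "mirror.tar")
        with open(good, "wb") as fh:
            fh.write(b"original payload")
        with open(bad, "wb") as fh:
            fh.write(b"tampered payload")  # same length as the original
        # The attacker copies the original timestamps onto the modified file.
        st = os.stat(good)
        os.utime(bad, ns=(st.st_atime_ns, st.st_mtime_ns))
        return quick_check_same(good, bad), checksum_same(good, bad)


if __name__ == "__main__":
    print(demo())
```

The quick check reports the two files as identical while the digests
differ, which is why a fetch without --checksum would never repair the
tampered copy.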

> If you provide a list of checksums, like Debian does, I can use it. However I
> know of no other mirror that has such functionality.
MirrorBrain, as used by SUSE, uses checksums internally (and provides
them to end users). It can verify checksums from mirrors. The downside is
that you have to run Apache (at least on the master, and on some of the
mirrors for best results).
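
For a mirror that wants to consume a published checksum list, verification
could look roughly like the sketch below. It assumes a list in the
`sha256sum` text format (hex digest, whitespace, relative path); the
function name `verify_against_list` is hypothetical, and neither Debian's
nor MirrorBrain's actual formats are implied.

```python
import hashlib
import os


def verify_against_list(list_text, root):
    """Return the relative paths whose content does not match the list.

    `list_text` holds lines of "<sha256 hex digest>  <relative path>".
    Missing files are reported as mismatches too.
    """
    mismatches = []
    for line in list_text.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip().lstrip("*")  # "*" marks binary mode in sha256sum output
        full_path = os.path.join(root, rel_path)
        try:
            digest = hashlib.sha256()
            with open(full_path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            actual = digest.hexdigest()
        except FileNotFoundError:
            actual = None
        if actual != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```

Only files named in the list are read, so the cost scales with what the
master publishes rather than with everything the mirror carries.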

[1] https://www.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html

--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail : robbat2@g.o
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85