Gentoo Archives: gentoo-mirrors

From: "Robin H. Johnson" <robbat2@g.o>
To: gentoo-mirrors@l.g.o
Subject: Re: [gentoo-mirrors] mirror fetch jobs and --checksum
Date: Fri, 29 Jan 2016 05:45:18
Message-Id: robbat2-20160129T042139-288883805Z@orbis-terrarum.net
In Reply to: Re: [gentoo-mirrors] mirror fetch jobs and --checksum by Carlos Carvalho
1 On Thu, Jan 28, 2016 at 08:42:33PM -0200, Carlos Carvalho wrote:
2 > The issue is not calculating checksums, it's I/O. The gentoo repository is now
3 > 335GB. It's out of question to read it all at every update.
4 > And you should know it!
5 The primary module I care about, gentoo-portage (historically
6 gentoo-x86), is under 400MiB raw (~207k inodes, however, so watch your
7 inode space waste). If your mirror DOESN'T have that entire rsync
8 module sitting in cache already, you probably have fairly low traffic.
9 If it's already sitting in cache, there should be no IO hit.
10 Hashing that much _was_ a CPU hit a decade ago, and most people did NOT
11 have the memory to fit it all into cache either.
12
13 The other issues with Manifest mtimes, especially on ebuild removal, have
14 been resolved by the various series of patches, but part of that fix was
15 artificially bumping the mtime by a single second in certain cases.
16
17 Those cases are STILL going to have a bumpy time in some situations
18 because Git's commit time resolution is only 1 second (ditto many
19 filesystems). Those situations either need sub-second resolution in the
20 entire ecosystem or checksums of some form.
21
22 We've had ~27k commits in the last 6 months, of which:
23 - 4811 colliding (commit timestamp only)
24 - 45 colliding (author timestamp only)
25 - 12 colliding (commit timestamp, author timestamp)
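
A rough sketch of how collision counts like the ones above could be gathered. This is not the script used for the figures in this mail; the bucket definitions (a commit "collides" if another commit shares its one-second timestamp), the function names and the `since` parameter are all assumptions made for illustration. It relies only on the `%ct` (committer timestamp) and `%at` (author timestamp) format placeholders of `git log`.

```python
import subprocess
from collections import Counter


def read_timestamps(repo=".", since="6 months ago"):
    """Return one (committer_ts, author_ts) pair per commit, via git log."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--since=" + since, "--format=%ct %at"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [tuple(map(int, line.split())) for line in out.splitlines() if line.strip()]


def count_collisions(pairs):
    """Bucket commits by which of their timestamps is shared with another commit.

    Assumed definition: a timestamp "collides" when more than one commit
    carries the same value at one-second resolution.
    """
    commit_seen = Counter(c for c, _ in pairs)
    author_seen = Counter(a for _, a in pairs)
    counts = {"commit_only": 0, "author_only": 0, "both": 0}
    for c, a in pairs:
        c_hit = commit_seen[c] > 1
        a_hit = author_seen[a] > 1
        if c_hit and a_hit:
            counts["both"] += 1
        elif c_hit:
            counts["commit_only"] += 1
        elif a_hit:
            counts["author_only"] += 1
    return counts
```

Running `count_collisions(read_timestamps("/path/to/repo"))` inside a checkout would give the three buckets; the path is a placeholder.
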
26
27 We're using author timestamps on the outgoing rsync files, and
28 eventually we ARE going to have a real collision that hits users.
29
30 This didn't happen with CVS because even if you were really fast (read: local
31 to the CVS server, and did not go via SSH), you could only get it down
32 to about 2 seconds per commit, and never in the same package. Two
33 different devs could never commit to the same package in the same
34 second.
35
36 > Also, we block --checksum from clients. Most big mirrors do it.
37 The rsync cached checksums patch needs to get popular again, because
38 then the mirrors won't have any huge burden at all:
39 - update the checksums when syncing from the parent repo
40 - compare against the checksums when queried by the client
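
The two steps above can be sketched as a small cache. This is only an illustration of the idea, NOT the rsync cached checksums patch itself: the class and method names are hypothetical, SHA-256 is an arbitrary choice of hash, and the cache lives in memory rather than on disk.

```python
import hashlib


def file_digest(path):
    """Hash a file in 1 MiB chunks so large distfiles need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


class ChecksumCache:
    """Hypothetical cache: hash once at sync time, answer queries from memory."""

    def __init__(self):
        self.digests = {}  # path -> hex digest

    def update(self, path):
        """Step 1: called by the sync job for every file it just wrote."""
        self.digests[path] = file_digest(path)
        return self.digests[path]

    def forget(self, path):
        """Called by the sync job when a file is deleted upstream."""
        self.digests.pop(path, None)

    def matches(self, path, client_digest):
        """Step 2: compare a client's checksum without touching the disk."""
        return self.digests.get(path) == client_digest
```

The point of the split is that the only full read of a file happens when the mirror has just written it anyway; client queries never cause I/O on the file contents.
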
41
42 > Concerning storing the checksums, you're asking mirrors to use a patched rsync
43 > version for gentoo? Forget it.
44 Actually, I want upstream rsync to accept the rsync cached checksums
45 patch; it was dropped a few years ago amongst other major changes and
46 never picked up again for lack of a maintainer.
47
48 If the original stats still hold up, the patch will actually improve
49 performance even without --checksum, as it can save on stat calls on the
50 server side (it caches a lot more than just the checksum).
51
52 > It's the master job to update the repository as you need. Asking mirrors to
53 > bear an enormous load because you cannot do your job is silly, to put it
54 > mildly. You'll be ignored by big mirrors, as you've been since your first post.
55 I've got at least TWO cases of distfiles, upstream from Gentoo, being
56 screwed up by the relevant authors.
57
58 These happen infrequently enough that we'll continue to deal with them
59 on an exception basis.
60
61 There are also known cases where attackers have interfered with mirrors
62 (upstream & distributions); and you can expect that an attacker could
63 REASONABLY set the mtime of a file (if the size & mtime are the same,
64 and you aren't using --checksum in your cronjob to fetch from upstream,
65 the attack is going to persist). Catching this in all of the cases is
66 very difficult to do cheaply in the middle tiers. Cappos [1] included
67 this in his thesis on Attacks on Package Managers.
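
A small demonstration of why matching size and mtime is not enough. The comparison below is a simplified stand-in for rsync's default quick check (size plus modification time, compared here at whole-second resolution); the file names and payloads are made up for the example.

```python
import hashlib
import os
import tempfile


def quick_check_same(a, b):
    """Simplified quick check: same size and same mtime (whole seconds)."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def checksum_same(a, b):
    """Content comparison: hash both files and compare the digests."""
    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).digest()
    return digest(a) == digest(b)


def demo():
    """Build a tampered file with the original's size and mtime."""
    with tempfile.TemporaryDirectory() as d:
        good = os.path.join(d, "good.tar")
        evil = os.path.join(d, "evil.tar")
        with open(good, "wb") as f:
            f.write(b"original payload")
        with open(evil, "wb") as f:
            f.write(b"tampered payload")  # same length as the original
        st = os.stat(good)
        # The attacker copies the timestamps onto the tampered file.
        os.utime(evil, ns=(st.st_atime_ns, st.st_mtime_ns))
        return quick_check_same(good, evil), checksum_same(good, evil)
```

Here `demo()` returns `(True, False)`: the quick check treats the two files as identical, while the checksum comparison does not.
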
68
69 > If you provide a list of checksums, like Debian does, I can use it. However I
70 > know of no other mirror that has such functionality.
71 MirrorBrain, as used by SUSE, uses checksums internally (and provides
72 them to end users). It can verify checksums from mirrors. Downside is
73 that you have to run apache (at least on the master, and on some of the
74 mirrors for best results).
75
76 [1] https://www.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html
77
78 --
79 Robin Hugh Johnson
80 Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
81 E-Mail : robbat2@g.o
82 GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
