Gentoo Archives: gentoo-mirrors

From: Carlos Carvalho <carlos@×××××××××××.br>
To: gentoo-mirrors@l.g.o
Subject: Re: [gentoo-mirrors] mirror fetch jobs and --checksum
Date: Sat, 06 Feb 2016 13:48:29
Message-Id: 20160206134821.GA15959@fisica.ufpr.br
In Reply to: Re: [gentoo-mirrors] mirror fetch jobs and --checksum by "Robin H. Johnson"
1 Robin H. Johnson (robbat2@g.o) wrote on Fri, Jan 29, 2016 at 03:45:13AM BRST:
2 > On Thu, Jan 28, 2016 at 08:42:33PM -0200, Carlos Carvalho wrote:
3 > > The issue is not calculating checksums, it's I/O. The gentoo repository is now
4 > > 335GB. It's out of the question to read it all at every update.
5 > > And you should know it!
6 > The primary module I care about, gentoo-portage (historically
7 > gentoo-x86), is under 400MiB raw (~207k inodes however, so watch your
8 > inode space waste). If your mirror DOESN'T have that entire rsync
9 > module sitting in cache already, you probably have fairly low traffic.
10
11 No, a big mirror like us has tens of repositories and more than 10 million
12 inodes. Gentoo is a small part of it.
13
14 > Hashing that much _was_ a CPU hit a decade ago, and most people did NOT
15 > have the memory to fit it all into cache either.
16
17 For mirroring many repositories the main problem has always been disk I/O,
18 particularly inodes. We know this because we have been mirroring for about a
19 decade.
20
21 > The other issues of mtimes of Manifests esp on ebuild removal have been
22 > resolved by the various series of patches, but part of those were
23 > artificially bumping the mtime by a single second in certain cases.
24 >
25 > Those cases are STILL going to have a bumpy time in some situations
26 > because Git's commit time resolution is only 1 second (ditto many
27 > filesystems). Those situations either need sub-second resolution in the
28 > entire ecosystem or checksums of some form.
29 >
30 > We've had ~27k commits in the last 6 months, of which:
31 > - 4811 colliding (commit timestamp only)
32 > - 45 colliding (author timestamp only)
33 > - 12 colliding (commit timestamp, author timestamp)
34 >
35 > We're using author timestamps on the outgoing rsync files, and
36 > eventually we ARE going to have a real collision that hits users.
37 >
38 > This didn't happen with CVS because even if you were really fast (read: local
39 > to the CVS server, and did not go via SSH), you could only get it down
40 > to about 2 seconds per commit, and never in the same package. Two
41 > different devs could never make a commit to the same package in the same
42 > second.
43
44 You'll have to deal with it at the repository building stage.
45
46 > > Also, we block --checksum from clients. Most big mirrors do it.
47 > The rsync cached checksums patch needs to get popular again, because
48 > then the mirrors won't have any huge burden at all:
49 > - update the checksums when syncing from the parent repo
50 > - compare against the checksums when queried by the client
51
52 It'd be really nice yes, but unfortunately it's much harder. One has to make
53 sure that the checksums match what's on disk, no matter how the update process
54 is interrupted and what happens upstream. The bulk of it is not difficult but
55 the corner cases are :-( The patch is surely a godsend to the origin of content
56 but not to the destination.
57
58 Anyway, I agree checksums are better, so much so that I DO USE checksums when
59 they're available, like Debian does. I won't use the --checksum option
60 for Gentoo or any other repository but if you provide a file with a
61 list of them at your repository the C3SL mirror will use them. The format of
62 the file should be like the output of md5sum/sha*sum. These utilities include only regular
63 files, so you also have to provide another file with a list containing all
64 objects in the repository. Please use
65
66 cd /root-of-repository && TZ=UTC rsync --no-h --list-only -r . > /path/to/filelist
67
68 to create it, because it's easy to parse.
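
To show why that listing is easy to parse, here is a rough Python sketch. It assumes the usual `rsync --list-only` line layout (permissions, size, `YYYY/MM/DD`, `HH:MM:SS`, name), plain digits for the size because of `--no-h`, and UTC timestamps because of `TZ=UTC`. The names `Entry` and `parse_line` are made up for this example, and names that rsync escapes or that begin with whitespace are not handled.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    mode: str               # e.g. "-rw-r--r--", "drwxr-xr-x", "lrwxrwxrwx"
    size: int               # bytes, plain digits thanks to --no-h
    mtime: datetime         # UTC, because the list was made with TZ=UTC
    path: str               # path relative to the repository root
    target: Optional[str]   # symlink target, None for other objects


def parse_line(line: str) -> Entry:
    """Parse one line of `rsync --list-only` output (assumed layout)."""
    # Four whitespace-separated fields, then the name (which may hold spaces).
    mode, size, date, time, name = line.rstrip("\n").split(None, 4)
    mtime = datetime.strptime(f"{date} {time}", "%Y/%m/%d %H:%M:%S")
    mtime = mtime.replace(tzinfo=timezone.utc)
    target = None
    if mode.startswith("l") and " -> " in name:
        # Symlinks are listed as "name -> target".
        name, target = name.split(" -> ", 1)
    return Entry(mode, int(size), mtime, name, target)
```

A mirror could read the whole list with `[parse_line(l) for l in open("/path/to/filelist")]` and compare sizes and mtimes against what it has on disk, without the upstream server touching any inode.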
69
70 md5 is becoming increasingly vulnerable, so the Debian repository maintainers
71 are thinking about using other hashes. It seems sha512 is faster than sha256 on
72 64-bit machines, making it a good option. If you use md5sum the mirror job
73 here is simpler because rsync already does the check; for other hashes it's
74 harder at the mirror side because we have to calculate it after download but
75 the cost is small nowadays. I'm willing to do it and modify our script
76 accordingly.
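
For illustration only, the check after download could look roughly like the Python sketch below. It assumes the checksum list is in the plain `sha512sum` output format (hex digest, a space, a mode character that is a space or `*`, then the file name) and that names are relative to the repository root. This is not the C3SL script; `file_digest` and `verify` are made-up names, and escaped file names (lines starting with a backslash) are not handled.

```python
import hashlib
import os


def file_digest(path, algo="sha512", bufsize=1 << 20):
    """Hash a file in chunks so large files are never held in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(listing, root, algo="sha512"):
    """Return the names whose file is missing or whose digest differs."""
    bad = []
    with open(listing, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # "<hex digest> <mode char><name>", mode char is " " or "*".
            digest, _, rest = line.partition(" ")
            name = rest[1:]
            try:
                ok = file_digest(os.path.join(root, name), algo) == digest.lower()
            except OSError:
                ok = False  # missing or unreadable counts as a failure
            if not ok:
                bad.append(name)
    return bad
```

Each downloaded file is read exactly once here, right after the transfer, which is the small extra cost mentioned above; files listed in `bad` would simply be fetched again.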