Robin H. Johnson (robbat2@g.o) wrote on Fri, Jan 29, 2016 at 03:45:13AM BRST:
> On Thu, Jan 28, 2016 at 08:42:33PM -0200, Carlos Carvalho wrote:
> > The issue is not calculating checksums, it's I/O. The gentoo repository is now
> > 335GB. It's out of the question to read it all at every update.
> > And you should know it!
> The primary module I care about, gentoo-portage (historically
> gentoo-x86), is under 400MiB raw (~207k inodes however, so watch your
> inode space waste). If your mirror DOESN'T have that entire rsync
> module sitting in cache already, you probably have fairly low traffic.

No, a big mirror like ours has tens of repositories and more than 10 million
inodes. Gentoo is a small part of it.

> Hashing that much _was_ a CPU hit a decade ago, and most people did NOT
> have the memory to fit it all into cache either.

For mirroring many repositories the main problem has always been disk I/O,
particularly inodes. We know this because we have been doing mirroring for
about a decade.

> The other issues of mtimes of Manifests, esp. on ebuild removal, have been
> resolved by the various series of patches, but part of those were
> artificially bumping the mtime by a single second in certain cases.
>
> Those cases are STILL going to have a bumpy time in some situations
> because Git's commit time resolution is only 1 second (ditto many
> filesystems). Those situations either need sub-second resolution in the
> entire ecosystem or checksums of some form.
>
> We've had ~27k commits in the last 6 months, of which:
> - 4811 colliding (commit timestamp only)
> - 45 colliding (author timestamp only)
> - 12 colliding (commit timestamp, author timestamp)
>
> We're using author timestamps on the outgoing rsync files, and
> eventually we ARE going to have a real collision that hits users.
>
> This didn't happen with CVS because even if you were really fast (read:
> local to the CVS server, and did not go via SSH), you could only get it
> down to about 2 seconds per commit, and never in the same package. Two
> different devs could never commit to the same package in the same
> second.

You'll have to deal with it at the repository building stage.
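
For example, same-second collisions of the kind counted above can be spotted at
build time with something like this (a minimal sketch against a throwaway repo;
the dates, identity, and repo are fabricated for illustration):

```shell
# Build a throwaway repo with two commits forced onto the same second,
# then count committer-timestamp collisions. Illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_DATE='2016-01-29 00:00:00 +0000'
export GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"
git -c user.name=dev -c user.email=dev@example.org \
    commit -q --allow-empty -m 'first'
git -c user.name=dev -c user.email=dev@example.org \
    commit -q --allow-empty -m 'second'

# %ct prints the committer timestamp in whole seconds; any duplicated
# value is a collision of the kind described above.
collisions=$(git log --format=%ct | sort | uniq -d | wc -l)
echo "seconds with colliding commits: $collisions"
```

A repository-building job could run such a check after each sync and bump the
outgoing mtimes only where a collision actually occurred.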

> > Also, we block --checksum from clients. Most big mirrors do it.
> The rsync cached checksums patch needs to get popular again, because
> then the mirrors won't have any huge burden at all:
> - update the checksums when syncing from the parent repo
> - compare against the checksums when queried by the client
51 |
|
52 |
It'd be really nice yes, but unfortunately it's much harder. One has to make |
53 |
sure that the checksums match what's on disk, no matter how the update process |
54 |
is interrupted and what happens upstream. The bulk of it is not difficult but |
55 |
the corner cases are :-( The patch is surely a godsend to the origin of content |
56 |
but not to the destination. |
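
One such corner case is the half-written checksum list. A minimal sketch of the
usual defense (file and directory names here are illustrative, not Gentoo's
actual layout): generate the list under a temporary name and rename it into
place, so an interrupted run never leaves a truncated list visible.

```shell
# Publish a checksum list atomically: write to a temp name, then
# rename. rename(2) is atomic when both paths are on one filesystem,
# so readers see either the old complete list or the new one.
repo=$(mktemp -d)
echo data > "$repo/pkg.ebuild"

list="$repo/CHECKSUMS.sha512"
( cd "$repo" && find . -type f ! -name 'CHECKSUMS.sha512*' -print0 \
    | xargs -0 sha512sum ) > "$list.tmp"
mv "$list.tmp" "$list"

# The destination can then verify what is actually on disk:
( cd "$repo" && sha512sum -c CHECKSUMS.sha512 )
```

This covers interruption of the writer; the harder corner cases are files that
change between hashing and renaming, which need locking or a re-check pass.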

Anyway, I agree checksums are better, so much so that I DO USE checksums when
they're available, as Debian does. I won't use the --checksum option
for Gentoo or any other repository, but if you provide a file with a
list of them in your repository the C3SL mirror will use them. The format of
the file should be like the md5sum/sha*sum output. Those utilities cover only
regular files, so you also have to provide another file listing all
objects in the repository. Please use

cd /root-of-repository && TZ=UTC rsync --no-h --list-only -r . > /path/to/filelist

to create it, because it's easy to parse.
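
It is easy to parse because each line has fixed leading fields (permissions,
size, date, time) followed by the path. A sketch with a fabricated three-line
sample (a real list would come from the rsync command above):

```shell
# A fabricated sample in rsync --list-only format; a real list would
# be generated by the rsync command shown above.
filelist=$(mktemp)
cat > "$filelist" <<'EOF'
drwxr-xr-x           4096 2016/01/29 03:45:13 .
drwxr-xr-x           4096 2016/01/29 03:45:13 app-misc
-rw-r--r--            335 2016/01/29 03:45:13 app-misc/demo.ebuild
EOF

# Print "path size" for every entry except the root "." itself.
# (Paths containing spaces would need a sturdier parser.)
entries=$(awk '$NF != "." { print $NF, $2 }' "$filelist")
echo "$entries"
```

With --no-h the sizes stay machine-readable, which is why that option matters
in the command above.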

MD5 is becoming increasingly vulnerable, so the Debian repository maintainers
are thinking about using other hashes. It seems sha512 is faster than sha256 on
64-bit machines, making it a good option. If you use md5sum the mirror job
here is simpler because rsync already does the check; for other hashes it's
harder on the mirror side because we have to calculate them after download, but
the cost is small nowadays. I'm willing to do it and modify our script
accordingly.
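
The mirror-side step would be roughly this (a sketch with made-up file and list
names): after the rsync run, hash what actually landed on disk, compare it
against the list published upstream, and flag mismatches for a re-fetch.

```shell
# Simulate the mirror side: a downloaded file, the origin's published
# sha512 list, and a corruption that the post-download check catches.
mirror=$(mktemp -d)
echo data > "$mirror/pkg.ebuild"

# Stand-in for the list the origin would publish alongside the repo.
( cd "$mirror" && sha512sum ./pkg.ebuild ) > "$mirror/upstream.sha512"

# Corrupt the file "in transit", then verify: sha512sum -c reports
# FAILED and exits non-zero, which the sync script can act on.
echo corrupted > "$mirror/pkg.ebuild"
if ! ( cd "$mirror" && sha512sum -c upstream.sha512 ); then
    echo "mismatch detected: schedule re-fetch"
fi
```

The same loop works for any hash the origin chooses; only the checksum utility
name changes.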