On Tue, Dec 18, 2018 at 12:14 PM Andrew Savchenko <bircoph@g.o> wrote:

> On Tue, 18 Dec 2018 03:36:14 -0800 Raymond Jennings wrote:
> > On Tue, Dec 18, 2018 at 1:56 AM Andrew Savchenko <bircoph@g.o> wrote:
> > > On Sat, 15 Dec 2018 23:15:47 -0500 Alec Warner wrote:
> > > > Hi,
> > > >
> > > > I am currently embarking on a plan to redo our existing rsync[0] mirror
> > > > network. The current network has aged a bit. It's likely too large and is
> > > > under-maintained. I think in the ideal case we would instead pivot this
> > > > project to scaling out our git mirror capabilities and slowly migrate all
> > > > consumers to pulling the git tree directly. To that end, I'm looking for
> > > > blockers as to why various customers cannot switch to pulling the Gentoo
> > > > ebuild repository from git[1] instead of rsync.
> > > >
> > > > So for example:
> > > >
> > > > - Bandwidth concerns (preferably with documentation / data.)
> > > > - Firewall concerns
> > > > - CPU concerns (e.g. rsync is great for tiny systems?)
> > > > - Disk usage for git vs rsync
> > > > - Other things I have not thought of.
> > >
> > > My main concern with git is downlink fault tolerance. If an rsync
> > > connection is broken, it can be easily restored without much data
> > > retransmission. If a git download connection is broken, it has to
> > > start all over again. So there are cases where rsync will always be
> > > much preferable to git.
> >
> > Are you talking about in comparison to the initial clone?
> > If so, would having the clone default to shallow mitigate this?
> >
> > For the curious, I ran a benchmark.
> >
> > With a completely purged /usr/portage:
> >
> > emerge-webrsync took 30.302s
> > emerge-sync (with git clone --depth 1) took 33.902s
> > emerge-sync (with regular rsync) took a whopping 1m25.863s
> >
> > After a fresh sync:
> >
> > emerge-sync (with regular rsync) took 7.564s
> > emerge-sync (with git fetch --depth 1, and after priming the repo with
> > a full clone) took 2.086s
> >
> > Up front, webrsync seems to be a small winner for initial setups, with
> > git clone a close second, and regular rsync nearly three times slower.
> >
> > Routine syncs would seem to prefer git, especially if they are done
> > with persistent regularity, which IMO would amortize things. My
> > opinion is that over time git would also place less stress on the
> > servers since it only has to look at the commit chain instead of
> > checksumming every single file.
> >
> > That said, would I be correct to surmise that you're advancing a
> > robustness issue and not simply a performance issue?
>
> Yes, my interest here is in robustness, not performance. Sometimes I
> have to use an unreliable uplink, and other users may face the same
> problem.
>
> I agree that in most cases git should be the preferred way to go, but
> there are exceptions. So it would be nice to have rsync as a backup,
> just in case.
>
> Daily or weekly portage snapshots available via rsync should be a
> solution as well.

Two things here. One is that in an ideal world we would run no rsync
service, and any design should keep that outcome in mind. Operationally we
should continue to offer rsync until these types of problems are addressed
by the new system.

The second is that in this case I think the plan is to, as Robin mentioned,
offer "git bundles" that are served over plain HTTP and support resumable
downloads. So instead of downloading an "rsync snapshot" you download a git
bundle over HTTP. Infra would offer these git bundles in a similar way to
existing rsync snapshot offerings[0]. These bundles would be applied to a
machine-local clone of a git repo. Does this conceptually address your
problem? I agree it will be difficult to know outside of actual practical
testing.
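
To make the "resumable" part concrete, here is a minimal sketch (Python,
standard library only) of how a client could continue an interrupted bundle
download with an HTTP Range request. The function names, the bundle file name
and the URL in the note below are hypothetical, and the sketch assumes the
mirror honours Range requests; it is an illustration, not an agreed design.

```python
import os
import urllib.error
import urllib.request


def resume_request(url, dest):
    """Build a GET request that asks only for the bytes missing from dest."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url)
    if offset:
        # Ask the server to start at the first byte we do not have yet.
        req.add_header("Range", f"bytes={offset}-")
    return req, offset


def download_resumable(url, dest, chunk_size=64 * 1024):
    """Download url to dest, continuing a partial file when possible."""
    req, offset = resume_request(url, dest)
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as err:
        if err.code == 416 and offset:
            # The range starts at or past the end of the remote file.
            # Assumption: treat this as "local copy is already complete".
            return
        raise
    with resp:
        # 206 means the Range header was honoured. On a plain 200 the server
        # is sending the whole file, so start over instead of appending.
        mode = "ab" if resp.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                block = resp.read(chunk_size)
                if not block:
                    break
                out.write(block)
```

Calling download_resumable("https://mirror.example.org/gentoo.bundle",
"gentoo.bundle") again after a dropped connection would only transfer the
missing tail. Once the file is complete, `git bundle verify gentoo.bundle`
followed by a `git fetch` from the bundle file inside the local clone should
apply it.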

-A

[0] http://gentoo.ussg.indiana.edu/snapshots/ is one example of the current
system. Instead of tarballs of an 'rsync tree' these would be git
bundles[1] that you fetch and apply locally. We would support a worldwide
mirror network for these bundles.
[1] https://git-scm.com/docs/git-bundle

>
> Best regards,
> Andrew Savchenko
>