Gentoo Archives: gentoo-user

From: Michael Jones <gentoo@×××××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] an efficient idea for an alternative portage synchronisation
Date: Fri, 18 Jun 2021 14:17:07
Message-Id: CABfmKS+kPauptZRObeYiLDYJRMbdRyCYDUmZ5a42v_-r2OnTJA@mail.gmail.com
In Reply to: [gentoo-user] an efficient idea for an alternative portage synchronisation by "caveman رَجُلُ الْكَهْفِ 穴居人"
1 On Fri, Jun 18, 2021, 07:10 caveman رَجُلُ الْكَهْفِ 穴居人 <
2 toraboracaveman@××××××××××.com> wrote:
3
4 > tl;dr - i'm suggesting a new file syncing protocol
5 > for portage syncing. the details are in
6 > section 2.
7 >
8 >
9 > 1. background
10 > -------------
11 > rsync needs to walk every file in the tree in order
12 > to compare them. this is too expensive and doesn't
13 > scale as portage's tree grows in size.
14 >
15 > on the other hand, git gets away with this, by
16 > maintaining a history of edits. so git doesn't
17 > need to compare all files, instead it walks
18 > through the history.
19 >
20 > but git has another issue: the history getting
21 > too big. this causes:
22 > - `git clone` to needlessly take too long, as
23 > many old histories become irrelevant as they
24 > get fully overwritten by newer ones.
25 > - this also causes `git pull` to be slower
26 > than needed, as the history is not ideally
27 > compressed.
28 > - plus, the disk space that's wasted for
29 > histories.
30 >
31 >
32 > 2. new protocol
33 > ---------------
34 > to solve issues above, i think the ideal solution
35 > is this protocol:
36 > - each history is a number representing a
37 > logical clock. 1st history is 0, 2nd is 1,
38 > etc.
39 > - the server maintains a list of the past N
40 > histories of the portage tree.
41 > - when a client requests to update its portage
42 > tree, it tells the server its current
43 > history. e.g. say client is currently
44 > located in logical time 1234567.
45 > - the server is maintaining only the past N
46 > histories:
47 > - if 1234567 is behind those maintained N
48 > ones, then the server sends a full
49 > portage tree from scratch.
50 > - if 1234567 is within those maintained N
51 > ones, then the server has two options:
52 > (1) either send all changes since
53 > 1234567, as they happened
54 > historically. this is a bad idea.
55 > no good reason for it.
56 >
57 > (2) better: the server can send the
58 > compressed histories. compressed
59 > histories are done once, and
60 > cached, in a scalable way. the
61 > cache itself is incremental, so
62 > updating the cache is cheap
63 > (details in section 2.2).
64 >
65 > e.g. if there are 5000 histories
66 > that the client lacks since time
67 > 1234567, then there is a chance
68 > that many of the changes are just
69 > a waste of time. e.g. add a file,
70 > then delete the same file, then
71 > add a different file again. so
72 > why not just lie about the
73 > history, and send the last file,
74 > skipping the ones in the middle? the
75 > same applies to diffs of code
76 > blocks.
77 >
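
The server-side decision described in section 2 can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not an existing Portage or git feature: the names `Change`, `squash`, `respond` and `oldest_clock` are hypothetical, and a history entry is assumed to be a simple map from path to new content (or None for a deletion).

```python
from typing import Dict, List, Optional, Tuple

# Assumed shape of one history entry: path -> new content,
# or None when the path was deleted in that entry.
Change = Dict[str, Optional[bytes]]


def squash(changes: List[Change]) -> Change:
    """Collapse consecutive entries so only the last state of each path survives."""
    net: Change = {}
    for change in changes:
        net.update(change)  # later entries overwrite earlier ones
    return net


def respond(client_clock: int, oldest_clock: int, log: List[Change],
            tree: Dict[str, bytes]) -> Tuple[str, Change]:
    """Pick the reply for a client sitting at client_clock.

    log[i] is the entry that moved the tree from clock oldest_clock + i
    to clock oldest_clock + i + 1, so the server head is
    oldest_clock + len(log).
    """
    head = oldest_clock + len(log)
    if client_clock < oldest_clock or client_clock > head:
        # Client is outside the retained window: send the whole tree.
        return "full", dict(tree)
    # Client is inside the window: send one squashed delta.
    return "delta", squash(log[client_clock - oldest_clock:])
```

A file that was added and then deleted inside the window shows up in the squashed delta as a single deletion, which a client that never had the file can ignore.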
78 > 2.1. properties of this new protocol
79 > ------------------------------------
80 > so this new protocol has these properties:
81 > - unlike rsync, it doesn't need to compare all files
82 > individually.
83 > - unlike git, the history doesn't grow on the
84 > client. history remains only a single
85 > number representing a logical clock.
86 > - the history on the server is limited to N
87 > past entries. no devs will cry, because
88 > this is not a code collaboration app, but
89 > simply a file synchronisation app to replace
90 > rsync. so the admins are free to set N as
91 > small as they please, without worrying about
92 > harming collaborating devs.
93 > - server has the option to compress histories
94 > to clients, and these histories are
95 > cacheable for more performance.
96 >
97 >
98 > 2.2. how it will feel to admins/devs
99 > ------------------------------------
100 > - the devs simply commit their changes to the
101 > portage tree via git.
102 > - the git server will have hooks to execute an
103 > external command for this new protocol, that
104 > will calculate all diffs necessary in order
105 > to build a new history.
106 >
107 > e.g. if current history is 30000, and a dev
108 > makes a new commit via git, then the git
109 > hooks will execute the external command to
110 > calculate the diff for the files affected by
111 > the git commit, such that history 30001 is
112 > created.
113 >
114 > the hooked external command will also see if
115 > it can compress the histories, for the past
116 > M entries counting back from 30001.
117 >
118 > so that clients that live in time 30001-M,
119 > who ask for 30001, can get the compressed
120 > history instead of the raw histories from
121 > 30001-M to 30001.
122 >
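
One way the "incremental cache" from section 2.2 might work is sketched below. It is a guess at the idea, not something the proposal spells out: the class `DeltaCache` and its methods `on_commit` and `delta_for` are hypothetical names, and the git hook is assumed to hand over the files touched by each commit as a path-to-content map (None meaning deleted).

```python
from typing import Dict, Optional

# Assumed shape of one commit's effect: path -> new content, None if deleted.
Change = Dict[str, Optional[bytes]]


class DeltaCache:
    """For each of the last m clocks, keep the net change from that clock to head."""

    def __init__(self, head: int, m: int) -> None:
        self.head = head
        self.m = m
        # A client already at head needs nothing.
        self.net: Dict[int, Change] = {head: {}}

    def on_commit(self, change: Change) -> None:
        """Would be called by the (hypothetical) hook command after each commit."""
        # Fold the new commit into every cached delta; the cost depends on
        # the window size m, not on the length of the full history.
        for delta in self.net.values():
            delta.update(change)
        self.head += 1
        self.net[self.head] = {}
        # Forget the clock that just fell out of the window.
        self.net.pop(self.head - self.m - 1, None)

    def delta_for(self, client_clock: int) -> Optional[Change]:
        """Return the cached net delta, or None if a full tree must be sent."""
        return self.net.get(client_clock)


cache = DeltaCache(head=30000, m=2)
cache.on_commit({"a": b"1"})   # history 30001
cache.on_commit({"a": None})   # history 30002
cache.on_commit({"b": b"2"})   # history 30003, clock 30000 leaves the window
```

With M = 2 in this example, a client at 30001 gets one small delta, while a client still at 30000 falls outside the window and would receive the full tree.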
123 > ty,
124 > cm
125 >
126
127
128 It seems like you are almost asking for git's shallow clones (--depth),
129 which Portage exposes as the clone-depth and sync-depth options in repos.conf.
130
131 It's not an exact match for your proposal, but it's very close.
132
133 >