Gentoo Archives: gentoo-user

From: "caveman رَجُلُ الْكَهْفِ 穴居人" <toraboracaveman@××××××××××.com>
To: Gentoo <gentoo-user@l.g.o>
Subject: [gentoo-user] an efficient idea for an alternative portage synchronisation
Date: Fri, 18 Jun 2021 12:10:39
Message-Id: W39UY79gTTnkYBA-829kjiWYRPxelVDQq1r9_DiK-R3zu7I4RbJ7s3l-freaWBbsSk6JnFf5cRFv2L0cc7kaSxJezUyQG3iY4-2i0dNpKpc=@protonmail.com
tl;dr - i'm suggesting a new file syncing protocol
for portage syncing. details are in section 2.


1. background
-------------
rsync needs to examine every file on both sides in
order to compare them (size and mtime by default,
full contents with --checksum). this is too
expensive and doesn't scale as portage's tree
grows in size.

on the other hand, git avoids this by maintaining
a history of edits. so git doesn't need to compare
all files; instead it walks through the history.

but git has another issue: the history gets too
big. this causes:
    - `git clone` to take needlessly long, as many
      old histories become irrelevant once they
      are fully overridden by newer ones.
    - `git pull` to be slower than needed, as the
      history is not ideally compressed.
    - disk space to be wasted on histories.


2. new protocol
---------------
to solve the issues above, i think the ideal
solution is this protocol:
    - each history is a number representing a
      logical clock. 1st history is 0, 2nd is 1,
      etc.
    - the server maintains a list of the past N
      histories of the portage tree.
    - when a client requests to update its portage
      tree, it tells the server its current
      history. e.g. say the client is currently
      located at logical time 1234567.
    - the server is maintaining only the past N
      histories:
        - if 1234567 is behind those maintained N
          ones, then the server sends a full
          portage tree from scratch.
        - if 1234567 is within those maintained N
          ones, then the server has two options:
            (1) either send all changes since
                1234567, as they happened
                historically. this is a bad idea.
                no good reason for it.

            (2) better: the server can send the
                compressed histories. compressed
                histories are computed once and
                cached, in a scalable way. the
                cache itself is incremental, so
                updating the cache is cheap
                (details in section 2.2).

                e.g. if there are 5000 histories
                that the client lacks since time
                1234567, then there is a chance
                that many of the changes are just
                a waste of time. e.g. add a file,
                then delete the same file, then
                add a different file again. so
                why not just lie about the
                history, and send only the last
                file, skipping the ones in the
                middle? the same applies to diffs
                of code blocks.

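a minimal sketch of the server-side decision described above, in python. everything here is an assumption made for illustration: the names `squash` and `handle_sync`, the response dicts, and the idea that one history is a map from path to new content (or None for a deletion). the proposal itself does not fix any of these.

```python
from typing import Dict, List, Optional

# one history entry: path -> new file content, or None if the file was deleted.
# (hypothetical representation, not part of the proposal)
Change = Dict[str, Optional[str]]


def squash(entries: List[Change]) -> Change:
    """Collapse consecutive histories into one: the last write to a path wins."""
    merged: Change = {}
    for entry in entries:       # oldest first
        merged.update(entry)    # later entries override earlier ones
    return merged


def handle_sync(client_clock: int, oldest_clock: int,
                log: List[Change], tree: Dict[str, str]) -> dict:
    """log[i] holds the changes that turn history oldest_clock + i into
    oldest_clock + i + 1; tree is the full current portage tree."""
    latest = oldest_clock + len(log)
    if client_clock < oldest_clock or client_clock > latest:
        # client is outside the retained window: send everything from scratch
        return {"kind": "full", "clock": latest, "files": dict(tree)}
    missing = log[client_clock - oldest_clock:]
    return {"kind": "delta", "clock": latest, "changes": squash(missing)}


# add a file, delete it, add a different file: only the net effect is sent
log = [{"a.ebuild": "v1"}, {"a.ebuild": None}, {"b.ebuild": "v2"}]
reply = handle_sync(1234567, 1234567, log, {"b.ebuild": "v2"})
# reply["changes"] == {"a.ebuild": None, "b.ebuild": "v2"}
```

note that the squashed delta may carry a deletion for a file the client never had (a.ebuild here), so a client would have to tolerate deleting a missing file.
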
2.1. properties of this new protocol
------------------------------------
so this new protocol has these properties:
    - unlike rsync, it doesn't need to compare all
      files individually.
    - unlike git, the history doesn't grow on the
      client. the client's history remains only a
      single number representing a logical clock.
    - the history on the server is limited to the
      past N entries. no devs will cry, because
      this is not a code collaboration app, but
      simply a file synchronisation app to replace
      rsync. so the admins are free to set N as
      small as they please, without worrying about
      harming collaborating devs.
    - the server has the option to send compressed
      histories to clients, and these histories
      are cacheable for better performance.


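to illustrate the client-side property (the only history the client keeps is one number), here is a rough python sketch. it assumes a made-up response format: a dict with "kind" ("full" or "delta"), the new "clock", and either "files" (path -> content) or "changes" (path -> content, or None for a deletion). the tree is an in-memory dict standing in for files on disk, and `apply_response` is a hypothetical name.

```python
from typing import Dict


def apply_response(tree: Dict[str, str], response: dict) -> int:
    """Update the local tree in place and return the new logical clock."""
    if response["kind"] == "full":
        # too far behind: replace the whole tree
        tree.clear()
        tree.update(response["files"])
    else:
        for path, content in response["changes"].items():
            if content is None:
                tree.pop(path, None)    # deletion; ignore files we never had
            else:
                tree[path] = content    # added or modified file
    return response["clock"]            # the only "history" the client stores


tree = {"a.ebuild": "v1", "c.ebuild": "v0"}
clock = apply_response(tree, {
    "kind": "delta",
    "clock": 1234570,
    "changes": {"a.ebuild": None, "b.ebuild": "v2"},
})
# clock == 1234570, tree == {"c.ebuild": "v0", "b.ebuild": "v2"}
```
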
2.2. how it will feel to admins/devs
------------------------------------
    - the devs simply commit their changes to the
      portage tree via git.
    - the git server will have hooks that execute
      an external command for this new protocol,
      which will calculate all the diffs necessary
      to build a new history.

      e.g. if the current history is 30000, and a
      dev makes a new commit via git, then the git
      hooks will execute the external command to
      calculate the diff for the files affected by
      the git commit, such that history 30001 is
      created.

      the hooked external command will also see if
      it can compress the histories for the past M
      entries up to 30001.

      so that clients that live in time 30001-M,
      who ask for 30001, can get the compressed
      history instead of the raw histories from
      30001-M to 30001.

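one possible way the hooked command could keep the compressed histories incremental, sketched in python. this is only an assumption about how it might work: `HistoryCache`, `on_commit` and the path -> content (None = deleted) change format are invented for the example, and a real hook would get the affected files from git rather than from a dict.

```python
from typing import Dict, Optional

Change = Dict[str, Optional[str]]  # path -> new content, None = deleted


class HistoryCache:
    """Keeps, for each of the last M clocks, the squashed changes up to latest."""

    def __init__(self, start_clock: int, m: int) -> None:
        self.latest = start_clock
        self.m = m
        # squashed[c] = everything a client at clock c needs to reach self.latest
        self.squashed: Dict[int, Change] = {}

    def on_commit(self, change: Change) -> None:
        """Called by the (hypothetical) git hook with the files a commit touched."""
        for delta in self.squashed.values():
            delta.update(change)                    # fold the new history into each cached delta
        self.squashed[self.latest] = dict(change)   # for clients at the previous clock
        self.latest += 1
        self.squashed.pop(self.latest - self.m - 1, None)  # drop clocks older than latest - M


cache = HistoryCache(start_clock=30000, m=2)
cache.on_commit({"a.ebuild": "v1"})                       # creates history 30001
cache.on_commit({"a.ebuild": None, "b.ebuild": "v2"})     # creates history 30002
# cache.squashed[30000] == {"a.ebuild": None, "b.ebuild": "v2"}
```

each commit costs one dict update per cached clock, so the work per commit is bounded by M times the size of the commit, not by the size of the tree.
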
ty,
cm.
