tl;dr - i'm suggesting a new file syncing protocol
for portage syncing. the details are in section 2.


1. background
-------------
rsync has to examine every file in the tree in
order to compare the two sides. this is too
expensive and doesn't scale as portage's tree
grows in size.

git, on the other hand, avoids this by maintaining
a history of edits. so git doesn't need to compare
all files; instead it walks through the history.

but git has another issue: the history gets too
big. this causes:
- `git clone` takes needlessly long, as many old
  histories become irrelevant once they get fully
  overwritten by newer ones.
- `git pull` is slower than needed, as the
  history is not ideally compressed.
- plus, disk space is wasted on old histories.


2. new protocol
---------------
to solve the issues above, i think the ideal
solution is this protocol:
- each history is identified by a number
  representing a logical clock. the 1st history
  is 0, the 2nd is 1, etc.
- the server keeps a list of the N most recent
  histories of the portage tree.
- when a client requests to update its portage
  tree, it tells the server its current history.
  e.g. say the client is currently at logical
  time 1234567.
- the server keeps only the past N histories, so:
    - if 1234567 is behind those N histories, the
      server sends a full portage tree from
      scratch.
    - if 1234567 is within those N histories, the
      server has two options:
        (1) either send all changes since
            1234567, as they happened
            historically. this is a bad idea.
            there is no good reason for it.

        (2) better: the server can send a
            compressed history. compressed
            histories are computed once and
            cached, in a scalable way. the cache
            itself is incremental, so updating
            it is cheap (details in section
            2.2).

            e.g. if the client lacks 5000
            histories since time 1234567, then
            there is a chance that many of the
            changes are just a waste of time.
            e.g. add a file, then delete the
            same file, then add a different
            file. so why not just lie about the
            history, and send only the last
            file, skipping the ones in the
            middle? the same can be done for
            diffs to code blocks.

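a minimal sketch of the server side described above, in python, under my own assumptions: a change is a hypothetical `(op, path, data)` tuple with op `"put"` or `"del"`, and the names `Server`, `commit`, `respond` and `squash` are made up for illustration. it only shows the choice between a full tree and a compressed delta, nothing about transport or caching:

```python
def squash(histories):
    """collapse several histories into one: keep only the last
    operation seen for each path (assumed compression rule)."""
    last = {}
    for changes in histories:
        for op, path, data in changes:
            last[path] = (op, path, data)
    return [last[p] for p in sorted(last)]


class Server:
    def __init__(self, n, tree=None):
        self.n = n                    # how many past histories to keep
        self.head = 0                 # logical clock; 1st history is 0
        self.tree = dict(tree or {})  # path -> file contents
        self.log = {}                 # clock -> changes that produced it

    def commit(self, changes):
        changes = list(changes)
        for op, path, data in changes:
            if op == "put":
                self.tree[path] = data
            else:
                self.tree.pop(path, None)
        self.head += 1
        self.log[self.head] = changes
        # forget the history that just fell out of the last N
        self.log.pop(self.head - self.n, None)

    def respond(self, client_clock):
        missing = range(client_clock + 1, self.head + 1)
        if client_clock > self.head or any(c not in self.log for c in missing):
            # client is behind the kept histories: full tree
            return ("full", self.head, dict(self.tree))
        # client is within the kept histories: compressed delta
        return ("delta", self.head, squash(self.log[c] for c in missing))
```

with N=3 and the commits "add a", "delete a", "add b", a client at time 0 gets a delta holding only `del a` and `put b`. the leftover `del a` is harmless as long as the client ignores deletes of paths it never had; dropping it on the server would need more bookkeeping than this sketch does.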
2.1. properties of this new protocol
------------------------------------
so this new protocol has these properties:
- unlike rsync, it doesn't need to compare all
  files individually.
- unlike git, the history doesn't grow on the
  client. the client's history remains a single
  number representing a logical clock.
- the history on the server is limited to the N
  most recent entries. no devs will cry, because
  this is not a code collaboration app, but
  simply a file synchronisation app to replace
  rsync. so the admins are free to set N as small
  as they please, without worrying about harming
  collaborating devs.
- the server has the option to send compressed
  histories to clients, and these histories are
  cacheable for more performance.


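to illustrate the client-side properties above, here is a rough python sketch of a client whose whole "history" is one integer. the `Client` class, the `("full" | "delta", clock, payload)` reply shape and the `(op, path, data)` change tuples are assumptions of mine, not a defined wire format:

```python
class Client:
    def __init__(self):
        self.clock = -1   # assumed marker for "no tree yet"
        self.tree = {}    # path -> file contents

    def apply(self, kind, clock, payload):
        """apply one server reply and move to the server's clock."""
        if kind == "full":
            # payload is the whole tree: replace everything
            self.tree = dict(payload)
        elif kind == "delta":
            # payload is a list of (op, path, data) changes
            for op, path, data in payload:
                if op == "put":
                    self.tree[path] = data
                elif op == "del":
                    # tolerate deletes of paths we never had
                    self.tree.pop(path, None)
                else:
                    raise ValueError("unknown op: %r" % (op,))
        else:
            raise ValueError("unknown reply kind: %r" % (kind,))
        self.clock = clock  # the only history the client keeps
```

no file comparison happens on the client, and nothing accumulates besides `clock`.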
2.2. how it will feel to admins/devs
------------------------------------
- the devs simply commit their changes to the
  portage tree via git.
- the git server will have hooks that execute an
  external command for this new protocol, which
  calculates all the diffs necessary to build a
  new history.

  e.g. if the current history is 30000, and a dev
  makes a new commit via git, then the git hooks
  execute the external command to calculate the
  diff for the files affected by the git commit,
  such that history 30001 is created.

  the hooked external command will also see if it
  can compress the histories for the past M
  entries up to 30001.

  so that clients that live in time 30001-M, who
  ask for 30001, can get the compressed history
  instead of the raw histories from 30001-M to
  30001.

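one way the hooked command could keep the compressed histories incremental, sketched in python. everything here is an assumption for illustration: the `SquashCache` name, the `(op, path, data)` change tuples, and the rule that a newer operation on a path replaces an older one. the list of changed paths might come from something like `git diff-tree -r --name-status <commit>`, but that wiring is left out:

```python
class SquashCache:
    """cache[k] holds the compressed changes that take a client
    from clock k to the current head, for the last M clocks."""

    def __init__(self, m, head=0):
        self.m = m
        self.head = head
        self.cache = {}   # clock -> {path: (op, data)}

    def on_commit(self, changes):
        """called once per git commit with that commit's changes."""
        overlay = {path: (op, data) for op, path, data in changes}
        # incremental update: fold the new commit into every cached
        # entry; the newest operation on a path wins
        for squashed in self.cache.values():
            squashed.update(overlay)
        # clients sitting at the old head need just this commit
        self.cache[self.head] = dict(overlay)
        self.head += 1
        # keep entries only for clocks head-M .. head-1
        self.cache.pop(self.head - self.m - 1, None)

    def delta_for(self, client_clock):
        """compressed changes for a client, or None if the client is
        outside the cached window and needs another answer."""
        if client_clock == self.head:
            return {}
        return self.cache.get(client_clock)
```

each commit costs one dictionary update per cached entry, so the work per commit depends on M and on the size of the commit, not on the size of the tree.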
ty,
cm.