[gentoo-user] an efficient idea for an alternative portage synchronisation - gentoo-user

From:	"caveman رَجُلُ الْكَهْفِ 穴居人" <toraboracaveman@××××××××××.com>
To:	Gentoo <gentoo-user@l.g.o>
Subject:	[gentoo-user] an efficient idea for an alternative portage synchronisation
Date:	Fri, 18 Jun 2021 12:10:39
Message-Id:	`W39UY79gTTnkYBA-829kjiWYRPxelVDQq1r9_DiK-R3zu7I4RbJ7s3l-freaWBbsSk6JnFf5cRFv2L0cc7kaSxJezUyQG3iY4-2i0dNpKpc=@protonmail.com`

tl;dr - i'm suggesting a new file syncing protocol
for portage syncing.  the details are in
section 2.

1. background
-------------
rsync needs to walk every file in the tree in
order to compare them (metadata by default, full
contents with --checksum).  this is too expensive
and doesn't scale as portage's tree grows in size.

on the other hand, git avoids this by
maintaining a history of edits.  so git doesn't
need to compare all files; instead it walks
through the history.

but git has another issue:  the history getting
too big.  this causes:
    - `git clone` to needlessly take too long, as
      many old histories become irrelevant once
      they get fully overridden by newer ones.
    - `git pull` to be slower than needed, as the
      history is not ideally compressed.
    - disk space to be wasted on histories.

2. new protocol
---------------
to solve the issues above, i think the ideal
solution is this protocol:
    - each history is a number representing a
      logical clock.  1st history is 0, 2nd is 1,
      etc.
    - the server maintains a list of the past N
      histories of the portage tree.
    - when a client requests to update its portage
      tree, it tells the server its current
      history.  e.g. say the client is currently
      located at logical time 1234567.
    - the server is maintaining only the past N
      histories:
        - if 1234567 is behind those maintained N
          ones, then the server sends a full
          portage tree from scratch.
        - if 1234567 is within those maintained N
          ones, then the server has two options:
            (1) either send all changes since
                1234567, as they happened
                historically.  this is a bad idea.
                no good reason for it.

            (2) better: the server can send the
                compressed histories.  compressed
                histories are computed once and
                cached, in a scalable way.  the
                cache itself is incremental, so
                updating the cache is cheap
                (details in section 2.2).

                e.g. if there are 5000 histories
                that the client lacks since time
                1234567, then there is a chance
                that many of the changes are just
                a waste of time.  e.g. add a file,
                then delete the same file, then
                add a different file again.  so
                why not just lie about the
                history, and send only the last
                file, skipping the ones in the
                middle?  the same can be applied
                to diffs of code blocks.
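
the server-side decision and the "lie about the history" step above could be sketched roughly like this in python.  this is only an illustration under assumptions: the names `Change`, `squash` and `plan_sync` are made up, a history is assumed to be a list of whole-file changes, and "within the maintained N" is read as `client_clock >= server_clock - n`, which the protocol description does not pin down.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One file-level change inside a history (hypothetical layout)."""
    path: str
    content: Optional[bytes]  # None means the file was deleted


def squash(histories: list[list[Change]]) -> dict[str, Optional[bytes]]:
    """Collapse consecutive histories (oldest first) into one delta.

    Only the final state of each path survives, so add/delete/add
    sequences in the middle are never sent to the client.
    """
    final: dict[str, Optional[bytes]] = {}
    for history in histories:
        for change in history:
            final[change.path] = change.content  # later changes win
    return final


def plan_sync(client_clock: int, server_clock: int, n: int) -> str:
    """Decide what the server sends to a client at `client_clock`.

    Assumption: the server keeps the n histories ending at
    `server_clock`, so it can serve any client at or after
    `server_clock - n`.
    """
    if client_clock == server_clock:
        return "up-to-date"
    if client_clock < server_clock - n:
        return "full-tree"
    return "compressed-delta"
```

with this reading, a file that was added and later deleted ends up as a single "delete" entry, which is harmless for a client that never had the file.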

2.1. properties of this new protocol
------------------------------------
so this new protocol has these properties:
    - unlike rsync, it doesn't need to compare all
      files individually.
    - unlike git, the history doesn't grow on the
      client.  the history remains only a single
      number representing a logical clock.
    - the history on the server is limited to N
      past entries.  no devs will cry, because
      this is not a code collaboration app, but
      simply a file synchronisation app to replace
      rsync.  so the admins are free to set N as
      small as they please, without worrying about
      harming collaborating devs.
    - the server has the option to send compressed
      histories to clients, and these histories
      are cacheable for more performance.
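
to make the "single number on the client" property concrete, here is a small python sketch of what the client side might look like.  everything here is assumed for illustration: the `.sync-clock` file name, the use of -1 for "never synced", and the delta format (path mapped to new content, or `None` for a deletion).  a real client would also have to validate paths coming from the server.

```python
from pathlib import Path
from typing import Optional

CLOCK_FILE = ".sync-clock"  # hypothetical name for the client's only state


def read_clock(tree: Path) -> int:
    """Return the client's logical time, or -1 if it has never synced."""
    clock_path = tree / CLOCK_FILE
    return int(clock_path.read_text()) if clock_path.exists() else -1


def apply_delta(tree: Path, delta: dict[str, Optional[bytes]], new_clock: int) -> None:
    """Apply a (possibly compressed) delta, then move the clock forward."""
    for rel_path, content in delta.items():
        target = tree / rel_path
        if content is None:
            target.unlink(missing_ok=True)  # deleting an absent file is fine
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(content)
    # the clock is written last, so an interrupted sync is simply retried
    (tree / CLOCK_FILE).write_text(str(new_clock))
```

no history is kept on the client at all: after a sync, the only thing it remembers is `new_clock`.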

2.2. how it will feel to admins/devs
------------------------------------
    - the devs simply commit their changes to the
      portage tree via git.
    - the git server will have hooks that execute
      an external command for this new protocol,
      which will calculate all diffs necessary in
      order to build a new history.

      e.g. if the current history is 30000, and a
      dev makes a new commit via git, then the git
      hooks will execute the external command to
      calculate the diff for the files affected by
      the git commit, such that history 30001 is
      created.

      the hooked external command will also see if
      it can compress the histories for the past
      M entries up to 30001.

      this way, clients that live in time 30001-M
      and ask for 30001 can get the compressed
      history instead of the raw actual histories
      from 30001-M to 30001.
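
one possible way the hooked command could keep the compressed histories incrementally is sketched below in python.  this is a guess at a workable design, not a spec: `DeltaCache`, `commit` and `delta_for` are invented names, and the changes of a commit are assumed to be already extracted by the hook (e.g. from the output of `git diff-tree -r --name-status`).

```python
from typing import Optional

Delta = dict[str, Optional[bytes]]  # path -> new content, None = deleted


class DeltaCache:
    """Keeps, for each of the last m clocks, one squashed delta to 'now'."""

    def __init__(self, m: int, clock: int = 0) -> None:
        self.m = m
        self.clock = clock                 # latest history number
        self.since: dict[int, Delta] = {}  # base clock -> delta up to self.clock

    def commit(self, changes: Delta) -> int:
        """Called by the hook for each new commit; returns the new clock."""
        # incremental update: fold the new changes into every cached delta
        for delta in self.since.values():
            delta.update(changes)
        # clients sitting at the previous clock need exactly these changes
        self.since[self.clock] = dict(changes)
        self.clock += 1
        # forget bases older than the last m histories
        while len(self.since) > self.m:
            oldest = next(iter(self.since))  # dicts keep insertion order
            del self.since[oldest]
        return self.clock

    def delta_for(self, client_clock: int) -> Optional[Delta]:
        """Squashed delta for a client, or None if it needs a full tree."""
        if client_clock == self.clock:
            return {}
        return self.since.get(client_clock)
```

each commit costs about M small dictionary updates, and answering a client is a single lookup, which is one reading of "the cache itself is incremental".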

ty,
cm.
