Re: [gentoo-user] an efficient idea for an alternative portage synchronisation - gentoo-user

From:	Michael Jones <gentoo@×××××××.com>
To:	gentoo-user@l.g.o
Subject:	Re: [gentoo-user] an efficient idea for an alternative portage synchronisation
Date:	Fri, 18 Jun 2021 14:17:07
Message-Id:	`CABfmKS+kPauptZRObeYiLDYJRMbdRyCYDUmZ5a42v_-r2OnTJA@mail.gmail.com`
In Reply to:	[gentoo-user] an efficient idea for an alternative portage synchronisation by "caveman رَجُلُ الْكَهْفِ 穴居人"

On Fri, Jun 18, 2021, 07:10 caveman رَجُلُ الْكَهْفِ 穴居人 <toraboracaveman@××××××××××.com> wrote:

> tl;dr - i'm suggesting a new file syncing protocol
> for portage syncing.  details are in
> section 2.
>
>
> 1. background
> -------------
> rsync needs to examine every file in order to
> compare them.  this is too expensive and doesn't
> scale as portage's tree grows in size.
>
> on the other hand, git gets around this by
> maintaining a history of edits.  so git doesn't
> need to compare all files, instead it walks
> through the history.
>
> but git has another issue:  the history getting
> too big.  this causes:
>     - `git clone` to needlessly take too long, as
>       many old histories become irrelevant as they
>       get fully overwritten by newer ones.
>     - this also causes `git pull` to be slower
>       than needed, as the history is not ideally
>       compressed.
>     - plus, the disk space that's wasted for
>       histories.
>
>
> 2. new protocol
> ---------------
> to solve the issues above, i think the ideal solution
> is this protocol:
>     - each history is a number representing a
>       logical clock.  1st history is 0, 2nd is 1,
>       etc.
>     - the server maintains a list of the past N
>       histories of the portage tree.
>     - when a client requests to update its portage
>       tree, it tells the server its current
>       history.  e.g. say the client is currently
>       located at logical time 1234567.
>     - the server is maintaining only the past N
>       histories:
>         - if 1234567 is behind those maintained N
>           ones, then the server sends a full
>           portage tree from scratch.
>         - if 1234567 is within those maintained N
>           ones, then the server has two options:
>             (1) either send all changes since
>                 1234567, as they happened
>                 historically.  this is a bad idea.
>                 no good reason for it.
>
>             (2) better: the server can send the
>                 compressed histories.  compressed
>                 histories are computed once, and
>                 cached, in a scalable way.  the
>                 cache itself is incremental, so
>                 updating the cache is cheap
>                 (details in section 2.2.).
>
>                 e.g. if there are 5000 histories
>                 that the client lacks since time
>                 1234567, then there is a chance
>                 that many of the changes are just
>                 a waste of time.  e.g. add a file,
>                 then delete the same file, then
>                 add a different file again.  so
>                 why not just lie about the
>                 history, and send the last file,
>                 skipping the ones in the middle?  same
>                 can be thought about diffs to code
>                 blocks.
>
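To make the decision in section 2 concrete, here is a minimal Python sketch. It is only an illustration of the quoted idea, not an existing tool: the names `SyncServer`, `squash` and `apply_update` are hypothetical, a "history" is modelled as a dict of path -> new content (None meaning "deleted"), and logical time 0 is assumed to be the empty tree.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: one history entry maps a path to its new content,
# or to None when the file was deleted.
ChangeSet = Dict[str, Optional[str]]


def squash(changes: List[ChangeSet]) -> ChangeSet:
    """Collapse consecutive histories, keeping only the last state per path."""
    merged: ChangeSet = {}
    for change in changes:
        merged.update(change)
    return merged


class SyncServer:
    """Hypothetical server that keeps the full tree plus the past N histories."""

    def __init__(self, max_histories: int) -> None:
        self.max_histories = max_histories  # N in the proposal
        self.clock = 0                      # logical time of the newest tree
        self.tree: Dict[str, str] = {}      # full current tree
        self.log: List[ChangeSet] = []      # at most N most recent histories

    def commit(self, change: ChangeSet) -> None:
        for path, content in change.items():
            if content is None:
                self.tree.pop(path, None)
            else:
                self.tree[path] = content
        self.log.append(change)
        if len(self.log) > self.max_histories:
            del self.log[0]                 # forget the oldest history
        self.clock += 1

    def sync(self, client_clock: int) -> Tuple[str, int, ChangeSet]:
        oldest = self.clock - len(self.log)  # earliest clock still covered
        if client_clock < oldest or client_clock > self.clock:
            full: ChangeSet = dict(self.tree)
            return "full", self.clock, full  # client is too far behind
        missing = self.log[client_clock - oldest:]
        return "delta", self.clock, squash(missing)  # option (2)


def apply_update(tree: Dict[str, str], kind: str, payload: ChangeSet) -> Dict[str, str]:
    """Client side: replace the tree on "full", patch it on "delta"."""
    result: Dict[str, str] = {} if kind == "full" else dict(tree)
    for path, content in payload.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result
```

In this sketch a file that is added and then deleted inside the window shows up in the delta only as a deletion, which the client can apply harmlessly even if it never had the file.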

> 2.1. properties of this new protocol
> ------------------------------------
> so this new protocol has these properties:
>     - unlike rsync, it doesn't need to compare all files
>       individually.
>     - unlike git, the history doesn't grow on the
>       client.  history remains only a single
>       number representing a logical clock.
>     - the history on the server is limited to N
>       past entries.  no devs will cry, because
>       this is not a code collaboration app, but
>       simply a file synchronisation app to replace
>       rsync.  so the admins are free to set N as
>       small as they please, without worrying about
>       harming collaborating devs.
>     - server has the option to send compressed
>       histories to clients, and these histories
>       are cacheable for more performance.
>
>
> 2.2. how it will feel to admins/devs
> ------------------------------------
>     - the devs simply commit their changes to the
>       portage tree via git.
>     - the git server will have hooks to execute an
>       external command for this new protocol, that
>       will calculate all diffs necessary in order
>       to build a new history.
>
>       e.g. if current history is 30000, and a dev
>       makes a new commit via git, then the git
>       hooks will execute the external command to
>       calculate the diff for the files affected by
>       the git commit, such that history 30001 is
>       created.
>
>       the hooked external command will also see if
>       it can compress the histories, for the past
>       M entries up to 30001.
>
>       so that clients that live in time 30001-M,
>       who ask for 30001, can get the compressed
>       history instead of raw actual histories from
>       30001-M to 30001.
>
> ty,
> cm
>
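The incremental cache from section 2.2 could look roughly like the sketch below. This is one possible reading of the proposal, not a description of any existing hook: `CompressedHistoryCache` and `on_commit` are hypothetical names, and a server-side git hook (for example post-receive) is assumed to turn each commit into a dict of path -> new content, with None for deleted files. For each of the last M logical times it keeps one already-squashed delta up to the newest time, so each new commit only has to be merged into M cached deltas.

```python
from collections import OrderedDict
from typing import Dict, Optional

# Assumption: one history entry maps a path to its new content,
# or to None when the file was deleted.
ChangeSet = Dict[str, Optional[str]]


class CompressedHistoryCache:
    """Hypothetical cache: one squashed delta per recent logical time."""

    def __init__(self, window: int) -> None:
        self.window = window  # M in the proposal
        self.clock = 0        # logical time of the newest history
        # deltas[t] takes a client from logical time t to self.clock
        self.deltas: "OrderedDict[int, ChangeSet]" = OrderedDict()

    def on_commit(self, change: ChangeSet) -> None:
        """Called once per commit, e.g. from a server-side git hook."""
        for delta in self.deltas.values():
            delta.update(change)               # later state wins per path
        self.deltas[self.clock] = dict(change)  # delta for the previous time
        self.clock += 1
        while len(self.deltas) > self.window:
            self.deltas.popitem(last=False)    # drop the oldest cached delta

    def delta_since(self, client_clock: int) -> Optional[ChangeSet]:
        """Return the cached delta, or None if a full tree must be sent."""
        if client_clock == self.clock:
            return {}
        cached = self.deltas.get(client_clock)
        return None if cached is None else dict(cached)
```

The cost per commit is proportional to M times the size of that commit, which is the "cheap" update the proposal hopes for; whether that holds for large commits is not something this sketch measures.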

It seems like you are almost asking for git's shallow clones (`git clone
--depth`), which Portage exposes as the clone-depth and sync-depth
settings in repos.conf.

It's not an exact match for your proposal, but it's very close.
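
For reference, the underlying git options are `git clone --depth <n>` and `git fetch --depth <n>`. A small Python sketch that only builds those command lines is below; the helper names and the URL in the usage note are placeholders, not anything Portage itself ships.

```python
from typing import List


def shallow_clone_command(url: str, dest: str, depth: int = 1) -> List[str]:
    """Build a `git clone` call that keeps only the last `depth` commits."""
    return ["git", "clone", "--depth", str(depth), url, dest]


def shallow_fetch_command(remote: str = "origin", depth: int = 1) -> List[str]:
    """Build a `git fetch` call that truncates history to `depth` commits."""
    return ["git", "fetch", "--depth", str(depth), remote]
```

Either list can be handed to `subprocess.run(cmd, check=True)`, e.g. `shallow_clone_command("https://example.org/gentoo.git", "/tmp/gentoo")` with a real repository URL in place of the placeholder.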


Gentoo Archives: gentoo-user