Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors - gentoo-dev

From:	Evan Powers <powers.161@×××.edu>
To:	gentoo-dev@g.o
Subject:	Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
Date:	Thu, 15 May 2003 23:46:18
Message-Id:	`200305151939.36168.powers.161@osu.edu`
In Reply to:	Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors by bkhl@privat.utfors.se (=?iso-8859-1?q?Bj=F6rn_Lindstr=F6m?=)

1

On Thursday 15 May 2003 01:10 pm, Björn Lindström wrote:

2

> Stanislav Brabec <utx@g.o> writes:

3

> > Once you download "official tar.gz" (you have to have identical bit

4

> > image; or tar set - to be more patient to  machines with few memory) and

5

> > later download only incremetal deltas.

6

>

7

> Wouldn't this still break pretty easily as soon as you change anything

8

> in your local portage copy?

9

10

Yeah, I'll second this viewpoint. Managing generation and storage of these 

11

xdeltas on the server side and their application on the client side would be 

12

more pain than it's worth, in my opinion.

13

14

I have a more interesting problem to pose, however. I haven't actually worked 

15

out the math to see if it's a practical problem (and I couldn't without 

16

real-world numbers), but it's still sufficiently interesting to post.

17

18

Practically, would switching to xdelta result in /greater/ server load?

19

20

The summary of the following is that rsync has a certain overhead p, while the 

21

overhead of xdelta depends on the minimum period between each xdelta--the 

22

greater the time separation, the smaller the overhead. But people want to 

23

sync with a certain frequency, and /have to/ sync with another frequency. 

24

Presumably the greatest achievable efficiency with xdelta isn't too much 

25

greater than the greatest achievable efficiency with rsync (based on what I 

26

know of the rsync algorithm). Therefore it's quite possible that xdelta has 

27

more overhead at the "want to" and "have to" frequencies than rsync.

28

29

Let's say we have a portage tree A, the official one, and B, some user's. Let 

30

t be time. A(t) is constantly changing, and the user wants his B(t) to always 

31

be approximately equal to A(t) within some error factor.

32

33

Let Dr(A(ta),B(tb)) be the amount of data transferred by rsync between A and 

34

B's locations, and let Dx be defined similarly for xdelta. Lets further make 

35

the simplifying assumption that Dr=(1+p)*Dx, where p has some constant value 

36

when averaged over all users syncing their trees (p stands for percentage).

37

38

To accomplish his within-some-error goal, the user periodically synchronizes 

39

his B(t) with the value of A(t) at that moment. Before the synchronization, 

40

B(tb) = A(t0), where t0 is the present and tb < t0.

41

42

Consider rsync. He starts up one rsync connection, which computes some delta 

43

Dr(A(t0),B(tb)) and transfers it. Now B(t0) = A(t0) with some very small 

44

error, since A(t) constantly evolves.

45

46

Taken in aggregate, the server "spends" 1 connection per sync per person and 

47

Dr bytes of bandwidth.

48

49

Consider xdelta. Say xdeltas are made periodically every T1 and T2 units of 

50

time. If you last synced longer than T2 units of time ago, you have to 

51

download the entire portage tree again.

52

53

He downloads the delta list from somewhere (1 connection). Several things can 

54

now happen:

55

* 0 < t0-tb < T1

56

	he must download on average N1 new T1 xdeltas, at average size S1

57

* T1 < t0-tb < T2

58

	he must revert some of his T1 xdeltas

59

	download 1 new T2 delta at average size S2

60

	download N2 new T1 deltas

61

* T2 < t0-tb

62

	he must download 1 new portage tree at average size S3

63

64

Okay, so the server spends either

65

* 1+N1 connections and N1*S1 bytes

66

* 2+N2 connections and N2*S1+S2 bytes

67

* 2 connection and S3 bytes

68

(ignoring the size of the delta list)

69

70

Say the probabilities of each of these three situations with an arbitrary user 

71

are P1, P2, and P3 respectively.

72

73

Taken in aggregate, the server spends P1*N1+P2*(1+N2)+P3 connections per sync 

74

per person and Dx_r = P1*N1*S1+P2*(N2*S1+S2)+P3*S3 bytes of bandwidth per 

75

sync per person. (Dx_r stands for Dx realized).

76

77

So, when is Dr < Dx_r?

78

79

The trivial solutions:

80

81

1) Disk space is "worth" a lot on the servers. (More under #3.)

82

83

2) Connections are "worth" a lot to the servers.

84

85

3) Appropriately chosen values of P1, P2, and P3 can make Dr < Dx_r. The 

86

solution is to add a T3, T4, ..., Tn until Pn is sufficiently small. But this 

87

might not be feasible, since additional levels of deltas increase the size of 

88

the data each portage tree server must store considerably. (It ought to be 

89

exponential with the number of levels, but I haven't worked that out.) This 

90

probably isn't a major problem, you could store the larger deltas only on the 

91

larger servers.

92

93

The fascinating solution:

94

95

4) Note that Dx_r != Dx, and in fact might be considerably greater. The reason 

96

is that if I change something in the tree and then T1 time later change the 

97

same thing again, there's overlap in two deltas. 2*S1 > S2. Moreover, this 

98

sort of overhead is intrinsic: one delta between two times far apart is 

99

always smaller than many deltas between two times far apart. You want to 

100

compute xdeltas as infrequently as possible, but you don't have that 

101

option--the minimum error between A(t0) and B(t0) can't be too great.

102

103

Rsync's algorithm can always manage Dr=p*Dx, irregardless of the size of the 

104

time difference tb-t0. (Remember Dx is the optimal delta size for that time 

105

difference.)

106

107

To achieve very small errors, you have to make lots of xdeltas with small time 

108

differences. But as the time differences increase, the amount of overlap 

109

increases. So Dx_r becomes a better approximation for Dx as the time 

110

difference tb-t0 increases, and as tb-t0 decreases it becomes increasingly 

111

likely that Dr < Dx_r.

112

113

Stratifying your deltas (i.e., times T1, T2, etc.) can mitigate this 

114

disadvantage, but you pay for that mitigation in nonlinear growth in the 

115

amount of data you have to store on the server as the maximum period of your 

116

deltas increases.

117

118

So, in summary, there's /always/ at least one zero to the rsync overhead minus 

119

xdelta overhead function. Rsync is always better for some regions of real 

120

world situations, and xdelta is always better for others. The question is, 

121

which region is Gentoo in?

122

123

I don't think that question has an obvious answer. It depends on many things, 

124

one of them being whether xdelta is dramatically better than rsync for the 

125

kinds of modifications people make to portage, and another being how much the 

126

disk space on and connections to the portage mirrors are really "worth".

127

128

Evan

129

130

--

131

gentoo-dev@g.o mailing list

1	On Thursday 15 May 2003 01:10 pm, Björn Lindström wrote:
2	> Stanislav Brabec <utx@g.o> writes:
3	> > Once you download "official tar.gz" (you have to have identical bit
4	> > image; or tar set - to be more patient to machines with few memory) and
5	> > later download only incremetal deltas.
6	>
7	> Wouldn't this still break pretty easily as soon as you change anything
8	> in your local portage copy?
9
10	Yeah, I'll second this viewpoint. Managing generation and storage of these
11	xdeltas on the server side and their application on the client side would be
12	more pain than it's worth, in my opinion.
13
14	I have a more interesting problem to pose, however. I haven't actually worked
15	out the math to see if it's a practical problem (and I couldn't without
16	real-world numbers), but it's still sufficiently interesting to post.
17
18	Practically, would switching to xdelta result in /greater/ server load?
19
20	The summary of the following is that rsync has a certain overhead p, while the
21	overhead of xdelta depends on the minimum period between each xdelta--the
22	greater the time separation, the smaller the overhead. But people want to
23	sync with a certain frequency, and /have to/ sync with another frequency.
24	Presumably the greatest achievable efficiency with xdelta isn't too much
25	greater than the greatest achievable efficiency with rsync (based on what I
26	know of the rsync algorithm). Therefore it's quite possible that xdelta has
27	more overhead at the "want to" and "have to" frequencies than rsync.
28
29	Let's say we have a portage tree A, the official one, and B, some user's. Let
30	t be time. A(t) is constantly changing, and the user wants his B(t) to always
31	be approximately equal to A(t) within some error factor.
32
33	Let Dr(A(ta),B(tb)) be the amount of data transferred by rsync between A and
34	B's locations, and let Dx be defined similarly for xdelta. Lets further make
35	the simplifying assumption that Dr=(1+p)*Dx, where p has some constant value
36	when averaged over all users syncing their trees (p stands for percentage).
37
38	To accomplish his within-some-error goal, the user periodically synchronizes
39	his B(t) with the value of A(t) at that moment. Before the synchronization,
40	B(tb) = A(t0), where t0 is the present and tb < t0.
41
42	Consider rsync. He starts up one rsync connection, which computes some delta
43	Dr(A(t0),B(tb)) and transfers it. Now B(t0) = A(t0) with some very small
44	error, since A(t) constantly evolves.
45
46	Taken in aggregate, the server "spends" 1 connection per sync per person and
47	Dr bytes of bandwidth.
48
49	Consider xdelta. Say xdeltas are made periodically every T1 and T2 units of
50	time. If you last synced longer than T2 units of time ago, you have to
51	download the entire portage tree again.
52
53	He downloads the delta list from somewhere (1 connection). Several things can
54	now happen:
55	* 0 < t0-tb < T1
56	he must download on average N1 new T1 xdeltas, at average size S1
57	* T1 < t0-tb < T2
58	he must revert some of his T1 xdeltas
59	download 1 new T2 delta at average size S2
60	download N2 new T1 deltas
61	* T2 < t0-tb
62	he must download 1 new portage tree at average size S3
63
64	Okay, so the server spends either
65	* 1+N1 connections and N1*S1 bytes
66	* 2+N2 connections and N2*S1+S2 bytes
67	* 2 connection and S3 bytes
68	(ignoring the size of the delta list)
69
70	Say the probabilities of each of these three situations with an arbitrary user
71	are P1, P2, and P3 respectively.
72
73	Taken in aggregate, the server spends P1N1+P2(1+N2)+P3 connections per sync
74	per person and Dx_r = P1N1S1+P2(N2S1+S2)+P3*S3 bytes of bandwidth per
75	sync per person. (Dx_r stands for Dx realized).
76
77	So, when is Dr < Dx_r?
78
79	The trivial solutions:
80
81	1) Disk space is "worth" a lot on the servers. (More under #3.)
82
83	2) Connections are "worth" a lot to the servers.
84
85	3) Appropriately chosen values of P1, P2, and P3 can make Dr < Dx_r. The
86	solution is to add a T3, T4, ..., Tn until Pn is sufficiently small. But this
87	might not be feasible, since additional levels of deltas increase the size of
88	the data each portage tree server must store considerably. (It ought to be
89	exponential with the number of levels, but I haven't worked that out.) This
90	probably isn't a major problem, you could store the larger deltas only on the
91	larger servers.
92
93	The fascinating solution:
94
95	4) Note that Dx_r != Dx, and in fact might be considerably greater. The reason
96	is that if I change something in the tree and then T1 time later change the
97	same thing again, there's overlap in two deltas. 2*S1 > S2. Moreover, this
98	sort of overhead is intrinsic: one delta between two times far apart is
99	always smaller than many deltas between two times far apart. You want to
100	compute xdeltas as infrequently as possible, but you don't have that
101	option--the minimum error between A(t0) and B(t0) can't be too great.
102
103	Rsync's algorithm can always manage Dr=p*Dx, irregardless of the size of the
104	time difference tb-t0. (Remember Dx is the optimal delta size for that time
105	difference.)
106
107	To achieve very small errors, you have to make lots of xdeltas with small time
108	differences. But as the time differences increase, the amount of overlap
109	increases. So Dx_r becomes a better approximation for Dx as the time
110	difference tb-t0 increases, and as tb-t0 decreases it becomes increasingly
111	likely that Dr < Dx_r.
112
113	Stratifying your deltas (i.e., times T1, T2, etc.) can mitigate this
114	disadvantage, but you pay for that mitigation in nonlinear growth in the
115	amount of data you have to store on the server as the maximum period of your
116	deltas increases.
117
118	So, in summary, there's /always/ at least one zero to the rsync overhead minus
119	xdelta overhead function. Rsync is always better for some regions of real
120	world situations, and xdelta is always better for others. The question is,
121	which region is Gentoo in?
122
123	I don't think that question has an obvious answer. It depends on many things,
124	one of them being whether xdelta is dramatically better than rsync for the
125	kinds of modifications people make to portage, and another being how much the
126	disk space on and connections to the portage mirrors are really "worth".
127
128	Evan
129
130	--
131	gentoo-dev@g.o mailing list

Gentoo Archives: gentoo-dev