On Thursday 15 May 2003 01:10 pm, Björn Lindström wrote:
> Stanislav Brabec <utx@g.o> writes:
> > Once you download "official tar.gz" (you have to have identical bit
> > image; or tar set - to be more patient to machines with few memory) and
> > later download only incremetal deltas.
>
> Wouldn't this still break pretty easily as soon as you change anything
> in your local portage copy?

Yeah, I'll second this viewpoint. Managing generation and storage of these
xdeltas on the server side and their application on the client side would be
more pain than it's worth, in my opinion.

I have a more interesting problem to pose, however. I haven't actually worked
out the math to see if it's a practical problem (and I couldn't without
real-world numbers), but it's still sufficiently interesting to post.

Practically, would switching to xdelta result in /greater/ server load?

The summary of the following is that rsync has a certain overhead p, while the
overhead of xdelta depends on the minimum period between each xdelta--the
greater the time separation, the smaller the overhead. But people want to
sync with a certain frequency, and /have to/ sync with another frequency.
Presumably the greatest achievable efficiency with xdelta isn't too much
greater than the greatest achievable efficiency with rsync (based on what I
know of the rsync algorithm). Therefore it's quite possible that xdelta has
more overhead at the "want to" and "have to" frequencies than rsync.

Let's say we have a portage tree A, the official one, and B, some user's. Let
t be time. A(t) is constantly changing, and the user wants his B(t) to always
be approximately equal to A(t) within some error factor.

Let Dr(A(ta),B(tb)) be the amount of data transferred by rsync between A and
B's locations, and let Dx be defined similarly for xdelta. Let's further make
the simplifying assumption that Dr=(1+p)*Dx, where p has some constant value
when averaged over all users syncing their trees (p is rsync's percentage
overhead).

To accomplish his within-some-error goal, the user periodically synchronizes
his B(t) with the value of A(t) at that moment. Before the synchronization,
B(t0) = B(tb) = A(tb), where t0 is the present and tb < t0 is the time of the
last sync.

Consider rsync. He starts up one rsync connection, which computes some delta
Dr(A(t0),B(tb)) and transfers it. Now B(t0) = A(t0) with some very small
error, since A(t) constantly evolves.

Taken in aggregate, the server "spends" 1 connection per sync per person and
Dr bytes of bandwidth.

Consider xdelta. Say xdeltas are made periodically, every T1 and every T2
units of time (with T1 < T2). If you last synced longer than T2 units of time
ago, you have to download the entire portage tree again.

He downloads the delta list from somewhere (1 connection). Several things can
now happen:
* 0 < t0-tb < T1
  he must download on average N1 new T1 xdeltas, at average size S1
* T1 < t0-tb < T2
  he must revert some of his T1 xdeltas,
  download 1 new T2 delta at average size S2, and
  download N2 new T1 deltas
* T2 < t0-tb
  he must download 1 new portage tree at average size S3

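The three cases above can be sketched as a small function (a toy sketch only;
T1, T2 and all the counts and sizes here are invented parameters, not real
portage numbers):

```python
# Toy sketch of the three sync cases; every parameter value is invented.

def sync_cost(age, T1, T2, N1, S1, N2, S2, S3):
    """Return (connections, bytes) for a user whose tree is `age` units old.

    Includes the 1 connection spent fetching the delta list.
    """
    if age < T1:
        return (1 + N1, N1 * S1)        # N1 small T1 deltas
    elif age < T2:
        return (2 + N2, N2 * S1 + S2)   # one T2 delta plus N2 T1 deltas
    else:
        return (2, S3)                  # refetch the whole tree

# Example: deltas every 1 and every 24 hours, with made-up sizes.
print(sync_cost(0.5, T1=1, T2=24, N1=3, S1=10, N2=2, S2=100, S3=5000))
```
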
Okay, so the server spends either
* 1+N1 connections and N1*S1 bytes,
* 2+N2 connections and N2*S1+S2 bytes, or
* 2 connections and S3 bytes
(ignoring the size of the delta list)

Say the probabilities of each of these three situations for an arbitrary user
are P1, P2, and P3 respectively.

Taken in aggregate, the server spends 1 + P1*N1 + P2*(1+N2) + P3 connections
per sync per person (the leading 1 being the delta-list connection) and
Dx_r = P1*N1*S1 + P2*(N2*S1+S2) + P3*S3 bytes of bandwidth per sync per
person. (Dx_r stands for Dx realized.)

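Plugging toy numbers into the two byte costs makes the comparison concrete
(every value below is invented for illustration; p is the assumed rsync
overhead from earlier):

```python
# Toy comparison of the two aggregate byte costs; all numbers invented.

def dx_realized(P1, P2, P3, N1, S1, N2, S2, S3):
    """Dx_r = P1*N1*S1 + P2*(N2*S1+S2) + P3*S3, as derived above."""
    return P1 * N1 * S1 + P2 * (N2 * S1 + S2) + P3 * S3

def dr(Dx, p):
    """Rsync's cost under the simplifying assumption Dr = (1+p)*Dx."""
    return (1 + p) * Dx

# Suppose the optimal delta Dx is 80 units and rsync's overhead is 25%.
print(dr(80, 0.25))                                    # 100.0
print(dx_realized(0.8, 0.15, 0.05, N1=3, S1=10, N2=2, S2=100, S3=5000))
```

With these particular made-up probabilities the occasional full-tree
downloads dominate Dx_r and rsync comes out ahead; different numbers flip the
result, which is exactly the question posed below.
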
So, when is Dr < Dx_r?

The trivial solutions:

1) Disk space is "worth" a lot on the servers. (More under #3.)

2) Connections are "worth" a lot to the servers.

3) Appropriately chosen values of P1, P2, and P3 can make Dr < Dx_r. The
solution is to add a T3, T4, ..., Tn until Pn is sufficiently small. But this
might not be feasible, since additional levels of deltas considerably
increase the amount of data each portage tree server must store. (It ought to
be exponential in the number of levels, but I haven't worked that out.) This
probably isn't a major problem, though; you could store the larger deltas
only on the larger servers.

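One way to put a rough number on that storage cost is to pick a size model
and sum over the levels. The sketch below assumes a delta's size grows
linearly with the time span it covers, capped at the full-tree size; that
model and every number in it are my assumptions, not measurements:

```python
# Rough server-side storage estimate for n levels of deltas.  The size
# model (delta size = rate * span, capped at the full tree) is assumed.

def storage(levels, base_period, factor, retention, full_tree, rate):
    """Total bytes kept when level i has period base_period * factor**i
    and the server retains `retention` time units of history per level."""
    total = 0.0
    for i in range(levels):
        period = base_period * factor ** i
        size = min(rate * period, full_tree)   # assumed delta size
        total += (retention / period) * size   # deltas kept at this level
    return total

for n in (1, 2, 3, 4):
    print(n, storage(n, base_period=1.0, factor=4.0,
                     retention=64.0, full_tree=500.0, rate=10.0))
```
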
The fascinating solution:

4) Note that Dx_r != Dx, and in fact might be considerably greater. The reason
is that if I change something in the tree and then T1 time later change the
same thing again, there's overlap between the two deltas: 2*S1 > S2.
Moreover, this sort of overhead is intrinsic: one delta spanning two times
far apart is always smaller than a chain of many deltas covering the same
interval. You want to compute xdeltas as infrequently as possible, but you
don't have that option--the error between A(t0) and B(t0) can't be allowed
to grow too great.

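The overlap effect is easy to demonstrate with any delta encoder. Here
difflib's unified diffs stand in for xdelta (an assumption for illustration:
xdelta works on binaries, but the argument is identical) -- two successive
edits to the same line cost more as two deltas than as one:

```python
# Demonstrating 2*S1 > S2: two edits to the same line, measured as two
# consecutive diffs versus one combined diff.  difflib stands in for
# xdelta; the file contents are invented.

import difflib

v0 = ["DEPEND=foo\n", "SRC_URI=a\n"]
v1 = ["DEPEND=bar\n", "SRC_URI=a\n"]   # change at time t
v2 = ["DEPEND=baz\n", "SRC_URI=a\n"]   # same line changed again at t+T1

def delta_size(old, new):
    """Size in bytes of a unified diff between two file versions."""
    return sum(len(line) for line in difflib.unified_diff(old, new))

two_small = delta_size(v0, v1) + delta_size(v1, v2)  # two T1 deltas
one_big = delta_size(v0, v2)                         # one T2 delta
print(two_small, one_big)  # the intermediate edit is pure overhead
```
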
Rsync's algorithm can always manage Dr=(1+p)*Dx, regardless of the size of
the time difference t0-tb. (Remember Dx is the optimal delta size for that
time difference.)

To achieve very small errors, you have to make lots of xdeltas with small
time differences. But the smaller those time differences, the greater the
overlap. So Dx_r becomes a better approximation for Dx as the time
difference t0-tb increases, and as t0-tb decreases it becomes increasingly
likely that Dr < Dx_r.

Stratifying your deltas (i.e., times T1, T2, etc.) can mitigate this
disadvantage, but you pay for that mitigation in nonlinear growth in the
amount of data you have to store on the server as the maximum period of your
deltas increases.

So, in summary, there's /always/ at least one zero of the rsync-overhead-
minus-xdelta-overhead function. Rsync is always better in some regions of
real-world situations, and xdelta is always better in others. The question
is, which region is Gentoo in?

I don't think that question has an obvious answer. It depends on many things,
one of them being whether xdelta is dramatically better than rsync for the
kinds of modifications people make to portage, and another being how much the
disk space on and connections to the portage mirrors are really "worth".

Evan

--
gentoo-dev@g.o mailing list