Gentoo Archives: gentoo-dev

From: Evan Powers <powers.161@×××.edu>
To: gentoo-dev@g.o
Subject: Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
Date: Thu, 15 May 2003 23:46:18
In Reply to: Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors by Björn Lindström
On Thursday 15 May 2003 01:10 pm, Björn Lindström wrote:
> Stanislav Brabec <utx@g.o> writes:
>
> > Once you download "official tar.gz" (you have to have identical bit
> > image; or tar set - to be more patient to machines with few memory) and
> > later download only incremental deltas.
>
> Wouldn't this still break pretty easily as soon as you change anything
> in your local portage copy?

Yeah, I'll second this viewpoint. Managing generation and storage of these xdeltas on the server side and their application on the client side would be more pain than it's worth, in my opinion.

I have a more interesting problem to pose, however. I haven't actually worked out the math to see if it's a practical problem (and I couldn't without real-world numbers), but it's still sufficiently interesting to post.

Practically, would switching to xdelta result in /greater/ server load?

The summary of the following is that rsync has a certain overhead p, while the overhead of xdelta depends on the minimum period between each xdelta--the greater the time separation, the smaller the overhead. But people want to sync with a certain frequency, and /have to/ sync with another frequency. Presumably the greatest achievable efficiency with xdelta isn't too much greater than the greatest achievable efficiency with rsync (based on what I know of the rsync algorithm). Therefore it's quite possible that xdelta has more overhead at the "want to" and "have to" frequencies than rsync.

Let's say we have a portage tree A, the official one, and B, some user's. Let t be time. A(t) is constantly changing, and the user wants his B(t) to always be approximately equal to A(t) within some error factor.

Let Dr(A(ta),B(tb)) be the amount of data transferred by rsync between A and B's locations, and let Dx be defined similarly for xdelta. Let's further make the simplifying assumption that Dr=(1+p)*Dx, where p has some constant value when averaged over all users syncing their trees (p stands for percentage).

To accomplish his within-some-error goal, the user periodically synchronizes his B(t) with the value of A(t) at that moment. Before the synchronization, B(tb) = A(tb), where tb is the time of his last sync, t0 is the present, and tb < t0.

Consider rsync. He starts up one rsync connection, which computes some delta Dr(A(t0),B(tb)) and transfers it. Now B(t0) = A(t0) with some very small error, since A(t) constantly evolves.
Taken in aggregate, the server "spends" 1 connection per sync per person and Dr bytes of bandwidth.

Consider xdelta. Say xdeltas are made periodically every T1 and T2 units of time, with T1 < T2. If you last synced longer than T2 units of time ago, you have to download the entire portage tree again.

He downloads the delta list from somewhere (1 connection). Several things can now happen:

* 0 < t0-tb < T1
    he must download on average N1 new T1 xdeltas, at average size S1
* T1 < t0-tb < T2
    he must revert some of his T1 xdeltas
    download 1 new T2 delta at average size S2
    download N2 new T1 deltas
* T2 < t0-tb
    he must download 1 new portage tree at average size S3

Okay, so the server spends either

* 1+N1 connections and N1*S1 bytes
* 2+N2 connections and N2*S1+S2 bytes
* 2 connections and S3 bytes

(ignoring the size of the delta list)

Say the probabilities of each of these three situations with an arbitrary user are P1, P2, and P3 respectively, so that P1+P2+P3 = 1. Taken in aggregate, the server spends P1*(1+N1)+P2*(2+N2)+2*P3 = 1+P1*N1+P2*(1+N2)+P3 connections per sync per person and Dx_r = P1*N1*S1+P2*(N2*S1+S2)+P3*S3 bytes of bandwidth per sync per person. (Dx_r stands for Dx realized).

So, when is Dr < Dx_r?

The trivial solutions:

1) Disk space is "worth" a lot on the servers. (More under #3.)

2) Connections are "worth" a lot to the servers.

3) Appropriately chosen values of P1, P2, and P3 can make Dr < Dx_r. The solution is to add a T3, T4, ..., Tn until Pn is sufficiently small. But this might not be feasible, since additional levels of deltas increase the size of the data each portage tree server must store considerably. (It ought to be exponential with the number of levels, but I haven't worked that out.) This probably isn't a major problem; you could store the larger deltas only on the larger servers.

The fascinating solution:

4) Note that Dx_r != Dx, and in fact might be considerably greater.
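
To make point 4 concrete, here is a toy sketch in Python. It is not xdelta's algorithm: it assumes a delta simply carries the full new content of every file that changed, and the snapshots, paths, and helper names (delta, apply_delta, delta_size) are all made up for illustration. It compares fetching two consecutive short-period deltas against fetching one delta over the whole span.

```python
from typing import Dict, Optional

Tree = Dict[str, str]             # path -> file content
Delta = Dict[str, Optional[str]]  # path -> new content, or None if deleted


def delta(old: Tree, new: Tree) -> Delta:
    """Toy file-level delta (an assumption, not xdelta's byte-level format)."""
    changes: Delta = {p: c for p, c in new.items() if old.get(p) != c}
    changes.update({p: None for p in old if p not in new})
    return changes


def apply_delta(tree: Tree, d: Delta) -> Tree:
    """Return a copy of tree with the delta applied."""
    result = dict(tree)
    for path, content in d.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def delta_size(d: Delta) -> int:
    """Bytes on the wire in this toy model: path plus new content."""
    return sum(len(p) + len(c or "") for p, c in d.items())


# Hypothetical snapshots of the official tree at t, t+T1 and t+2*T1.
FOO = "cat/foo/foo-1.0.ebuild"
BAR = "cat/bar/bar-2.0.ebuild"
a0 = {FOO: "KEYWORDS='~x86'", BAR: "KEYWORDS='x86'"}
a1 = {**a0, FOO: "KEYWORDS='x86'"}                                 # foo changes
a2 = {**a1, FOO: "KEYWORDS='x86 ppc'", BAR: "KEYWORDS='x86 ppc'"}  # foo changes again

chained = delta_size(delta(a0, a1)) + delta_size(delta(a1, a2))  # two short deltas
direct = delta_size(delta(a0, a2))                               # one long delta

# Both routes end at the same tree; the chain sends foo's ebuild twice.
assert apply_delta(apply_delta(a0, delta(a0, a1)), delta(a1, a2)) == a2
assert apply_delta(a0, delta(a0, a2)) == a2
print("chained:", chained, "direct:", direct)
```

In this toy model the chain costs more whenever the same file is touched in both periods; if the two periods touch disjoint files, the two totals come out equal.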
The reason Dx_r can exceed Dx is that if I change something in the tree and then change the same thing again T1 later, the two deltas overlap: 2*S1 > S2 (taking T2 = 2*T1, say). Moreover, this sort of overhead is intrinsic: one delta between two times far apart is never larger, and usually smaller, than a chain of many deltas covering the same span. You want to compute xdeltas as infrequently as possible, but you don't have that option--the minimum error between A(t0) and B(t0) can't be too great.

Rsync's algorithm can always manage Dr=(1+p)*Dx, regardless of the size of the time difference t0-tb. (Remember Dx is the optimal delta size for that time difference.)

To achieve very small errors, you have to make lots of xdeltas with small time differences. But as the time differences increase, the amount of overlap increases. So Dx_r becomes a better approximation for Dx as the time difference t0-tb increases, and as t0-tb decreases it becomes increasingly likely that Dr < Dx_r.

Stratifying your deltas (i.e., times T1, T2, etc.) can mitigate this disadvantage, but you pay for that mitigation in nonlinear growth in the amount of data you have to store on the server as the maximum period of your deltas increases.

So, in summary, there's /always/ at least one zero to the rsync overhead minus xdelta overhead function. Rsync is always better for some regions of real-world situations, and xdelta is always better for others. The question is, which region is Gentoo in?

I don't think that question has an obvious answer. It depends on many things, one of them being whether xdelta is dramatically better than rsync for the kinds of modifications people make to portage, and another being how much the disk space on and connections to the portage mirrors are really "worth".

Evan

--
gentoo-dev@g.o mailing list
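
For anyone who wants to plug in numbers, the per-sync accounting above can be written down as a short Python sketch. The class and function names are invented here, Dr=(1+p)*Dx is the simplifying assumption from the model, the connection count weights the per-case figures (1+N1, 2+N2, 2) by their probabilities, and every number in the example is a placeholder rather than a measurement of real mirrors.

```python
from dataclasses import dataclass


@dataclass
class XdeltaScheme:
    """Hypothetical parameters for the two-level (T1, T2) scheme above."""
    p1: float  # probability that 0 < t0-tb < T1
    p2: float  # probability that T1 < t0-tb < T2
    p3: float  # probability that T2 < t0-tb (full tree download)
    n1: float  # average number of T1 deltas fetched in case 1
    n2: float  # average number of T1 deltas fetched in case 2
    s1: float  # average size of a T1 delta, in bytes
    s2: float  # average size of a T2 delta, in bytes
    s3: float  # average size of a full tree, in bytes

    def __post_init__(self) -> None:
        # The three cases are treated as exhaustive.
        if abs(self.p1 + self.p2 + self.p3 - 1.0) > 1e-9:
            raise ValueError("p1 + p2 + p3 must equal 1")

    def connections(self) -> float:
        """Expected connections per sync, including the delta-list fetch."""
        return (self.p1 * (1 + self.n1)
                + self.p2 * (2 + self.n2)
                + self.p3 * 2)

    def bytes_per_sync(self) -> float:
        """Dx_r: expected bytes per sync, ignoring the delta list's size."""
        return (self.p1 * self.n1 * self.s1
                + self.p2 * (self.n2 * self.s1 + self.s2)
                + self.p3 * self.s3)


def rsync_bytes(dx: float, p: float) -> float:
    """Dr = (1+p)*Dx; rsync always uses exactly one connection."""
    return (1 + p) * dx


# Placeholder numbers only, chosen to show how the comparison is made.
scheme = XdeltaScheme(p1=0.7, p2=0.25, p3=0.05,
                      n1=3, n2=2,
                      s1=50_000, s2=120_000, s3=20_000_000)
dx_r = scheme.bytes_per_sync()
dr = rsync_bytes(dx=400_000, p=0.2)

print("xdelta connections per sync:", scheme.connections())
print("xdelta bytes per sync (Dx_r):", dx_r)
print("rsync bytes per sync (Dr):", dr)
print("rsync cheaper in bytes:", dr < dx_r)
```

With these made-up numbers the full-tree download in the third case accounts for most of Dx_r, which is why point 3 suggests adding levels until the last probability is small.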