On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:
> 2) Presumably, the CPU load on the server will be a lot better for
> zsync scheme than for rsync: the client does _all_ the computation,
> server only pushes files. I suspect this will make the rsync servers
> bandwidth bound rather than CPU bound, but more testing is required
> before we have hard numbers on this.

AFAIK (infra would be the ones to comment), load on the servers
isn't a massive issue at this point. Everything is run out of tmpfs.
That said, better solutions are obviously preferable.

> 3) You'll download only one file (an .ISO) and you can actually just
> mount this on /usr/portage (or wherever you want your PORTDIR).

This part is valid.

> If you have (g)cloop installed, it may even be mounted over a
> compressed loopback. A full ISO of the porttree is ~300MB,
> compressed it's ~29MB.

This part, however, isn't. Note the portion of zsync's docs about
mapping compressed segments back to the uncompressed data, and about
using a modified gzip that restarts the compressed stream periodically
to make that mapping workable.

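To make that concrete, here's a minimal sketch of the restart trick,
using Python's zlib rather than zsync's actual patched gzip; a
Z_FULL_FLUSH every N bytes resets the compressor state, so each
segment decompresses independently and a local change stays local:

    import zlib

    def segmented_compress(data, segment=64 * 1024):
        """Compress data, resetting deflate state every `segment` bytes."""
        comp = zlib.compressobj()
        out = []
        for i in range(0, len(data), segment):
            out.append(comp.compress(data[i:i + segment]))
            # Z_FULL_FLUSH emits pending output and resets the state,
            # so later segments don't back-reference earlier ones.
            out.append(comp.flush(zlib.Z_FULL_FLUSH))
        out.append(comp.flush())
        return b"".join(out)

The price is a slightly worse compression ratio, since each segment
starts over with an empty dictionary.
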
If you have gcloop abusing the same gzip tweak, sure, it'll work,
although I guarantee the compressed -> uncompressed mapping is going
to cost more bandwidth than you'd like (you have to fetch the whole
compressed segment, not just the bytes that have changed). If you're
*not* doing any compressed stream resetting/restarting, read below
(it gets worse :)

> 4) It's easy to add more image formats to the server. If you compress
> the porttree snapshot into squashfs, the resulting image is
> ~22MB, and this may be mounted directly, as recent gentoo-dev-sources
> has squashfs support built-in.

Squashfs is even worse- you lose the compressed -> uncompressed
mapping entirely. Change a single byte in the input, and likely all
bytes after that point in the compressed stream are now different.
So... without an equivalent of the gzip segmenting hack, you're going
to pay through the teeth on updates.

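You can see the avalanche effect with a few lines of Python (plain
zlib standing in for squashfs's compressor, which it isn't; this is
just the failure mode a block-matching tool runs into):

    import zlib

    a = (b"x" * 50000) + b"0" + (b"y" * 50000)
    b = (b"x" * 50000) + b"1" + (b"y" * 50000)
    ca, cb = zlib.compress(a), zlib.compress(b)
    # First compressed byte that differs between the two outputs:
    prefix = next((i for i, (x, y) in enumerate(zip(ca, cb)) if x != y),
                  min(len(ca), len(cb)))
    print(len(ca), len(cb), prefix)

One changed input byte, and everything in the compressed stream from
that point on is scrambled- so nearly the whole image gets re-fetched.
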
So... at this point, zsync is basically applicable to snapshots,
since fundamentally that's what it works on. There are a couple of
flaws/issues, though.

Tarball entries are rounded up to the nearest multiple of 512 bytes
for the file data, plus an additional 512 bytes for the tar header.
If you're using blocksizes above (basically) 1KB for the zsync
checksum index, you lose the ability to update individual files-
actually, you already lost it, because zsync requires two matching
blocks, side by side. So that's more bandwidth, beyond just pulling
the control file.

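The framing math, for the curious (ustar numbers; the 2KB ebuild is
just a representative size):

    def tar_entry_size(file_size):
        """Bytes a file occupies inside a tar archive: 512-byte
        header plus data padded up to a multiple of 512."""
        return 512 + ((file_size + 511) // 512) * 512

    print(tar_entry_size(2048))  # -> 2560

A ~2KB ebuild occupies 2560 bytes in the tarball- not even enough
room for the two adjacent 4KB blocks zsync needs to match, so one
changed file dirties its neighbours as well.
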
A better solution (imo), at least for the snapshot folk, is doing
static delta snapshots- generate a delta every day, basically.

So... with a 4KB blocksize for zsync's index, and ignoring all other
bandwidth costs (eg, the actual updating), the zsync control file is
around 750KB. The per-day delta for diffball-generated patches is
around 150KB avg- which means the user must have let at *least* 5
days go by before there is even the possibility of zsync edging out
static deltas.

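The arithmetic, using the figures above:

    CONTROL_FILE = 750   # KB, zsync control index at a 4KB blocksize
    DELTA_PER_DAY = 150  # KB, average diffball patch

    days = 1
    while days * DELTA_PER_DAY <= CONTROL_FILE:
        days += 1
    print(days)  # -> 6: six days of patches before the patches alone
                 # outweigh the control file, and the control file is
                 # only zsync's *starting* cost.
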
For a user who 'syncs' via emerge-webrsync daily, the update is only
150KB avg compressed, 200KB tops. The zsync control file at a 4KB
blocksize is over 700KB- and the concerns outlined above, about the
blocksize being larger than the actual 'quanta' of change, basically
mean the control file should be more fine-grained, 2KB fex, or lower.
That'll drive the control file's size up even further... and again,
this isn't accounting for the *actual* updates, just the initial data
pulled so zsync can figure out *what* needs updating.

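Rough scaling of that control-file cost with blocksize; the ~10
bytes/block of checksum data is an assumption on my part (a weak rsum
plus a truncated strong checksum per block), picked because it
reproduces the ~750KB figure for a ~300MB tree at 4KB blocks:

    TREE_SIZE = 300 * 1024 * 1024  # uncompressed porttree, bytes
    BYTES_PER_BLOCK = 10           # assumed checksum overhead per block

    for blocksize in (4096, 2048, 1024):
        blocks = TREE_SIZE // blocksize
        print(blocksize, blocks * BYTES_PER_BLOCK // 1024, "KB")
    # 4096 -> 750KB, 2048 -> 1500KB, 1024 -> 3000KB

Halving the blocksize doubles the block count, so a 2KB blocksize
pushes the control file toward ~1.5MB before any data is fetched.
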
Despite all the issues registered above, I *do* see a use for a
remote syncing prog for snapshots- static deltas require that the
base 'version' be known, so the appropriate patches can be grabbed.
Basically, a 2.6.10->2.6.11 patch applied against a 2.6.9 tarball
isn't going to give you 2.6.11. Static deltas are a heck of a lot
more efficient, but they require a bit more care in setting up-
say, if webrsync hasn't been run in a month or so. At some point,
from a mirroring standpoint, it probably would be easiest to forget
about trying patches and just go the zsync route.

In terms of bandwidth, you'd need to find the point where the control
file's cost is amortized and zsync edges deltas out- and to help lower
that point, the file being synced *really* should not be compressed.
Despite how nifty/easy compressing it sounds, it's only going to jack
up the amount of data fetched. So... that's costlier bandwidth-wise.

Personally, I think the best solution is having a daily full tarball,
plus patches for N days back that patch up to the full version.
Using a high estimate, the delay between syncs would have to be well
over 2 months before grabbing the full tarball is cheaper than
grabbing patches.

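Ballpark for that claim- the snapshot size here is my assumption (the
compressed snapshot tarball runs somewhere in the ~20MB range), the
patch size is the high estimate from above:

    FULL_SNAPSHOT = 20 * 1024  # KB, compressed daily snapshot (assumed)
    PATCH_HIGH = 200           # KB/day, high estimate per daily patch

    # Days out of date before the full tarball is the cheaper fetch:
    print(FULL_SNAPSHOT / PATCH_HIGH)  # -> 102.4 days
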
Meanwhile, I'm curious at what point zdelta matches static deltas in
terms of # of days between syncs :)

> 5) The zsync program itself only relies on glibc, though it does not
> support https, socks and other fancy stuff.
>
>
> On the downside, as Portage does not have pluggable rsync (at least not
> without further patching), you won't be able to do FEATURES="zsync"
> emerge sync.

On a sidenote, the SYNC syntax in cvs head is a helluva lot more
powerful than the current stable format; adding new formats/URI hooks
is doable.

If people are trying to dodge the cost of untarring and rsyncing for
snapshots, well... you're trying to dodge crappy code, frankly. The
algorithm/approach used there is kind of ass-backwards.

There's no reason the intersection of the snapshot tarball's file set
and the set of files in the portdir can't be computed, all other
files ixnayed, and the tarball then untarred directly onto the tree.

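Hand-wavy sketch of that approach (untested; PORTDIR and the
assumption that member names are relative to the tree root are mine,
this isn't Portage code):

    import os
    import tarfile

    PORTDIR = "/usr/portage"

    def sync_from_snapshot(snapshot_path):
        with tarfile.open(snapshot_path) as tar:
            wanted = set(m.name for m in tar.getmembers())
            # Prune files on disk the snapshot doesn't contain
            # (directory cleanup omitted for brevity)...
            for root, dirs, files in os.walk(PORTDIR, topdown=False):
                for fname in files:
                    path = os.path.join(root, fname)
                    if os.path.relpath(path, PORTDIR) not in wanted:
                        os.unlink(path)
            # ...then untar straight onto the tree; no temp copy,
            # no rsync pass.
            tar.extractall(PORTDIR)
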
That would be quite a bit quicker, mainly since it avoids the temp
untarring and the rather wasteful rsync call.

Or... just have the repository module run directly off of the tarball,
with an additional pregenerated index of file -> offset (that's a
ways off, but something I intend to try at some point).
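
The offset index is easy to pregenerate; a rough cut, leaning on
tarfile's offset_data attribute (works against the uncompressed
tarball- seeking into a .bz2 obviously won't hand you raw file data):

    import tarfile

    def build_index(snapshot_path):
        """Map member name -> (data offset, size) within the tar."""
        index = {}
        with tarfile.open(snapshot_path) as tar:
            for m in tar:
                if m.isfile():
                    index[m.name] = (m.offset_data, m.size)
        return index

    def read_member(snapshot_path, index, name):
        offset, size = index[name]
        with open(snapshot_path, "rb") as f:
            f.seek(offset)
            return f.read(size)
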
~harring |