
From: Brian Harring <ferringb@g.o>
To: gentoo-dev@××××××××××××.org
Subject: Re: [gentoo-dev] Proposal for an alternative portage tree sync method
Date: Fri, 25 Mar 2005 07:57:24
Message-Id: 20050325075720.GB30900@freedom.wit.com
In Reply to: Re: [gentoo-dev] Proposal for an alternative portage tree sync method by Karl Trygve Kalleberg
On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:
> 2) Presumably, the CPU load on the server will be a lot better for
> zsync scheme than for rsync: the client does _all_ the computation,
> server only pushes files. I suspect this will make the rsync servers
> bandwidth bound rather than CPU bound, but more testing is required
> before we have hard numbers on this.

Afaik (and infra would be the ones to comment), load on the servers
isn't a massive issue at this point; everything is run out of tmpfs.
That said, better solutions are obviously preferable.

> 3) You'll download only one file (an .ISO) and you can actually just
> mount this on /usr/portage (or wherever you want your PORTDIR).

This part is valid.

> If you have (g)cloop installed, it may even be mounted over a
> compressed loopback. A full ISO of the porttree is ~300MB,
> compressed it's ~29MB.

This part, however, isn't. Note the portion of zsync's docs about
mapping compressed segments back to the uncompressed data, and about
using a modified gzip that restarts the compressed stream every so
often to make that mapping feasible.
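
Purely as an illustration of that "restartable" compression idea (an
assumed example using plain zlib and Z_FULL_FLUSH, not zsync's modified
gzip): flushing the deflate stream at fixed intervals creates boundaries
where decompression can resume, which is what makes a compressed-to-
uncompressed block mapping workable at all.

    import zlib

    CHUNK = 64 * 1024   # restart interval; purely an assumed figure

    def segmented_compress(data):
        comp = zlib.compressobj()
        segments = []            # (uncompressed_offset, compressed_bytes) pairs
        for off in range(0, len(data), CHUNK):
            piece = comp.compress(data[off:off + CHUNK])
            # Z_FULL_FLUSH resets the compressor's state, so each chunk
            # boundary is a point where decoding can restart without the
            # preceding bytes.
            piece += comp.flush(zlib.Z_FULL_FLUSH)
            segments.append((off, piece))
        segments.append((len(data), comp.flush()))    # finish the stream
        return segments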

If you have gcloop abusing the same gzip tweak, sure, it'll work,
although I guarantee the compressed -> uncompressed mapping is going to
add more bandwidth than you'd like (you have to fetch the whole
compressed segment, not just the bytes that have changed). If you're
*not* doing any compressed stream resetting/restarting, read below (it
gets worse :)

> 4) It's easy to add more image formats to the server. If you compress
> the porttree snapshot into squashfs, the resulting image is
> ~22MB, and this may be mounted directly, as recent gentoo-dev-sources
> has squashfs support built-in.

Squashfs is even worse- you lose the compressed -> uncompressed mapping
entirely. Change a single byte in a compressed stream, and likely all
bytes after that point are now different. So without an equivalent of
the gzip segmenting hack, you're going to pay through the teeth on
updates.
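
A tiny assumed illustration of that point (plain zlib, nothing to do
with squashfs itself): flip one byte early in the input and see how
little of the compressed output stays byte-identical.

    import zlib

    data = b"sys-apps/portage: assorted tree contents, repeated\n" * 20000
    changed = bytearray(data)
    changed[100] ^= 0xFF                      # flip a single byte near the start

    a = zlib.compress(data, 9)
    b = zlib.compress(bytes(changed), 9)

    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    print("compressed sizes:", len(a), "vs", len(b), "- shared prefix:", common, "bytes")
    # Past the shared prefix the streams diverge, so a block-checksum scan
    # of the compressed image finds next to nothing it can reuse.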

So... this is basically applicable (at this point) to snapshots, since
fundamentally that's what it works on. Couple of flaws/issues though.
Tarball entries are rounded up to the nearest multiple of 512 for the
file size, plus an additional 512 for the tar header. If you're using
blocksizes above (basically) 1KB for the zsync checksum index, you lose
the ability to update individual files- actually, you already lost it,
because zsync requires two matching blocks, side by side. So that's
more bandwidth, beyond just pulling the control file.
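
To put numbers on that 512-byte rounding, a back-of-the-envelope sketch
(the sample file sizes are assumed):

    TAR_BLOCK = 512

    def tar_footprint(file_size):
        # One 512-byte header, plus the data rounded up to whole 512-byte blocks.
        data_blocks = -(-file_size // TAR_BLOCK)     # ceiling division
        return TAR_BLOCK + data_blocks * TAR_BLOCK

    for size in (600, 2048, 4000):                   # assumed sample ebuild sizes
        print(size, "->", tar_footprint(size), "bytes on the tarball")
    # 600 -> 1536, 2048 -> 2560, 4000 -> 4608.  With zsync wanting two
    # adjacent matching blocks, a blocksize much above ~1KB straddles
    # neighbouring files' headers and padding, so one changed file dirties
    # more of the tarball than its own bytes.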

A better solution (imo), at least for the snapshot folk, is doing
static delta snapshots: generate a delta every day, basically.

So... assuming a 4KB blocksize for the zsync checksum index, and
ignoring all other bandwidth costs (eg, the actual updating), the zsync
control file is around 750KB. The delta per day for diffball-generated
patches is around 150KB avg- that means the user must have let at
*least* 5 days go by before there's even the possibility of zsync
edging out static deltas.
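
The arithmetic behind that, using the same (approximate) figures:

    # Rough break-even using the figures above (all approximate).
    control_file_kb = 750        # zsync control index at a 4KB blocksize
    delta_per_day_kb = 150       # average diffball daily delta
    print(control_file_kb / delta_per_day_kb)
    # -> 5.0: the user has to be at least ~5 days behind before zsync's
    #    control file alone stops costing more than the accumulated deltas,
    #    and that's before zsync fetches any actual file data.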

For a user who 'syncs' via emerge-webrsync daily, the compressed update
is only 150KB avg, 200KB tops. The zsync control file at a 4KB
blocksize is over 700KB- and the concerns outlined above, about the
block size being larger than the actual 'quanta' of change, basically
mean the control file should be more fine-grained, 2KB for example, or
lower. That'll drive the control file's size up even further... and
again, this isn't accounting for the *actual* updates, just the initial
data pulled so it can figure out *what* needs to be updated.

Despite all the issues registered above, I *do* see a use for a remote
syncing prog for snapshots- static deltas require that the base
'version' be known, so an appropriate patch can be grabbed. Basically,
a 2.6.10->2.6.11 patch applied against a 2.6.9 tarball isn't going to
give you 2.6.11. Static deltas are a heck of a lot more efficient, but
they require a bit more care in setting them up. Basically... say the
webrsync hasn't been run in a month or so. At some point, from a
mirroring standpoint, it probably would be easiest to forget about
trying patches and just go the zsync route.

In terms of bandwidth, you'd need to find the point where the control
file's cost is amortized and zsync edges deltas out- to help lower that
point, the file being synced *really* should not be compressed; however
nifty/easy compression sounds, it's only going to jack up the amount of
data fetched. So... that's costlier bandwidth-wise.

Personally, I'd think the best solution is having a daily full tarball,
and patches for N days back to patch up to the full version. Using a
high estimate, the delay between syncs would have to be well over 2
months for it to be cheaper to grab the full tarball rather than
patches.
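
A sketch of that estimate, reusing the figures quoted earlier (~29MB for
the compressed tree, 150-200KB/day of patches; all approximate):

    # Days of accumulated daily patches before they outweigh one full snapshot.
    full_snapshot_kb = 29 * 1024            # ~29MB compressed snapshot
    for delta_per_day_kb in (150, 200):     # average and high-end daily patches
        days = full_snapshot_kb // delta_per_day_kb
        print(delta_per_day_kb, "KB/day ->", days, "days of patches")
    # Roughly 148-197 days before the patches cost more than one full
    # snapshot, i.e. comfortably past the "well over 2 months" mark.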

Meanwhile, I'm curious at what point zdelta matches static deltas in
terms of # of days between syncs :)

> 5) The zsync program itself only relies on glibc, though it does not
> support https, socks and other fancy stuff.
>
>
> On the downside, as Portage does not have pluggable rsync (at least not
> without further patching), you won't be able to do FEATURES="zsync"
> emerge sync.

On a sidenote, the SYNC syntax in cvs head is a helluva lot more
powerful than the current stable format; adding new formats/URI hooks
in is doable.

If people are after trying to dodge the cost of untarring and rsync'ing
for snapshots, well... you're trying to dodge crappy code, frankly. The
algorithm/approach used there is kind of ass backwards.

There's no reason the intersection of the snapshot tarball's file set
and the set of files in the portdir can't be computed, all other files
ixnayed, and the tarball then untarred directly onto the tree.
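
Something along these lines, as a rough sketch (assumed layout and
helper name, not existing portage code):

    # Sketch only: delete whatever the snapshot no longer carries, then
    # untar straight into the tree, skipping the temp untar + rsync pass.
    import os
    import tarfile

    def sync_from_snapshot(snapshot_path, portdir):
        with tarfile.open(snapshot_path) as tar:
            members = tar.getmembers()
            # Assumes entries are stored relative to the tree root; a real
            # snapshot has a leading 'portage/' component to strip.
            wanted = {os.path.normpath(m.name) for m in members if m.isfile()}

            # Drop files present in PORTDIR but absent from the snapshot.
            for root, dirs, files in os.walk(portdir, topdown=False):
                for name in files:
                    path = os.path.join(root, name)
                    if os.path.relpath(path, portdir) not in wanted:
                        os.unlink(path)

            # Untar directly over the tree; existing files get overwritten.
            tar.extractall(portdir, members=members)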

That would be quite a bit quicker, mainly since it avoids the temporary
untarring and the rather wasteful rsync call.

Or... just have the repository module run directly off of the tarball,
with an additional pregenerated index of file -> offset (that's a ways
off, but something I intend to try at some point).
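
As a rough illustration of what that could look like (a sketch, not an
existing module), Python's tarfile already exposes enough to pregenerate
a file -> offset index over an uncompressed snapshot and read entries by
seeking:

    # Sketch: build a file -> (offset, size) index for an *uncompressed*
    # snapshot tarball, then serve individual files by seeking into it.
    import tarfile

    def build_index(tarball_path):
        index = {}
        with tarfile.open(tarball_path) as tar:
            for member in tar:
                if member.isfile():
                    # offset_data points at the file's contents, past the header.
                    index[member.name] = (member.offset_data, member.size)
        return index

    def read_file(tarball_path, index, name):
        offset, size = index[name]
        with open(tarball_path, "rb") as f:
            f.seek(offset)
            return f.read(size)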

~harring
