Gentoo Archives: gentoo-dev

From: Karl Trygve Kalleberg <karltk@g.o>
To: gentoo-dev@××××××××××××.org
Subject: Re: [gentoo-dev] Proposal for an alternative portage tree sync method
Date: Sat, 26 Mar 2005 12:47:23
Message-Id: 42455987.1000505@gentoo.org
In Reply to: Re: [gentoo-dev] Proposal for an alternative portage tree sync method by Brian Harring
Brian Harring wrote:
> On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:
>
>> If you have (g)cloop installed, it may even be mounted over a
>> compressed loopback. A full ISO of the porttree is ~300MB,
>> compressed it's ~29MB.
>
> This part, however, isn't. Note the portion of zsync's docs
> referencing doing compressed segment mapping to uncompressed, and
> using a modified gzip that restarts segments occasionally to help with
> this.
>
> If you have gcloop abusing the same gzip tweak, sure, it'll work,
> although I guarantee the comp. -> uncomp. mapping is going to add more
> bandwidth than you'd like (you have to fetch the whole compressed segment,
> not just the bytes that have changed). If you're *not* doing any
> compressed stream resetting/restarting, read below (it gets worse :)

> Squashfs is even worse - you lose the compressed -> uncompressed mapping.
> You change a single byte in a compressed stream, and likely all bytes
> after that point are now different. So... without an equivalent to the
> gzip segmenting hack, you're going to pay through the teeth on updates.

Yeah, we noticed that a zsync of a modified squashfs image requires ~50%
of the new file to be downloaded. Not exactly proportional to the change.
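
You can see what Brian means with plain zlib from Python: flip a single
byte near the front of the input and the two compressed streams stop
matching almost immediately, which is exactly what defeats any
block-matching scheme over a monolithic compressed image. A throwaway
illustration (the input data is obviously made up):

    import zlib

    # Two inputs that are identical except for the very first byte.
    filler = b"app-editors/vim/vim-6.3.ebuild, just some filler text\n" * 500
    a = b"x" + filler
    b = b"y" + filler

    ca, cb = zlib.compress(a), zlib.compress(b)

    # How long do the two compressed streams stay byte-for-byte identical?
    prefix = 0
    for x, y in zip(ca, cb):
        if x != y:
            break
        prefix += 1

    print(len(ca), len(cb), "bytes compressed,", prefix, "bytes of common prefix")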

> So... this basically is applicable (at this point) to snapshots,
> since fundamentally that's what it works on. Couple of flaws/issues
> though.
> Tarball entries are rounded up to the nearest multiple of 512 for the
> file size, plus an additional 512 for the tar header. If, for the
> zsync chksum index, you're using blocksizes above (basically) 1kb,
> you lose the ability to update individual files - actually, you already
> lost it, because zsync requires two matching blocks, side by side.
> So that's more bandwidth, beyond just pulling the control file.

Actually, packing the tree in squashfs _without_ compression shaved off
about 800 bytes per file. Having a tarball of the porttree is obviously
plain stupid, as the overhead is about as big as the content itself.
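
For the curious, the per-file tar tax is easy to estimate from Python.
This is only a rough sketch: it walks whatever tree you point it at
(here /usr/portage) and ignores end-of-archive padding and long-name
extensions:

    import os

    def tar_overhead(tree):
        """Estimate the dead weight in a plain tar of a tree: every entry
        costs a 512-byte header, and file data is padded up to the next
        512-byte boundary."""
        files = content = overhead = 0
        for root, dirnames, filenames in os.walk(tree):
            overhead += 512 * len(dirnames)          # directories are header-only entries
            for name in filenames:
                size = os.path.getsize(os.path.join(root, name))
                padded = ((size + 511) // 512) * 512
                files += 1
                content += size
                overhead += 512 + (padded - size)    # header + padding
        return files, content, overhead

    files, content, overhead = tar_overhead("/usr/portage")
    print(files, "files,", content, "bytes of content,", overhead, "bytes of tar overhead")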

> On a sidenote, SYNC syntax in cvs head is a helluva lot more powerful
> than the current stable format; adding new formats/URI hooks in is doable.
>
> If people are after trying to dodge the cost of untarring and
> rsync'ing for snapshots, well... you're trying to dodge crappy code,
> frankly. The algorithm/approach used there is kind of ass backwards.
>
> There's no reason the intersection of the snapshot tarball's file
> set and the set of files in the portdir can't be computed, and
> all other files ixnayed; then untar directly to the tree.
>
> That would be quite a bit quicker, mainly since it avoids the temp
> untarring and rather wasteful rsync call.
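
Something along these lines, I guess. A rough sketch only: it assumes
the snapshot members are stored relative to the tree root (the real
snapshots prefix everything with portage/), and you'd want to leave
distfiles/ and local overlays alone:

    import os
    import tarfile

    def sync_from_snapshot(snapshot, portdir):
        """Untar the snapshot straight over the existing tree, then delete
        every file the tree has that the snapshot doesn't - no temp dir,
        no rsync."""
        with tarfile.open(snapshot) as tar:
            wanted = {os.path.normpath(m.name) for m in tar.getmembers()}
            tar.extractall(portdir)

        for root, dirnames, filenames in os.walk(portdir, topdown=False):
            for name in filenames:
                path = os.path.join(root, name)
                if os.path.normpath(os.path.relpath(path, portdir)) not in wanted:
                    os.unlink(path)
            if root != portdir and not os.listdir(root):
                os.rmdir(root)   # prune directories that ended up empty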

> Or... just have the repository module run directly off of the tarball,
> with an additional pregenerated index of file -> offset. (That's a
> ways off, but something I intend to try at some point.)
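
That index is cheap to pregenerate with the tarfile module, as long as
the snapshot is kept uncompressed (seeking into a .tar.bz2 wouldn't get
you anywhere). Roughly:

    import tarfile

    def build_offset_index(snapshot):
        """Map member name -> (offset, size) so a file can later be read
        out of the uncompressed snapshot tarball with a single seek."""
        index = {}
        with tarfile.open(snapshot) as tar:
            for member in tar:
                if member.isfile():
                    index[member.name] = (member.offset_data, member.size)
        return index

    def read_member(snapshot, index, name):
        offset, size = index[name]
        with open(snapshot, "rb") as f:
            f.seek(offset)
            return f.read(size)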

Actually, I hacked portage to do this a few years ago. I generated a
.zip[1] of the portdir (~16MB, compared to ~120MB uncompressed). The
server maintained diffs in the following scheme:

- Full snapshot every hour
- Hourly deltas going back 24 hours
- Daily deltas going back a week
- Weekly deltas going back two months

When a user synced, he downloaded a small manifest from the server,
telling him the size and contents of the snapshot and the deltas. Based
on time stamps, he would locally calculate which deltas he needed to
fetch. If the combined size of the deltas was >= the size of the full
snapshot, he would just go for the new snapshot instead.

This system didn't use xdelta, just .zips, but it could.
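
The client-side decision was basically this; the manifest layout below
is made up for the sake of the example, the real thing just listed the
sizes and contents of the snapshot and deltas:

    def plan_fetch(manifest, local_stamp):
        """Pick the chain of deltas leading from our local tree to the newest
        snapshot, or fall back to the full snapshot if the deltas cost as much."""
        chain = sorted((d for d in manifest["deltas"] if d["from"] >= local_stamp),
                       key=lambda d: d["from"])

        # No delta starts exactly at our timestamp, or the deltas add up to
        # at least a full snapshot: just fetch the snapshot instead.
        if not chain or chain[0]["from"] != local_stamp or \
                sum(d["size"] for d in chain) >= manifest["snapshot"]["size"]:
            return [manifest["snapshot"]["uri"]]
        return [d["uri"] for d in chain]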

Locally, everything was stored in /usr/portage.zip (but could live
anywhere), and I hacked portage to read everything straight out of the
.zip file instead of the file system.

Whenever a package was being merged, the ebuild and everything in files/
was extracted, so that the CLI tools (bash, the ebuild script) could get
at them.
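
The reading side boiled down to this idea - a rough reconstruction, not
the actual patch (member names relative to the tree root, and the vim
path is just an example):

    import zipfile

    PORTZIP = "/usr/portage.zip"

    def read_tree_file(path, portzip=PORTZIP):
        """Read e.g. 'app-editors/vim/vim-6.3.ebuild' straight out of the
        zip, without touching the file system."""
        with zipfile.ZipFile(portzip) as z:
            return z.read(path)

    def extract_for_merge(catpkg, destdir, portzip=PORTZIP):
        """Drop the ebuilds and files/ for one package onto disk so bash
        and the ebuild script can get at them."""
        with zipfile.ZipFile(portzip) as z:
            for name in z.namelist():
                if name.startswith(catpkg + "/"):
                    z.extract(name, destdir)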

Performance was not really an issue, since even back then there was some
caching going on. emerge -s, emerge <package> and emerge -pv world were
not appreciably slower. emerge metadata was :/ This may have changed by
now, and unfavourably so.

However, the patch was obviously rather intrusive, and people liked
rsync a lot, so it never went in. Still, sign me up for hacking on the
"sync module", whenever that's going to happen.

The reason I'm playing around with zsync is that it's a lot less
intrusive than my zipfs patch. Essentially, it's a bolt-on that can be
added without modifying portage at all, as long as users don't use
"emerge sync" to sync.

-- Karl T

[1] .zips have a central directory, which makes them faster to search
than tar.gz. Also, they're directly supported by the Python standard
library, and you can read out individual files pretty easily. Any
compression format with similar properties would do, of course.
--
gentoo-dev@g.o mailing list
