Brian Harring wrote:
> On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:


>> If you have (g)cloop installed, it may even be mounted over a
>> compressed loopback. A full ISO of the porttree is ~300MB,
>> compressed it's ~29MB.
>
>
> This part however, isn't. Note the portion of zsync's docs
> referencing doing compressed segment mapping to uncompressed, and
> using a modified gzip that restarts segments occasionally to help with
> this.
>
> If you have gcloop abusing the same gzip tweak, sure, it'll work,
> although I guarantee the comp. -> uncomp. mapping is going to add more
> bandwidth than you'd like (have to get the whole compressed segment,
> not just the bytes that have changed). If you're *not* doing any
> compressed stream resetting/restarting, read below (it gets worse :)

> Squashfs is even worse- you lose the compressed -> uncompressed mapping.
> You change a single byte in a compressed stream, and likely all bytes
> after that point are now different. So... without an equivalent to the
> gzip segmenting hack, you're going to pay through the teeth on updates.

Yeah, we noticed that a zsync of a modified squashfs image requires ~50%
of the new file to be downloaded. Not exactly proportional to the change.
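
To illustrate why: a single-byte change near the start of the input typically
perturbs the rest of the deflate stream (different Huffman/match decisions, and
the bit alignment of everything after it shifts), so fixed-size block checksums
stop matching. A quick sketch with zlib; the sample data is made up, but any
realistic input behaves similarly:

import zlib

def common_prefix(a, b):
    """Length of the common prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Compressible, text-like sample data (made up).
data = b"".join(b"DEPEND=sys-apps/portage-%05d\n" % i for i in range(30000))
tweaked = data[:100] + b"X" + data[101:]       # flip a single byte near the front

old, new = zlib.compress(data), zlib.compress(tweaked)
print(len(old), len(new))                      # sizes are nearly identical...
print(common_prefix(old, new), len(old))       # ...but the streams diverge very early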


> So... this basically is applicable (at this point) to snapshots,
> since fundamentally that's what it works on. Couple of flaws/issues
> though.
> Tarball entries are rounded up to the nearest multiple of 512 for the
> file size, plus an additional 512 for the tar header. If for the
> zsync chksum index, you're using blocksizes above (basically) 1kb,
> you lose the ability to update individual files- actually, you already
> lost it, because zsync requires two matching blocks, side by side.
> So that's more bandwidth, beyond just pulling the control file.

Actually, packing the tree in squashfs _without_ compression shaved
about 800 bytes per file. Having a tarball of the porttree is obviously
plain stupid, as the overhead is about as big as the content itself.
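
For reference, tar's per-file cost is a 512-byte header plus zero-padding of
the data up to the next 512-byte boundary (ignoring extended headers). A rough
sketch, using a made-up typical ebuild size:

def tar_overhead(content_size, block=512):
    """Bytes tar adds on top of the file content itself."""
    padding = (-content_size) % block     # pad data up to a block boundary
    return block + padding                # one header block + the padding

# A typical ebuild is well under 1 KiB, so the overhead rivals the content:
print(tar_overhead(700))   # -> 836 bytes of overhead for 700 bytes of content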

> On a sidenote, SYNC syntax in cvs head is a helluva lot more powerful
> than the current stable format; adding new formats/URI hooks in is doable.
>
> If people are after trying to dodge the cost of untarring, and
> rsync'ing for snapshots, well... you're trying to dodge crappy code,
> frankly. The algorithm/approach used there is kind of ass backwards.
>
> There's no reason the intersection of the snapshot's tarball files
> set, and the set of files in the portdir can't be computed, and
> all other files ixnayed; then untar directly to the tree.
>
> That would be quite a bit quicker, mainly since it avoids the temp
> untarring and rather wasteful rsync call.
>
> Or... just have the repository module run directly off of the tarball,
> with an additional pregenerated index of file -> offset. (that's a
> ways off, but something I intend to try at some point).
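
For what it's worth, such an index could be little more than a pickled dict of
member name -> data offset, which Python's tarfile already exposes. A rough
sketch; the snapshot path is made up:

import pickle
import tarfile

def build_index(snapshot="/usr/portage-snapshot.tar"):
    """Pregenerate a file -> (offset, size) index for an uncompressed tarball."""
    index = {}
    with tarfile.open(snapshot, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    with open(snapshot + ".idx", "wb") as out:
        pickle.dump(index, out)
    return index

def read_member(snapshot, index, name):
    """Read one file straight out of the tarball, no extraction step."""
    offset, size = index[name]
    with open(snapshot, "rb") as tar:
        tar.seek(offset)
        return tar.read(size)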

Actually, I hacked portage to do this a few years ago. I generated a
.zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
server maintained diffs in the following scheme:

- Full snapshot every hour
- Deltas hourly back 24 hours
- Deltas daily back a week
- Deltas weekly back two months

When a user synced, he downloaded a small manifest from the server,
telling him the size and contents of the snapshot and deltas. Based on
time stamps, he would locally calculate which deltas he would need to
fetch. If the combined size of the deltas was >= the size of the full
snapshot, he would just go for the new snapshot.
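
The client-side decision boiled down to something like the following minimal
sketch; the manifest layout shown here is made up:

def plan_fetch(manifest, local_ts):
    """Pick the cheaper way to get current: a chain of deltas, or the snapshot.

    Assumed (made-up) manifest layout, with Unix timestamps:
      {"snapshot": {"ts": 1111680000, "size": 16000000},
       "deltas":   [{"from": 1111590000, "to": 1111593600, "size": 40000}, ...]}
    """
    # Walk the delta chain forward from the local tree's timestamp.
    by_from = {d["from"]: d for d in manifest["deltas"]}
    chain, ts = [], local_ts
    while ts in by_from:
        chain.append(by_from[ts])
        ts = chain[-1]["to"]

    delta_size = sum(d["size"] for d in chain)
    reaches_head = ts >= manifest["snapshot"]["ts"]
    if not reaches_head or delta_size >= manifest["snapshot"]["size"]:
        return "snapshot", [manifest["snapshot"]]
    return "deltas", chain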

This system didn't use xdelta, just .zips, but it could have.

Locally, everything was stored in /usr/portage.zip (but could be
anywhere), and I hacked portage to read everything straight out of the
.zip file instead of the file system.
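
The read path is essentially what Python's zipfile module gives you for free;
roughly (the path inside the archive is only an example):

import zipfile

portdb = zipfile.ZipFile("/usr/portage.zip")   # opened once, then reused

def read_tree_file(relpath):
    """Return a file's contents straight from the archive, e.g.
    read_tree_file("app-editors/vim/vim-6.3.ebuild"), without ever
    touching an on-disk tree."""
    return portdb.read(relpath)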

Whenever a package was being merged, the ebuild and everything in files/
were extracted, so that the CLI tools (bash, the ebuild script) could
get at them.

Performance was not really an issue, since even back then there was some
caching going on. emerge -s, emerge <package>, and emerge -pv world were
not appreciably slower. emerge metadata was :/ This may have changed by
now, and unfavourably so.


However, the patch was obviously rather intrusive, and people liked
rsync a lot, so it never went in. That said, sign me up for hacking on
the "sync module", whenever that's gonna happen.


The reason I'm playing around with zsync is that it's a lot less
intrusive than my zipfs patch. Essentially, it's a bolt-on that can be
added without modifying portage at all, as long as users don't use
"emerge sync" to sync.

-- Karl T

[1] .zips have a central directory, which makes them faster to search
than a tar.gz. Also, they're directly supported by the Python standard
library, and you can read out individual files pretty easily. Any
compression format with similar properties would do, of course.
--
gentoo-dev@g.o mailing list