Karl Trygve Kalleberg wrote:
>>So... this basically is applicable (at this point) to snapshots,
>>since fundamentally that's what it works on. Couple of flaws/issues
>>though.
>>Tarball entries are rounded up to the nearest multiple of 512 for the
>>file size, plus an additional 512 for the tar header. If for the
>>zsync chksum index, you're using blocksizes above (basically) 1kb,
>>you lose the ability to update individual files- actually, you already
>>lost it, because zsync requires two matching blocks, side by side.
>>So that's more bandwidth, beyond just pulling the control file.
>
>
> Actually, packing the tree in squashfs _without_ compression shaved
> about 800 bytes per file.
> Having a tarball of the porttree is obviously
> plain stupid, as the overhead is about as big as the content itself.
'cept tarballs _are_ what our snapshots are currently, which is what I
was referencing (I was pointing out why zsync is not going to play nice
with tarballs). I haven't compared squashfs snapshots w/out compression
delta-wise, but I'd expect they're slightly larger (diffball knows about
tarfile structures, and as such can enforce 'locality' for better matches).
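
For anyone wanting to sanity-check the 512-byte rounding, the per-entry
cost in an uncompressed tarball works out to roughly this (a quick
sketch, nothing to do with diffball itself):

import os

TAR_BLOCK = 512  # tar writes a 512-byte header per entry, and pads data to 512-byte records

def tar_entry_size(path):
    # on-archive cost of one file in an uncompressed tarball
    size = os.path.getsize(path)
    padded = ((size + TAR_BLOCK - 1) // TAR_BLOCK) * TAR_BLOCK
    return TAR_BLOCK + padded  # header record + padded data

# a one-line ebuild still costs 1024 bytes in the tarball, so a zsync
# blocksize much above 1kb can never line up with individual files.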

>>Or... just have the repository module run directly off of the tarball,
>>with an additional pregenerated index of file -> offset. (that's a
>>ways off, but something I intend to try at some point).
>
>
> Actually, I hacked portage to do this a few years ago. I generated a
> .zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
> server maintained diffs in the following scheme:
>
> - Full snapshot every hour
> - Deltas hourly back 24 hours
> - Deltas daily back a week
> - Deltas weekly back two months
Elaborate; back from when, the current time/date? Or just version
'leaps' as it were? If you're recalc'ing the delta all the way back for
each hour, the cost adds up.

> When a user synced, he downloaded a small manifest
Define 'small', and what was in the manifest, please.

> from the server,
> telling him the size and contents of the snapshot and deltas. Based on
> time stamps
What about issues with users' clocks being wacky? Yes, systems should
have a correct clock, but rsync (with our current opts) doesn't rely on
mtime checks (iirc). 'Course, just pulling the last timestamp from the
server addresses this...

> he would locally calculate which deltas he would need to
> fetch.
One failing with this I'd see is that in generating a *total*,
tree-snapshot-to-tree-snapshot delta, the unmatched files (files that are
new, or cannot be mapped back via filepath to the older snapshot) can't
be easily diff'ed. Can be worked around, though.
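
Roughly what I'd assume the client-side "which deltas" logic looked like
(a sketch with made-up manifest fields, not your actual code); note it
also sidesteps the local-clock issue by only comparing stamps the server
handed out:

def plan_fetch(manifest, local_stamp):
    # manifest: {'snapshot': {'stamp': ..., 'size': ...},
    #            'deltas': [{'from': ..., 'to': ..., 'size': ...}, ...]}
    # deltas sorted oldest to newest; all stamps come from the server.
    chain, cur = [], local_stamp
    for d in manifest['deltas']:
        if d['from'] == cur:
            chain.append(d)
            cur = d['to']
    if cur != manifest['snapshot']['stamp'] or \
       sum([d['size'] for d in chain]) >= manifest['snapshot']['size']:
        return None  # no usable chain, or chain costs more: grab the full snapshot
    return chain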

> If the size of the deltas were >= size of the full snapshot, just
> go for the new snapshot.
>
> This system didn't use xdelta, just .zips, but it could.
>
> Locally, everything was stored in /usr/portage.zip (but could be
> anywhere), and I hacked portage to read everything straight out the .zip
> file instead of the file system.
Sounds like one helluva hack :)
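
For reference, the read side of that is only a few lines with python's
zipfile module (a minimal sketch of the idea, not the actual patch):

import zipfile

tree = zipfile.ZipFile('/usr/portage.zip', 'r')

def read_ebuild(cp, pv):
    # e.g. read_ebuild('dev-util/diffball', 'diffball-0.6')
    return tree.read('%s/%s.ebuild' % (cp, pv))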

> Whenever a package was being merged, the ebuild and all stuff in files/
> was extracted, so that the cli tools (bash, ebuild script) could get at
> them.
I'd wonder how to integrate gpg/md5'ing of the snapshot into that.
Shouldn't be hard, but would be expensive w/out careful management (ie,
don't re-verify a repo if the repo has been verified once already).
Offhand, this *should* be possible in a clean way with a bit of work.
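
Something along these lines is what I mean by verify-once (a sketch; the
stamp file location is made up, and gpg would just wrap the same check):

import os, hashlib

STAMP = '/var/cache/edb/.snapshot-verified'  # made-up location

def verified(snapshot, good_md5):
    # trust a prior verification as long as the snapshot hasn't been replaced since
    if os.path.exists(STAMP) and os.path.getmtime(STAMP) >= os.path.getmtime(snapshot):
        return True
    if hashlib.md5(open(snapshot, 'rb').read()).hexdigest() != good_md5:
        return False
    open(STAMP, 'w').write(good_md5)
    return True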

> Performance was not really an issue, since already then, there was some
> caching going on. emerge -s, emerge <package>, emerge -pv world was not
> appreciably slower. emerge metadata was :/ This may have changed by now,
> and unfavourably so.
emerge metadata in cvs head *now* pretty much requires 2*nodes in the
new tree; read from the metadata/cache, translate it[1], dump it. While
doing this, build up a dict of invalid metadata on the local system, and
wipe it post metadata transfer. So... uncompressing a file, then
interpreting it, would likely be slower than the current flat list
approach (it's actually pretty speedy in .19 and head). External cache
db? sqlite seems like overkill, and anydbm has concurrency issues for
updates, but since the repo is effectively 'frozen' (the user can't modify
the ebuild), anydbm should suffice.
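
Rough shape of the anydbm route (a sketch; the key/value layout is just
an assumption, not what head actually does):

import anydbm

def dump_cache(entries, path='/var/cache/edb/metadata.db'):
    # entries: {cpv: metadata dict}.  Repo is frozen, so one writer at
    # sync time plus read-only access afterwards dodges the concurrency issue.
    db = anydbm.open(path, 'n')  # 'n': always start with a fresh db
    try:
        for cpv, metadata in entries.items():
            db[cpv] = '\n'.join(['%s=%s' % kv for kv in metadata.items()])
    finally:
        db.close()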

[1] eclass translation- stable stores eclass data per cache entry in two
locations, eclass_db and the cache backend. Had quite a few bugs with
this, and it's kind of screwy in design. Head stores *all* of that
entry's eclass data in the cache backend; thus, going from
metadata/cache's bare INHERITED="eutils" (fex), you have to translate it
to a _full_ eclass entry for the cache backend, eutils\tlocation\tmtime
(roughly; the code isn't in front of me).
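
In code, the translation is roughly (a sketch from memory; head's exact
cache format may differ):

import os

def expand_inherited(inherited, eclass_dirs):
    # turn metadata/cache's INHERITED="eutils foo" into the full
    # per-eclass entries head's backend wants: name \t location \t mtime
    entries = []
    for name in inherited.split():
        for d in eclass_dirs:
            path = os.path.join(d, name + '.eclass')
            if os.path.exists(path):
                entries.append('%s\t%s\t%d' % (name, d, os.path.getmtime(path)))
                break
    return entries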

> However, the patch was obviously rather intrusive, and people liked
> rsync a lot, so it never went in.

> However, sign me up for hacking on the
> "sync module", whenever that's gonna happen.
gentoo-src/portage/sync <-- cvs head/main.

'transports' (fetchcommand/resumecommand) are also abstracted into
gentoo-src/transports/fetchcommand (iirc). There's also a bundled
httplib/ftplib that needs to be put to better use in a binhost-refactored
repository db, in gentoo-src/transports/bundled_lib (again, iirc; atm I'm
stuck in windows land due to the holidays).

> The reason I'm playing around with zsync, is that it's a lot less
> intrusive than my zipfs patch.
URL for the zipfs patch?

> Essentially, it's a bolt-on that can be
> added without modifying portage at all, as long as users don't use
> "emerge sync" to sync.
emerge sync should use the sync module bound to each repository (not
finished yet, but intended). The sync refactoring code that's in cvs head
already is the start of this; each sync instance just has a common hook
you call. So... emerge sync is viable, assuming an appropriate sync
class could be defined.
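
Shape of such a class would be basically this (a sketch; the class/method
names are mine, and the actual hook in cvs head may differ):

import os

class SyncBase:
    def __init__(self, snapshot, uri):
        self.snapshot, self.uri = snapshot, uri

    def sync(self):
        raise NotImplementedError

class ZsyncSync(SyncBase):
    # hypothetical zsync-backed syncer: shell out to the zsync client,
    # seeding from the existing local snapshot
    def sync(self):
        cmd = 'zsync -i %s -o %s %s' % (self.snapshot, self.snapshot, self.uri)
        return os.system(cmd) == 0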

> [1] .zips have a central directory, which makes it faster to search than
> tar.gz. Also, they're directly supported by the python library, and you
> can read out individual files pretty easily. Any compression format with
> similar properties would do, of course.
I was commenting on uncompressed tarballs, with a pregenerated file ->
offset lookup. Working within *one* compressed stream (which is what a
tar.gz is) wasn't the intention; doing random seeks in it isn't really
viable. To head off any "use gzseek" suggestions from others: gzseek
either reads forward, or resets the stream and starts from the ground up.
Aside from that, tarballs, too, are directly supported (tarfile) :)
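
i.e. the file -> offset index is about this much work (a sketch; tarfile's
TarInfo carries the data offset, which is all the pregenerated index needs):

import tarfile

def build_index(snapshot):
    # one pass over the uncompressed tarball: member name -> (data offset, size)
    tar = tarfile.open(snapshot, 'r:')  # 'r:' == no compression, plain seeks work
    index = {}
    for member in tar.getmembers():
        if member.isfile():
            index[member.name] = (member.offset_data, member.size)
    tar.close()
    return index

def read_member(snapshot, index, name):
    # random access via the pregenerated index; no tar parsing at read time
    offset, size = index[name]
    f = open(snapshot, 'rb')
    f.seek(offset)
    data = f.read(size)
    f.close()
    return data
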
~brian
--
gentoo-dev@g.o mailing list