Gentoo Archives: gentoo-dev

From: Brian Harring <ferringb@g.o>
To: gentoo-dev@××××××××××××.org
Subject: Re: [gentoo-dev] Proposal for an alternative portage tree sync method
Date: Sun, 27 Mar 2005 19:03:02
In Reply to: Re: [gentoo-dev] Proposal for an alternative portage tree sync method by Karl Trygve Kalleberg
Karl Trygve Kalleberg wrote:
>>So... this basically is applicable (at this point) to snapshots,
>>since fundamentally that's what it works on. Couple of flaws/issues
>>though.
>>Tarball entries are rounded up to the nearest multiple of 512 for the
>>file size, plus an additional 512 for the tar header. If for the
>>zsync chksum index you're using blocksizes above (basically) 1kb,
>>you lose the ability to update individual files- actually, you already
>>lost it, because zsync requires two matching blocks, side by side.
>>So that's more bandwidth, beyond just pulling the control file.
>
> Actually, packing the tree in squashfs _without_ compression shaved
> about 800 bytes per file. Having a tarball of the porttree is obviously
> plain stupid, as the overhead is about as big as the content itself.
'cept tarballs _are_ what our snapshots currently are, which is what I
was referencing (was pointing out why zsync is not going to play nice
with tarballs). I haven't compared squashfs snapshots w/out compression
delta-wise, but I'd expect they're slightly larger (diffball knows about
tarfile structures, and as such can enforce 'locality' for better matches).

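The 512-byte rounding mentioned above is easy to quantify; a quick sketch (the ~1300-byte file size is just an illustrative guess at a small ebuild):

```python
def tar_member_cost(size):
    """Bytes a file of `size` occupies in an uncompressed tar archive:
    one 512-byte header plus the data padded up to a 512-byte boundary."""
    return 512 + ((size + 511) // 512) * 512

# A small ~1300-byte ebuild costs a full 2048 bytes in the tarball:
print(tar_member_cost(1300))  # 2048
```

For trees full of small files, that per-member overhead is why the uncompressed tarball ends up so much larger than the content.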
>>Or... just have the repository module run directly off of the tarball,
>>with an additional pregenerated index of file -> offset. (that's a
>>ways off, but something I intend to try at some point).
>
> Actually, I hacked portage to do this a few years ago. I generated a
> .zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
> server maintained diffs in the following scheme:
>
> - Full snapshot every hour
> - Deltas hourly back 24 hours
> - Deltas daily back a week
> - Deltas weekly back two months
Elaborate; back from when, the current time/date? Or just version
'leaps' as it were? If you're recalc'ing the delta all the way back for
each hour, the cost adds up.

> When a user synced, he downloaded a small manifest
Define small, and what was in the manifest, please.

> from the server,
> telling him the size and contents of the snapshot and deltas. Based on
> time stamps
What about issues with users' clocks being wacky? Yes, systems should
have a correct clock, but rsync (with our current opts) doesn't rely on
mtime checks (iirc). 'Course, just pulling the last timestamp from the
server addresses this...

> he would locally calculate which deltas he would need to
> fetch.
One failing I'd see with this is that in generating a *total* tree-
snapshot-to-tree-snapshot delta, the unmatched files (files that are
new, or cannot be mapped back via filepath to the older snapshot) can't
be easily diff'ed. Can be worked around, though.

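The client-side delta selection Karl describes is simple to sketch; the manifest layout here is entirely hypothetical, since the original patch isn't shown:

```python
def plan_fetch(local_stamp, manifest):
    """Pick the delta chain from local_stamp up to the newest snapshot,
    or fall back to the full snapshot when the chain is no cheaper.
    `manifest` is a made-up structure:
      {'snapshot': {'stamp': int, 'size': int},
       'deltas': [{'from': int, 'to': int, 'size': int}, ...]}"""
    chain, stamp = [], local_stamp
    by_from = {d['from']: d for d in manifest['deltas']}
    while stamp != manifest['snapshot']['stamp']:
        d = by_from.get(stamp)
        if d is None:  # gap in the chain: only the full snapshot helps
            return ['snapshot']
        chain.append(d)
        stamp = d['to']
    if sum(d['size'] for d in chain) >= manifest['snapshot']['size']:
        return ['snapshot']  # deltas cost as much as the snapshot itself
    return chain

manifest = {'snapshot': {'stamp': 4, 'size': 100},
            'deltas': [{'from': 1, 'to': 2, 'size': 30},
                       {'from': 2, 'to': 3, 'size': 30},
                       {'from': 3, 'to': 4, 'size': 30}]}
print([d['to'] for d in plan_fetch(1, manifest)])  # [2, 3, 4]
```

An empty result means the local tree is already current; a client too far behind (no delta starting at its stamp) falls back to the full snapshot, matching the size rule quoted below.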
> If the size of the deltas were >= the size of the full snapshot, just
> go for the new snapshot.
>
> This system didn't use xdelta, just .zips, but it could.
>
> Locally, everything was stored in /usr/ (but could be
> anywhere), and I hacked portage to read everything straight out of the
> .zip file instead of the file system.
Sounds like one helluva hack :)

> Whenever a package was being merged, the ebuild and all stuff in files/
> was extracted, so that the cli tools (bash, ebuild script) could get at
> them.
I'd wonder how to integrate gpg/md5'ing of the snapshot into that.
Shouldn't be hard, but would be expensive w/out careful management (ie,
don't re-verify a repo if the repo has been verified once already).
Offhand, this *should* be possible in a clean way with a bit of work.

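One cheap way to avoid re-verifying is a stamp file recording the digest of the last snapshot that passed; the scheme and paths here are my own assumption, not anything in portage:

```python
import hashlib
import os

def verified(snapshot, stamp=None):
    """Skip re-verification if a stamp file already records the digest
    of this exact snapshot. Hashing the whole file in memory is fine
    for a sketch; real code would stream it."""
    stamp = stamp or snapshot + '.verified'
    with open(snapshot, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    if os.path.exists(stamp) and open(stamp).read() == digest:
        return True  # this snapshot was already verified once
    # ... run the expensive gpg/md5 verification of contents here ...
    with open(stamp, 'w') as f:
        f.write(digest)
    return True
```

The second and later calls on an unchanged snapshot cost one hash pass instead of a full signature check.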
> Performance was not really an issue, since already then there was some
> caching going on. emerge -s, emerge <package>, emerge -pv world were not
> appreciably slower. emerge metadata was :/ This may have changed by now,
> and unfavourably so.
emerge metadata in cvs head *now* pretty much requires 2*nodes in the
new tree: read from metadata/cache, translate it[1], dump it. While
doing this, build up a dict of invalid metadata on the local system, and
wipe it post metadata-transfer. So... uncompressing a file, then
interpreting it, would likely be slower than the current flat-list
approach (it's actually pretty speedy in .19 and head). External cache
db? sqlite seems like overkill, and anydbm has concurrency issues for
updates, but since the repo is effectively 'frozen' (the user can't
modify the ebuild), anydbm should suffice.

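A minimal sketch of that frozen-cache idea with the stdlib dbm module (anydbm's modern name); the key/value layout is made up:

```python
import dbm
import os
import tempfile

# Build the cache once, at snapshot-generation time...
path = os.path.join(tempfile.mkdtemp(), 'metadata.db')
with dbm.open(path, 'c') as db:
    db['sys-apps/portage-2.0.51'] = 'DEPEND=... SLOT=0 ...'

# ...then open it read-only on the client. Since the repo is frozen,
# there are never concurrent writers, so anydbm's update problems
# simply don't arise.
with dbm.open(path, 'r') as db:
    print(db['sys-apps/portage-2.0.51'].decode())
```

Read-only open of a prebuilt db keeps per-lookup cost low without pulling in a full sql engine.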
[1] eclass translation- stable stores eclass data per cache entry in two
locations, eclass_db and the cache backend. Had quite a few bugs with
this, and it's kind of screwy in design. Head stores *all* of an
entry's eclass data in the cache backend; thus, going from
metadata/cache's bare INHERITED="eutils" (fex), you have to translate it
to a _full_ eclass entry for the cache backend, eutils\tlocation\tmtime
(roughly, code isn't in front of me).

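A rough sketch of that translation step; the tab-separated field layout is only Brian's from-memory approximation, so the names and separators below are guesses:

```python
def expand_inherited(inherited, eclass_db):
    """Expand the flat INHERITED="eutils ..." value from metadata/cache
    into full per-eclass entries (name\tlocation\tmtime) for the head
    cache backend. `eclass_db` is a hypothetical name -> (location,
    mtime) lookup."""
    entries = []
    for name in inherited.split():
        location, mtime = eclass_db[name]
        entries.append('%s\t%s\t%d' % (name, location, mtime))
    return '\t'.join(entries)

eclass_db = {'eutils': ('/usr/portage/eclass', 1111884000)}
print(expand_inherited('eutils', eclass_db))  # tab-separated full entry
```

The cost of that lookup per entry is part of why the transfer touches roughly 2*nodes.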
> However, the patch was obviously rather intrusive, and people liked
> rsync a lot, so it never went in.
>
> However, sign me up for hacking on the
> "sync module", whenever that's gonna happen.
gentoo-src/portage/sync <-- cvs head/main.

'transports' (fetchcommand/resumecommand) are also abstracted into
gentoo-src/transports/fetchcommand (iirc). There's also a bundled
httplib/ftplib that needs to be put to better use in a binhost-refactored
repository db, in gentoo-src/transports/bundled_lib (again, iirc; atm
stuck in windows land due to the holidays).

> The reason I'm playing around with zsync is that it's a lot less
> intrusive than my zipfs patch.
URL for the zipfs patch?

> Essentially, it's a bolt-on that can be
> added without modifying portage at all, as long as users don't use
> "emerge sync" to sync.
emerge sync should use the sync module bound to each repository (not
finished, but intended). The sync-refactoring code that's in cvs head
already is the start of this; each sync instance just has a common hook
you call. So... emerge sync is viable, assuming an appropriate sync
class could be defined.

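The per-repository hook could look roughly like this; class and method names are hypothetical, not portage's actual API:

```python
class SyncBase:
    """Hypothetical common interface each repository's sync is bound to."""
    def sync(self):
        raise NotImplementedError

class RsyncRepo(SyncBase):
    def __init__(self, uri):
        self.uri = uri
    def sync(self):
        # real code would invoke rsync here; stubbed for the sketch
        return 'rsync %s' % self.uri

class SnapshotDeltaRepo(SyncBase):
    def __init__(self, uri):
        self.uri = uri
    def sync(self):
        # would fetch manifest + deltas here, per Karl's scheme
        return 'delta-sync %s' % self.uri

# emerge sync then just iterates repositories and calls the common hook:
for repo in (RsyncRepo('rsync://rsync.gentoo.org/gentoo-portage'),
             SnapshotDeltaRepo('http://example.invalid/snapshots')):
    print(repo.sync())
```

With that shape, zsync (or the zip scheme) becomes just another subclass rather than a bolt-on that bypasses emerge sync.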
> [1] .zips have a central directory, which makes it faster to search than
> tar.gz. Also, they're directly supported by the python library, and you
> can read out individual files pretty easily. Any compression format with
> similar properties would do, of course.
Was commenting on uncompressed tarballs, with a pregenerated file ->
offset lookup. Working within *one* compressed stream (which a tar.gz
is) wasn't the intention; doing random seeks in it isn't really viable.
Heading off any "use gzseek" from others: gzseek either reads forward,
or resets the stream and starts from the ground up. Aside from that,
tarballs, too, are directly supported (tarfile) :)
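That pregenerated index is straightforward with the stdlib tarfile module; a sketch on an in-memory, uncompressed tarball (member name and contents are made up):

```python
import io
import tarfile

# Build a tiny uncompressed tarball in memory for the sketch.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tar:
    data = b'DESCRIPTION="example"\n'
    info = tarfile.TarInfo('app-misc/foo/foo-1.0.ebuild')
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Pregenerate the file -> (offset, size) index with one linear pass...
buf.seek(0)
index = {}
with tarfile.open(fileobj=buf) as tar:
    for m in tar:
        index[m.name] = (m.offset_data, m.size)

# ...after which any file is one seek + one read, no scanning.
offset, size = index['app-misc/foo/foo-1.0.ebuild']
buf.seek(offset)
print(buf.read(size).decode())
```

The index itself is tiny and can be shipped alongside the snapshot, so clients never pay the linear scan.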
~brian
--
gentoo-dev@g.o mailing list