Re: [gentoo-dev] Proposal for an alternative portage tree sync method - gentoo-dev

From:	Brian Harring <ferringb@g.o>
To:	gentoo-dev@××××××××××××.org
Subject:	Re: [gentoo-dev] Proposal for an alternative portage tree sync method
Date:	Sun, 27 Mar 2005 19:03:02
Message-Id:	`42470379.1040609@gentoo.org`
In Reply to:	Re: [gentoo-dev] Proposal for an alternative portage tree sync method by Karl Trygve Kalleberg

Karl Trygve Kalleberg wrote:
>>So... this basically is applicable (at this point) to snapshots,
>>since fundamentally that's what it works on. Couple of flaws/issues
>>though.
>>Tarball entries are rounded up to the nearest multiple of 512 for the
>>file size, plus an additional 512 for the tar header. If for the
>>zsync chksum index, you're using blocksizes above (basically) 1kb,
>>you lose the ability to update individual files- actually, you already
>>lost it, because zsync requires two matching blocks, side by side.
>>So that's more bandwidth, beyond just pulling the control file.
>
>
> Actually, packing the tree in squashfs _without_ compression, shaved
> about 800 bytes per file.
> Having a tarball of the porttree is obviously
> plain stupid, as the overhead is about as big as the content itself.
'cept tarballs _are_ what our snapshots are currently, which is what I
was referencing (was pointing out why zsync is not going to play nice
with tarballs). I haven't compared squashfs snapshots w/out compression
delta-wise, but I'd expect they're slightly larger (diffball knows about
tarfile structures, and as such can enforce 'locality' for better matches).
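
To put numbers on the tar overhead mentioned above, here is a rough sketch of the arithmetic (one 512-byte header plus data padded up to a multiple of 512), checked against an archive built in memory with Python's tarfile. The file names and sizes are made up for illustration; they are not measurements of the real tree.

```python
import io
import tarfile

BLOCK = 512  # tar block size


def tar_entry_size(file_size):
    """Bytes one regular file occupies in a tar archive: header + padded data."""
    padded = -(-file_size // BLOCK) * BLOCK  # round up to a multiple of 512
    return BLOCK + padded


def build_tar(files):
    """Build an uncompressed ustar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical members: a 700-byte and a 100-byte file.
archive = build_tar({"a.ebuild": b"x" * 700, "b.ebuild": b"y" * 100})
with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
    members = tar.getmembers()

# Distance between the two headers is what the first member really costs.
gap = members[1].offset - members[0].offset
print(gap, tar_entry_size(700))
```

So a 700-byte file takes 1536 bytes in the archive, and any checksum block larger than about 1kb will straddle more than one small member.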

>>Or... just have the repository module run directly off of the tarball,
>>with an additional pregenerated index of file -> offset. (that's a
>>ways off, but something I intend to try at some point).
>
>
> Actually, I hacked portage to do this a few years ago. I generated a
> .zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
> server maintained diffs in the following scheme:
>
> - Full snapshot every hour
> - Deltas hourly back 24 hours
> - Deltas daily back a week
> - Deltas weekly back two months
Elaborate; back from when, the current time/date? Or just version
'leaps' as it were? If you're recalc'ing the delta all the way back for
each hour, the cost adds up.

> When a user synced, he downloaded a small manifest
Define small, and what was in the manifest please.

> from the server,
> telling him the size and contents of the snapshot and deltas. Based on
> time stamps
What about issues with the user's clock being wacky? Yes, systems should
have a correct clock, but rsync (with our current opts) doesn't rely on
mtime checks (iirc). Course just pulling the last timestamp from the
server addresses this...

> he would locally calculate which deltas he would need to
> fetch.
One failing with this I'd see is that in generating a *total*, tree
snapshot to tree snapshot delta, the unmatched files (files that are
new, or cannot be mapped back via filepath to the older snapshot) can't
be easily diff'ed. Can be worked around though.

> If the size of the deltas were >= size of the full snapshot, just
> go for the new snapshot.
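
A minimal sketch of the client-side choice described so far, under loud assumptions: the thread does not say what the manifest contains or whether deltas chain from one snapshot to the next, so the `Delta` record, the chaining, and the greedy "take the longest jump" rule are all hypothetical. Timestamps here are the ones issued by the server (the one stored with the local snapshot and the one in the manifest), so the local clock never enters into it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    from_ts: int  # server timestamp of the snapshot the delta applies to
    to_ts: int    # server timestamp of the snapshot it produces
    size: int     # download size in bytes


def plan_sync(have_ts, head_ts, snapshot_size, deltas):
    """Return ("deltas", [Delta, ...]) or ("snapshot", [])."""
    by_start = {}
    for d in deltas:
        # Ignore deltas that go backwards or overshoot the current head.
        if have_ts <= d.from_ts < d.to_ts <= head_ts:
            by_start.setdefault(d.from_ts, []).append(d)

    chain, total, ts = [], 0, have_ts
    while ts != head_ts:
        candidates = by_start.get(ts)
        if not candidates:
            return ("snapshot", [])  # no delta path from what we have
        step = max(candidates, key=lambda d: d.to_ts)  # greedy: longest jump
        chain.append(step)
        total += step.size
        ts = step.to_ts

    # Karl's rule: deltas at least as large as the snapshot are not worth it.
    if chain and total >= snapshot_size:
        return ("snapshot", [])
    return ("deltas", chain)
```

The greedy walk can miss a valid path that needs a shorter first step; a real implementation would want a proper shortest-path search over the delta graph.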
>
> This system didn't use xdelta, just .zips, but it could.
>
> Locally, everything was stored in /usr/portage.zip (but could be
> anywhere), and I hacked portage to read everything straight out of the
> .zip file instead of the file system.
Sounds like one helluva hack :)

> Whenever a package was being merged, the ebuild and all stuff in files/
> was extracted, so that the cli tools (bash, ebuild script) could get at
> them.
I'd wonder how to integrate gpg/md5'ing of the snapshot into that.
Shouldn't be hard, but would be expensive w/out careful management (ie,
don't re-verify a repo if the repo has been verified once already).
Offhand, this *should* be possible in a clean way with a bit of work.
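
One way the "don't re-verify" part could look, as a sketch only: hash the snapshot once, then leave a small stamp file recording its size and mtime so later runs can skip the expensive pass. The function names and the JSON stamp format are invented for illustration, the gpg side is not shown, and a size/mtime stamp is a shortcut for avoiding work, not a security guarantee.

```python
import hashlib
import json
import os


def file_digest(path, algo="md5", chunk=1 << 20):
    """Hash a file in chunks so a large snapshot is never held in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def verify_once(snapshot, expected, stamp_path, algo="md5"):
    """Return True if snapshot matches expected; rehash only if the stamp is stale."""
    st = os.stat(snapshot)
    key = {"size": st.st_size, "mtime_ns": st.st_mtime_ns,
           "digest": expected, "algo": algo}
    try:
        with open(stamp_path) as f:
            if json.load(f) == key:
                return True  # verified earlier and the file looks unchanged
    except (OSError, ValueError):
        pass  # missing or unreadable stamp: fall through and verify
    if file_digest(snapshot, algo) != expected:
        return False
    with open(stamp_path, "w") as f:
        json.dump(key, f)
    return True
```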

> Performance was not really an issue, since already then, there was some
> caching going on. emerge -s, emerge <package>, emerge -pv world was not
> appreciably slower. emerge metadata was :/ This may have changed by now,
> and unfavourably so.
emerge metadata in cvs head *now* pretty much requires 2*nodes in the
new tree; read from the metadata/cache, translate it[1], dump it. While
doing this, build up a dict of invalid metadata on the local system,
wipe it post metadata transfer. So... uncompressing a file, then
interpreting it, would likely be slower than the current flat list
approach (it's actually pretty speedy in .19 and head). External cache
db? sqlite seems like overkill, and anydbm has concurrency issues for
updates, but since the repo is effectively 'frozen' (user can't modify
the ebuild), anydbm should suffice.
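
A small sketch of the frozen-repo cache idea, assuming Python 3, where anydbm lives on as the `dbm` module. The database is written in one pass when the snapshot is installed and only ever opened read-only afterwards, which sidesteps the update-concurrency problem. The key and value layout (cpv string to a metadata string) is an assumption for illustration, not portage's cache format.

```python
import dbm


def build_cache(path, entries):
    """Write all entries in one pass; flag 'n' always starts a fresh database."""
    with dbm.open(path, "n") as db:
        for cpv, metadata in entries.items():
            db[cpv.encode()] = metadata.encode()


def lookup(path, cpv):
    """Open read-only; nothing writes once the snapshot is in place."""
    with dbm.open(path, "r") as db:
        value = db.get(cpv.encode())
    return None if value is None else value.decode()
```

Usage would be along the lines of `build_cache("/var/cache/edb/snapshot", entries)` after a sync and `lookup(...)` everywhere else; that path is made up.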

[1] eclass translation- stable stores eclass data per cache entry in two
locations, eclass_db, and cache backend. Had quite a few bugs with
this, and it's kind of screwy in design. Head stores *all* of that
entry's eclass data in the cache backend; thus going from
metadata/cache's just INHERITED="eutils" (fex), you have to translate it
to a _full_ eclass entry for the cache backend, eutils\tlocation\tmtime
(roughly, code isn't in front of me).
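
For illustration, the translation step might look something like this. It takes the name\tlocation\tmtime layout at face value even though it is only given "roughly" above, and both the function name and the use of the eclass directory as the location are guesses, not the code in cvs head.

```python
import os


def expand_inherited(inherited, eclass_dir):
    """Expand a space-separated INHERITED value into tab-separated entries.

    Each entry is name<TAB>location<TAB>mtime, where mtime is read from
    <eclass_dir>/<name>.eclass on the local system.
    """
    entries = []
    for name in inherited.split():
        path = os.path.join(eclass_dir, name + ".eclass")
        mtime = int(os.stat(path).st_mtime)
        entries.append("\t".join((name, eclass_dir, str(mtime))))
    return entries
```

This is also where the cost shows up: one stat per inherited eclass per cache entry, unless the mtimes are looked up once and reused.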

> However, the patch was obviously rather intrusive, and people liked
> rsync a lot, so it never went in.

> However, sign me up for hacking on the
> "sync module", whenever that's gonna happen.
gentoo-src/portage/sync <-- cvs head/main.

'transports' (fetchcommand/resumecommand) are also abstracted into
gentoo-src/transports/fetchcommand (iirc). There's also a bundled
httplib/ftplib that needs to be put to better use in a binhost
refactored repository db, in
gentoo-src/transports/bundled_lib (again, iirc, atm stuck in windows
land due to the holidays).


> The reason I'm playing around with zsync, is that it's a lot less
> intrusive than my zipfs patch.
URL for zipfs patch?

> Essentially, it's a bolt-on that can be
> added without modifying portage at all, as long as users don't use
> "emerge sync" to sync.
emerge sync should use the sync module bound to each repository (not
finished, intended). The sync refactoring code that's in cvs head
already is the start of this; each sync instance just has a common hook
you call. So... emerge sync is viable, assuming an appropriate sync
class could be defined.
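
The shape of that idea, as a hypothetical sketch: every backend implements the same `sync()` hook, and emerge sync just calls it on whatever instance each repository is bound to. None of these class or function names come from gentoo-src/portage/sync, and the backends here only return the command they would run rather than running it.

```python
class Syncer:
    """Base class: every sync backend exposes the same sync() hook."""

    def __init__(self, uri, location):
        self.uri = uri
        self.location = location

    def sync(self):
        raise NotImplementedError


class RsyncSyncer(Syncer):
    def sync(self):
        # Returned, not executed, to keep the sketch side-effect free.
        return ["rsync", "-a", "--delete", self.uri, self.location]


class ZsyncSyncer(Syncer):
    def sync(self):
        # zsync takes the URL of the control file; run from self.location.
        return ["zsync", self.uri]


def emerge_sync(repositories):
    """Call the common hook on the syncer bound to each repository."""
    return {name: syncer.sync() for name, syncer in repositories.items()}
```

With something like this in place, a zsync-based method is just one more class, and users keep typing emerge sync.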

> [1] .zips have a central directory, which makes them faster to search than
> tar.gz. Also, they're directly supported by the python library, and you
> can read out individual files pretty easily. Any compression format with
> similar properties would do, of course.
Was commenting on uncompressed tarballs, with a pregenerated file ->
offset lookup. Working within *one* compressed stream (which a tar.gz
is) wasn't the intention. Doing random seeks in it isn't really viable.
Heading off any "use gzseek" by others, gzseek either reads forward,
or resets the stream, and starts from the ground up. Aside from that,
tarballs, too, are directly supported (tarfile) :)
~brian
--
gentoo-dev@g.o mailing list
