Gentoo Archives: gentoo-soc

From: "Robert R. Russell" <robert@××××××××××.com>
To: gentoo-soc@l.g.o
Subject: Re: [gentoo-soc] Questions about Cache Sync idea for 2010 soc
Date: Tue, 09 Mar 2010 23:55:19
Message-Id: 20100309235548.GA28974@gahnak.rrbrussell.com
In Reply to: Re: [gentoo-soc] Questions about Cache Sync idea for 2010 soc by Zac Medico
1 On Tue, Mar 09, 2010 at 10:42:38AM -0800, Zac Medico wrote:
2 > On 03/08/2010 09:21 AM, Robert R. Russell wrote:
3 > > The cache sync project[1] wants a way to generate portage's cache
4 > > on the portage tree and/or any chosen overlay and then distribute
5 > > that cache by some method. Correct?
6 >
7 > Well, you can already do that with the egencache program that's
8 > included with portage. I think the gist of the "cache sync" idea is
9 > that you should be able to download the cache for dependency
10 > calculations, and defer the download of the source package until
11 > after the dependency calculation. For doing something like that, a
12 > portage tree is probably not very suitable since the tree can change
13 > rapidly and the cache may invalidate quickly. If the cache and the
14 > source package will be distributed separately, it might be more
15 > practical to make something like a source RPM that contains an
16 > ebuild and eclasses. Many of these source packages could be
17 > distributed in a repository that is independent of the portage tree,
18 > and its cache may be valid for a longer period of time.
19 >
20
21 I did not know about the egencache program. So I got the wrong initial
22 impression of the project's goals, no problem. The goal is much
23 simpler than my initial impression of it was.
24
25 The worst case change I could see with only a partial copy of the
26 portage tree available locally would be the complete removal of an
27 ebuild between the last sync time and the attempt to build that ebuild. By
28 complete removal, I mean the deletion of the ebuild from the tree and
29 removal of the tar-ball from the Gentoo mirror infrastructure. The
30 other common problem would be an incomplete or inaccurate manifest of
31 the ebuild, source tar-balls, and in tree patches. This problem is
32 usually eliminated by re-syncing the tree. So the two most likely
33 sources of problems are already seen in the wild with the full portage tree
34 and have known workarounds.
35
36 Talking about source RPMs, do you mean something like a tar-ball of
37 the ebuild with its associated patches, eclasses, and other directly
38 dependent data, but no source code? This ebuild tar-ball is then
39 fetched after dependency calculation is made and it provides the
40 instructions for building, downloading, and installing the package
41 from the source tar-ball. That sounds like a Gentoo style replacement
42 for source RPMs.
43
44 >
45 > > This project sounds very similar to an idea I have been toying
46 > > around with for a bit, but I have some questions before I apply
47 > > for this project.
48 > >
49 > > How well documented is the current cache format portage uses?
50 >
51 > It's not very well documented. You might try experimenting with the
52 > egencache program to get a feel for how it works. Cache is generated
53 > by sourcing ebuilds, and it's stored in /var/cache/edb/dep. It's
54 > validated by comparing ebuild and eclass timestamps to those that
55 > are saved in the cache entry. After a complete cache entry is
56 > generated for /var/cache/edb/dep, an incomplete cache entry (lacking
57 > eclass timestamps, since the format hasn't been extended to support
58 > them yet) is written into $PORTDIR/metadata/cache. There is a
59 > discussion about extending the format to include eclass digests here:
60 >
61 > http://archives.gentoo.org/gentoo-dev/msg_cfa80e33ee5fa6f854120ddfb9b468b3.xml
62 >
63 > > What restrictions if any would be placed on extending the current cache format?
64 >
65 > It has to be backward compatible. If we want to change the format in
66 > a backward incompatible way, for example by combining the whole
67 > cache into a single text file, we'll have to distribute both formats
68 > until users have had time to migrate to a package manager that
69 > supports the new format.
70 >
71
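The timestamp check described in the quoted text above (compare the ebuild and eclass timestamps on disk to those saved in the cache entry) could be sketched roughly as follows. This is only an illustration and not portage's actual code or cache format: the function name cache_entry_is_valid and the plain path-to-mtime mapping are assumptions made up for this example.

```python
import os
from typing import Mapping


def cache_entry_is_valid(recorded_mtimes: Mapping[str, int]) -> bool:
    """Return True if every file still has the mtime saved in the cache entry.

    recorded_mtimes is a hypothetical mapping from the path of the ebuild
    (and of each eclass it inherits) to the integer mtime that was stored
    when the cache entry was generated.
    """
    for path, recorded in recorded_mtimes.items():
        try:
            current = int(os.stat(path).st_mtime)
        except OSError:
            # The ebuild or eclass is gone, so the entry cannot be trusted.
            return False
        if current != recorded:
            # The file changed since the entry was written.
            return False
    return True
```

An entry that lacks eclass timestamps, like the ones in $PORTDIR/metadata/cache, could only be checked against the ebuild itself with a scheme like this.
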
72 I think that any change to support the ebuild tar-ball format would
73 require the inclusion of some sort of cryptographic hash of the ebuild
74 tar-ball into the cache format. Another solution might be distributing
75 a large pile of public key signatures with the cache and then
76 validating the signature of the ebuild tar-ball. With the exception of
77 package manager support the cryptographic signature method is probably
78 the least intrusive method. Well, it is at first glance. I might change my
79 mind on that.
80
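As a rough sketch of the hash idea above, and assuming the cache entry simply stored a SHA-256 hex digest of the ebuild tar-ball, validation after download might look like the code below. The choice of SHA-256, the function name tarball_matches_digest, and the idea of a digest field in the cache are all assumptions for illustration; nothing like this exists in the current cache format.

```python
import hashlib


def tarball_matches_digest(path: str, expected_sha256: str,
                           chunk_size: int = 65536) -> bool:
    """Compare a downloaded file's SHA-256 against the digest from the cache."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large tar-balls are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

The signature approach would replace the digest comparison with a public key signature check, at the cost of distributing and managing keys.
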
81 >
82 > > How well documented is the ebuild file format?
83 >
84 > It's pretty well documented by PMS. You can get that by installing
85 > app-doc/pms. For something that's much shorter and less
86 > comprehensive, there's `man 5 ebuild`.
87 >
88 > > How much of the ebuild is essential for portage to create a valid
89 > > cache entry?
90 >
91 > The whole ebuild and any eclasses that it inherits.
92 >
93 > > How stable and well documented is the format of the cache
94 > > essential pieces of an ebuild?
95 >
96 > It's very stable because it has to be backward compatible. Breaking
97 > compatibility would be a severe problem because dependency
98 > calculations are very slow unless there is a valid/compatible cache
99 > available.
100 >
101
102 I think that keeping the slim tree package manager cache format
103 compatible with the full tree package manager cache format is not
104 going to be easy, mainly because of the amount of new data needed in
105 the slim tree variant of the cache.
106
107 Stuff like:
108 1. Repository -- Is this cache information from the main tree or
109 from an overlay?
110 2. Hash or signature of the ebuild tar-ball -- How do I validate
111 whether the tar-ball I downloaded is OK?
112 3. Tags -- If the tags soc project is accepted then they will need
113 to be cached for searching as well.
114 4. Speed improvements -- Any required changes that can improve the
115 performance of searches and the like.
116 5. Change tracking -- Any cache format for a slim tree will need to
117 be able to update from one revision to another easily and with
118 as little bandwidth as reasonably possible.
119
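To make items 1, 2, 3, and 5 of the list above a little more concrete, here is a minimal sketch of what a slim tree cache record and a revision-to-revision update might look like. Every name in it (SlimCacheEntry, its fields, entries_to_fetch, entries_to_drop) is hypothetical; it is not a proposal for the actual format, only a way to show what extra data would need to be carried.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class SlimCacheEntry:
    """Hypothetical extra data a slim tree cache entry would carry."""
    cpv: str                 # category/package-version, e.g. "app-misc/foo-1.0"
    repository: str          # main tree or the name of an overlay (item 1)
    tarball_sha256: str      # digest of the ebuild tar-ball (item 2)
    tags: FrozenSet[str] = frozenset()  # searchable tags (item 3)


def entries_to_fetch(old: Dict[str, SlimCacheEntry],
                     new: Dict[str, SlimCacheEntry]) -> Set[str]:
    """Keys that are new or changed between two cache revisions (item 5)."""
    return {cpv for cpv, entry in new.items() if old.get(cpv) != entry}


def entries_to_drop(old: Dict[str, SlimCacheEntry],
                    new: Dict[str, SlimCacheEntry]) -> Set[str]:
    """Keys that were removed from the tree between two cache revisions."""
    return set(old) - set(new)
```

With something like this, an update would only need to transfer the records named by entries_to_fetch plus the list from entries_to_drop, rather than the whole cache.
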
120 >
121 > > Is there any previous work on this or a project that might overlap
122 > > with this project? Such as, an attempt at a new parser for portage.
123 >
124 > I know that Mounir Lamouri (volkmar@gentoo) has been thinking about
125 > a new cache format that will use a single file for the whole cache.
126 >
127 > > Will there be mandatory discussion between the person doing this
128 > > project and the person doing the tags support project?
129 >
130 > Tags are a separate project.
131 >
132 > > Is improving the performance of the cache and/or search feature
133 > > a mandatory goal of this project?
134 >
135 > Well, the cache should probably all go in a single file, and that
136 > will probably improve performance because generally it's faster to
137 > load one big file than a bunch of small files.
138 >
139 > > Thank you.
140 > >
141 > > [1] http://en.gentoo-wiki.com/wiki/Google_Summer_of_Code_2010_ideas#Cache_sync
142 > >
143 > --
144 > Thanks,
145 > Zac
146 >
147
148 Thank you for the information and I will ponder it for a little bit
149 and look at some different design angles.