Hello everybody,

those of us using overlays might have noticed that they can seriously slow
down dependency calculation. This is mostly because of the lack of a metadata
cache.
For overlay maintainers, providing a metadata cache is quite tricky: to be
really consistent and useful it would have to be regenerated after every
commit. That's quite easy to forget or get wrong.

So I sat down, brained some thoughts and played around a bit. Here's what I
came up with:

* server-side, each overlay is checked out
* for every overlay in our list:
  - we add it to make.conf explicitly (avoids any spillover effects)
  - we let egencache generate a metadata cache for that repository
* we rsync the repositories with metadata to a different directory

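As a rough sketch, the per-overlay step above could be scripted like this in
Python. The overlay name and checkout path are made up, and passing
PORTDIR_OVERLAY through the environment is an assumption standing in for the
explicit make.conf entry. --update, --repo and --jobs are existing egencache
options, but this is not necessarily the exact invocation used here. The code
only builds the command and environment; it does not run anything.

```python
import os


def egencache_command(repo_name, jobs=2):
    """Build the argv that regenerates the metadata cache of one repository."""
    return ["egencache", "--update", f"--repo={repo_name}", f"--jobs={jobs}"]


def single_overlay_env(overlay_path, base_env=None):
    """Copy the environment and expose exactly one overlay to portage.

    Assumption: PORTDIR_OVERLAY from the environment stands in for the
    explicit make.conf entry described in the list above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PORTDIR_OVERLAY"] = overlay_path
    return env


# Hypothetical overlay name and checkout location, for illustration only.
overlays = {"example-overlay": "/srv/overlays/checkout/example-overlay"}

for name, path in overlays.items():
    cmd = egencache_command(name)
    # base_env={} keeps the printed output short; a real run would keep the
    # default and inherit the current environment.
    env = single_overlay_env(path, base_env={})
    # A real run would be: subprocess.run(cmd, env=env, check=True)
    print("PORTDIR_OVERLAY=" + env["PORTDIR_OVERLAY"], " ".join(cmd))
```

Handling one overlay per invocation mirrors the "no spillover" idea: each cache
is generated with only that overlay visible.
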
The last step is there to get rid of all the "unneeded" data like .svn
directories, and it can be used to selectively exclude other data that is in
the repo but not needed by end users. It also reduces the chance of a client
copying inconsistent data while the metadata cache is being generated.

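The export step could be sketched as follows in Python. The paths and the
exclude list are assumptions for illustration; -a, --delete, --exclude and
--delete-excluded are standard rsync options. The function only builds the
command line, it does not copy anything.

```python
def export_command(src, dest, excludes=(".svn", ".git", "CVS")):
    """Build an rsync argv that copies a checkout without VCS directories."""
    argv = ["rsync", "-a", "--delete", "--delete-excluded"]
    # One --exclude per pattern; further unwanted data can be added here.
    argv += [f"--exclude={pattern}" for pattern in excludes]
    # A trailing slash on the source copies its contents, not the directory.
    argv += [src.rstrip("/") + "/", dest.rstrip("/") + "/"]
    return argv


# Hypothetical checkout and export directories.
print(" ".join(export_command("/srv/overlays/checkout/example-overlay",
                              "/srv/overlays/export/example-overlay")))
```

Clients only ever see the export directory, so the window in which they could
pick up a half-generated cache shrinks to the duration of this copy.
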
egencache creates the per-repository cache in metadata/cache, so it is nicely
bundled and won't interfere with anything else.

So now we have all repositories, with metadata, in one place. We can start an
rsync daemon sharing the parent directory. For users this makes things easier
- instead of needing cvs, svn, git, darcs, hg, etc. they only need rsync
(which they already have installed!)

Layman gets easier too - it just needs to understand the rsync protocol and
select the right directory(s).

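On the client side, selecting the right directory could be as simple as the
following Python sketch. The host name, rsync module name, overlay name and
destination directory are all hypothetical, and this is not how layman is
implemented; it only shows the shape of the rsync call.

```python
def sync_command(host, module, overlay, dest_root):
    """Build an rsync argv that fetches one overlay from the shared daemon."""
    source = f"rsync://{host}/{module}/{overlay}/"
    dest = dest_root.rstrip("/") + "/" + overlay + "/"
    return ["rsync", "-a", "--delete", source, dest]


# Hypothetical server, module and local storage directory.
print(" ".join(sync_command("rsync.example.org", "overlays",
                            "example-overlay", "/var/lib/overlays")))
```
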
The only issue I have found with this idea relates to eclasses - overriding
in-tree eclasses, to be precise. The problem there is that an override
invalidates in-tree metadata and potentially affects other overlays too. So
that's a bit of a bummer, but then I wonder how common that case is.

For performance, the difference is noticeable. As a very rough pointer, it
takes me ~15 minutes for "emerge -puNDv world" with three overlays and no
metadata cache, and about 75 seconds with a metadata cache. That's of course a
"worst case" scenario.

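Put as a ratio, those two rough timings work out as follows (a trivial Python
calculation using only the numbers quoted above):

```python
uncached_seconds = 15 * 60  # ~15 minutes without a metadata cache
cached_seconds = 75         # ~75 seconds with a metadata cache

speedup = uncached_seconds / cached_seconds
print(f"roughly {speedup:.0f}x faster")  # 900 / 75 = 12
```
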
Generating the metadata cache isn't that expensive - it took about 45 minutes
to initially check out almost everything layman provided and then about an
hour for the first run. Subsequent runs should be much faster and can be run
in parallel per overlay (at least in theory). So unless I missed something
really big and really obvious, it should be "small enough" to be run every
hour or even more often.

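A minimal sketch of the "in parallel per overlay" idea, using Python's
concurrent.futures. The overlay names are placeholders, and whether
concurrent egencache runs are actually safe is exactly the "in theory" part;
this only shows how the fan-out could be structured. The worker is pluggable
so the scheduling can be tried without portage installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def regenerate(repo_name):
    """Run egencache for one repository and return its exit status.

    Assumes the repository is already known to portage under this name.
    """
    argv = ["egencache", "--update", f"--repo={repo_name}"]
    return subprocess.run(argv).returncode


def regenerate_all(repo_names, worker=regenerate, max_workers=4):
    """Call worker once per overlay, several at a time; map name -> result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, repo_names))
    return dict(zip(repo_names, results))


# Dry run with a stub worker instead of egencache; names are placeholders.
print(regenerate_all(["overlay-a", "overlay-b"], worker=lambda name: 0))
```

Threads are enough here because the real work happens in the egencache child
processes; max_workers caps how many of them run at once.
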
Advantages are:
- fewer deps for layman (if it is adapted)
- less complexity client-side
- faster sync performance - especially svn and git transfer way too much, the
  initial checkout of one overlay was >35M of data for a few dozen ebuilds
- less load server-side. Rsync is easy to replicate and relatively cheap.
  Popular overlays will appreciate the reduced traffic :)
- faster dependency calculation
and a few I have already forgotten.

Disadvantages are:
- syncing the main tree can invalidate most of the metadata cache (changed
  eclasses etc.), so you need to sync the overlays at the same time
- the eclass override situation I mentioned earlier
- slower update time (right now users can check out immediately after a
  commit; with this indirection there'd be a 30min+ delay)

If I don't get distracted I might set up a proof-of-concept public rsync
server providing the main repo plus all overlays I can throw in, but it'd have
a low initial update frequency (every 6 hours to daily).

Your thoughts, opinions and other input are appreciated.

Patrick