Gentoo Archives: gentoo-portage-dev

From: Zac Medico <zmedico@g.o>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] Re: [RFC] Package description index file format for faster emerge search actions
Date: Tue, 14 Oct 2014 21:52:27
Message-Id: 543D9B14.8060708@gentoo.org
In Reply to: [gentoo-portage-dev] Re: [RFC] Package description index file format for faster emerge search actions by Martin Vaeth
On 10/14/2014 09:27 AM, Martin Vaeth wrote:
> Zac Medico <zmedico@g.o> wrote:
>>
>> If we really want to index the homepage, then a more extensible format
>> might be better. For example, each line of the index could be a JSON
>> object like this:
>>
>> {"description": "sandbox'd LD_PRELOAD hack", "homepage":
>> "http://www.gentoo.org/proj/en/portage/sandbox/", "package_versions":
>> "sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1"}
>
> ...and when you also add some other data like LICENSE, IUSE, KEYWORDS,
> SLOT and (optionally) {P,R,}DEPEND, you essentially end up with the
> eix database (/var/cache/eix/portage.eix generated by eix-update).

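For illustration only, the one-JSON-object-per-line idea quoted above could be consumed roughly as follows. This is a minimal sketch, not portage code: the function names are hypothetical, and it assumes each object occupies a single physical line (the quoted sample is wrapped only for email) and uses exactly the three keys shown in the quote.

```python
import json


def parse_package_versions(field):
    """Split 'cat/pkg-ver1,ver2,...' into the leading entry and the rest.

    The comma-separated layout is taken from the quoted sample; this
    helper itself is hypothetical.
    """
    first, *rest = field.split(",")
    return first, rest


def search_index(lines, term):
    """Return every index entry whose description contains `term`.

    Matching is a plain case-insensitive substring test.
    """
    term = term.lower()
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the index
        entry = json.loads(line)
        if term in entry["description"].lower():
            hits.append(entry)
    return hits


# The quoted example, joined back into one line.
SAMPLE = [
    '{"description": "sandbox\'d LD_PRELOAD hack", "homepage": '
    '"http://www.gentoo.org/proj/en/portage/sandbox/", "package_versions": '
    '"sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1"}',
]
```

Calling `search_index(SAMPLE, "ld_preload")` returns the single sandbox entry. Every lookup has to decode each line as JSON, which is part of the cost of the extensible layout.
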
After some thought, I'd prefer to stick with the simpler non-extensible
format described in my first email. Reasons include:

1) The package names and descriptions are, by far, the most commonly
searched items. So, for general use, emerge --search/--searchdesc
actions should be sufficient for most users. More advanced queries are
better suited to something like eix-db or sqlite, but the majority of
users have negligible interest in performing such advanced queries, so
it's hard to justify distributing a relatively large binary database
inside the package repository (it puts extra load on the rsync servers).
So, I think it's better to generate such databases on the client side,
using $repository/metadata/md5-cache as a source when available.

2) A plain text index, like the one I originally suggested, is small
enough (1.5 MB for current gentoo-x86) so that the additional load it
puts on the rsync servers should be manageable. Also, for repositories
distributed via a vcs such as git, changes to the plain text index will
transfer efficiently (only differences are transferred).

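To make the client-side approach concrete, here is a minimal sketch of pulling descriptions out of $repository/metadata/md5-cache. It assumes the usual layout of one file per package version under `<category>/<package-version>`, each holding KEY=VALUE lines such as DESCRIPTION= and HOMEPAGE=. The function names are hypothetical, and the sketch does not reproduce the index format from the first email.

```python
from pathlib import Path


def parse_cache_entry(text):
    """Parse the KEY=VALUE lines of one md5-cache entry into a dict."""
    entry = {}
    for line in text.splitlines():
        # Split on the first '=' only; values may contain '=' themselves.
        key, sep, value = line.partition("=")
        if sep:
            entry[key] = value
    return entry


def iter_descriptions(repo_root):
    """Yield (category/package-version, description) pairs."""
    cache_dir = Path(repo_root) / "metadata" / "md5-cache"
    for path in sorted(cache_dir.glob("*/*")):
        if not path.is_file():
            continue
        entry = parse_cache_entry(path.read_text(encoding="utf-8"))
        cpv = f"{path.parent.name}/{path.name}"
        yield cpv, entry.get("DESCRIPTION", "")


def search_descriptions(repo_root, term):
    """Case-insensitive substring search over all cached descriptions."""
    term = term.lower()
    return [
        (cpv, desc)
        for cpv, desc in iter_descriptions(repo_root)
        if term in desc.lower()
    ]
```

A scan like this opens every cache file on each search, which is the cost a precomputed index (plain text or otherwise) is meant to avoid.
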
> The only difference is that the eix database is even somewhat more
> optimized for searching (so that e.g. the homepage and license data
> can be skipped without reading when you are looking for the description)
> and slightly compressed (common words are in a dictionary).
>
> It would make sense that portage and eix agree on a common database
> instead of caching essentially the same information twice.
> This appears especially important in view of things like eix-remote
> where this data is "merged" for various overlays without downloading
> the overlays.

I agree, but as mentioned above, I think it's better to generate such a
database on the client side, using $repository/metadata/md5-cache as a
source when available.

> You can find the description of the current format of the eix database
> in eix-db.html when you emerge eix with USE=doc.

Thanks for the info. I've reviewed the specification, and it looks like
a nice format. However, if we're going to create a shared database that
suits the needs of both portage and eix, then I would prefer to use a
general-purpose RDBMS such as sqlite (sqlightning looks interesting,
btw). Why should we go to the trouble of developing and maintaining a
special-purpose RDBMS when a general-purpose RDBMS suits our needs?
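
As a rough idea of what the sqlite route might look like, here is a minimal sketch using Python's standard sqlite3 module. The table name, columns, and function names are assumptions made for illustration; no schema has been proposed in this thread.

```python
import sqlite3


def build_db(rows, path=":memory:"):
    """Create (or update) a package table from (cpv, description, homepage) rows.

    The schema is hypothetical and deliberately tiny.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS packages ("
        "cpv TEXT PRIMARY KEY, "
        "description TEXT NOT NULL, "
        "homepage TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO packages VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, term):
    """Substring search on descriptions.

    SQLite's LIKE is case-insensitive for ASCII by default; '%' and '_'
    inside `term` act as wildcards because they are not escaped here.
    """
    cur = conn.execute(
        "SELECT cpv, description FROM packages "
        "WHERE description LIKE ? ORDER BY cpv",
        (f"%{term}%",),
    )
    return cur.fetchall()
```

The rows could come from a client-side pass over $repository/metadata/md5-cache, and further columns (LICENSE, IUSE, KEYWORDS, SLOT) could be added without inventing a new file format.
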
--
Thanks,
Zac