Gentoo Archives: gentoo-portage-dev

From:	Zac Medico <zmedico@g.o>
To:	gentoo-portage-dev@l.g.o
Subject:	Re: [gentoo-portage-dev] Re: [RFC] Package description index file format for faster emerge search actions
Date:	Tue, 14 Oct 2014 21:52:27
Message-Id:	`543D9B14.8060708@gentoo.org`
In Reply to:	[gentoo-portage-dev] Re: [RFC] Package description index file format for faster emerge search actions by Martin Vaeth

On 10/14/2014 09:27 AM, Martin Vaeth wrote:
> Zac Medico <zmedico@g.o> wrote:
>>
>> If we really want to index the homepage, then a more extensible format
>> might be better. For example, each line of the index could be a JSON
>> object like this:
>>
>> {"description": "sandbox'd LD_PRELOAD hack", "homepage":
>> "http://www.gentoo.org/proj/en/portage/sandbox/", "package_versions":
>> "sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1"}
>
> ...and when you also add some other data like LICENSE, IUSE, KEYWORDS,
> SLOT and (optionally) {P,R,}DEPEND, you essentially end up with the
> eix database (/var/cache/eix/portage.eix generated by eix-update).

After some thought, I'd prefer to stick with the simpler non-extensible
format described in my first email. Reasons include:

1) The package names and descriptions are, by far, the most commonly
searched items. So, for general use, emerge --search/--searchdesc
actions should be sufficient for most users. More advanced queries are
better suited to something like eix-db or sqlite, but the majority of
users have negligible interest in performing such advanced queries, so
it's hard to justify distributing a relatively large binary database
inside the package repository (it puts extra load on the rsync servers).
So, I think it's better to generate such databases on the client side,
using $repository/metadata/md5-cache as a source when available.

2) A plain text index, like the one I originally suggested, is small
enough (1.5 MB for current gentoo-x86) so that the additional load it
puts on the rsync servers should be manageable. Also, for repositories
distributed via a vcs such as git, changes to the plain text index will
transfer efficiently (only differences are transferred).
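
As a rough illustration of the client-side approach, here is a minimal
sketch that reads one md5-cache entry (plain KEY=VALUE lines) and turns
it into a single line of a text index. The function names, the sample
entry, and the "cpv: description" line layout are all hypothetical and
chosen only for illustration; this is not Portage code, and it is not
the index format from my first email.

```python
def parse_md5_cache_entry(text):
    """Parse the KEY=VALUE lines of one md5-cache entry into a dict."""
    entry = {}
    for line in text.splitlines():
        # Split on the first "=" only, since values may contain "=".
        key, sep, value = line.partition("=")
        if sep:
            entry[key] = value
    return entry


def index_line(cpv, entry):
    """Build one line of a text index (hypothetical layout)."""
    return "{}: {}".format(cpv, entry.get("DESCRIPTION", ""))


# Made-up sample entry, for illustration only.
sample = (
    "DEFINED_PHASES=compile install\n"
    "DESCRIPTION=sandbox'd LD_PRELOAD hack\n"
    "EAPI=5\n"
    "HOMEPAGE=http://www.gentoo.org/proj/en/portage/sandbox/\n"
    "SLOT=0\n"
)

entry = parse_md5_cache_entry(sample)
print(index_line("sys-apps/sandbox-2.6-r1", entry))
```

A real generator would walk $repository/metadata/md5-cache and apply
the same parsing to each file it finds there.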

> The only difference is that the eix database is even somewhat more
> optimized for searching (so that e.g. the homepage and license data
> can be skipped without reading when you are looking for the description)
> and slightly compressed (common words are in a dictionary).
>
> It would make sense that portage and eix agree on a common database
> instead of caching essentially the same information twice.
> This appears especially important in view of things like eix-remote
> where this data is "merged" for various overlays without downloading
> the overlays.

I agree, but as mentioned above, I think it's better to generate such a
database on the client side, using $repository/metadata/md5-cache as a
source when available.

> You can find the description of the current format of the eix database
> in eix-db.html when you emerge eix with USE=doc.

Thanks for the info. I've reviewed the specification, and it looks like
a nice format. However, if we're going to create a shared database that
suits the needs of both portage and eix, then I would prefer to use a
general-purpose RDBMS such as sqlite (sqlightning looks interesting,
btw). Why should we go to the trouble of developing and maintaining a
special-purpose RDBMS, even though a general-purpose RDBMS suits our needs?
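
To make the sqlite idea concrete, here is a minimal sketch using
Python's standard sqlite3 module. The table layout, the function names,
and the second sample row are assumptions made up for illustration;
nothing here is an agreed schema for portage or eix.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory database from (cpv, description, homepage) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE packages ("
        "cpv TEXT PRIMARY KEY, description TEXT, homepage TEXT)"
    )
    conn.executemany("INSERT INTO packages VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search_description(conn, term):
    """Return the cpv of every package whose description contains term.

    LIKE is case-insensitive for ASCII by default; note that any "%"
    or "_" inside term acts as a wildcard in this simple sketch.
    """
    cur = conn.execute(
        "SELECT cpv FROM packages WHERE description LIKE ? ORDER BY cpv",
        ("%" + term + "%",),
    )
    return [row[0] for row in cur]


# Sample rows; the second one is invented for illustration.
rows = [
    ("sys-apps/sandbox-2.6-r1", "sandbox'd LD_PRELOAD hack",
     "http://www.gentoo.org/proj/en/portage/sandbox/"),
    ("app-misc/example-1.0", "an example package", ""),
]

conn = build_db(rows)
print(search_description(conn, "preload"))
```

A client-side tool could fill such a table from md5-cache and leave
the more advanced queries to plain SQL.
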
--
Thanks,
Zac
