On 10/14/2014 09:27 AM, Martin Vaeth wrote:
> Zac Medico <zmedico@g.o> wrote:
>>
>> If we really want to index the homepage, then a more extensible format
>> might be better. For example, each line of the index could be a JSON
>> object like this:
>>
>> {"description": "sandbox'd LD_PRELOAD hack", "homepage":
>> "http://www.gentoo.org/proj/en/portage/sandbox/", "package_versions":
>> "sys-apps/sandbox-1.6-r2,2.3-r1,2.4,2.5,2.6-r1"}
>
> ...and when you also add some other data like LICENSE, IUSE, KEYWORDS,
> SLOT and (optionally) {P,R,}DEPEND, you essentially end up with the
> eix database (/var/cache/eix/portage.eix generated by eix-update).

After some thought, I'd prefer to stick with the simpler non-extensible
format described in my first email. Reasons include:

1) The package names and descriptions are, by far, the most commonly
searched items. So, for general use, emerge --search/--searchdesc
actions should be sufficient for most users. More advanced queries are
better suited to something like eix-db or sqlite, but the majority of
users have negligible interest in performing such advanced queries, so
it's hard to justify distributing a relatively large binary database
inside the package repository (it puts extra load on the rsync servers).
So, I think it's better to generate such databases on the client side,
using $repository/metadata/md5-cache as a source when available.

2) A plain text index, like the one I originally suggested, is small
enough (1.5 MB for current gentoo-x86) so that the additional load it
puts on the rsync servers should be manageable. Also, for repositories
distributed via a vcs such as git, changes to the plain text index will
transfer efficiently (only differences are transferred).
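
As a rough illustration of the client-side approach, here is a minimal
Python sketch that reads md5-cache entries and emits one plain-text line
per package version. It assumes each entry is a file at
$repository/metadata/md5-cache/<category>/<package-version> containing
KEY=value lines such as DESCRIPTION=... The function names and the
"category/package-version: description" line layout are made up for this
example; this is not the index format from my first email, and it is not
how portage itself reads the cache.

```python
import os


def parse_cache_entry(text):
    """Parse one md5-cache entry (assumed: one KEY=value pair per line)."""
    entry = {}
    for line in text.splitlines():
        # Split on the first '=' only, since values may contain '='.
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' at all
            entry[key] = value
    return entry


def index_line(category, pf, entry):
    """Format one line of the hypothetical plain-text index."""
    return f"{category}/{pf}: {entry.get('DESCRIPTION', '')}"


def build_index(cache_dir):
    """Walk cache_dir/<category>/<package-version>; return sorted index lines."""
    lines = []
    for category in sorted(os.listdir(cache_dir)):
        cat_path = os.path.join(cache_dir, category)
        if not os.path.isdir(cat_path):
            continue  # skip stray files at the top level
        for pf in sorted(os.listdir(cat_path)):
            with open(os.path.join(cat_path, pf), encoding="utf-8") as f:
                entry = parse_cache_entry(f.read())
            lines.append(index_line(category, pf, entry))
    return lines
```

Because the output is sorted and has one line per package version, a
small change in the repository should produce a small textual diff,
which is the property that matters for rsync and vcs transfers.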

> The only difference is that the eix database is even somewhat more
> optimized for searching (so that e.g. the homepage and license data
> can be skipped without reading when you are looking for the description)
> and slightly compressed (common words are in a dictionary).
>
> It would make sense that portage and eix agree on a common database
> instead of caching essentially the same information twice.
> This appears especially important in view of things like eix-remote
> where this data is "merged" for various overlays without downloading
> the overlays.

I agree, but as mentioned above, I think it's better to generate such a
database on the client side, using $repository/metadata/md5-cache as a
source when available.

> You can find the description of the current format of the eix database
> in eix-db.html when you emerge eix with USE=doc.

Thanks for the info. I've reviewed the specification, and it looks like
a nice format. However, if we're going to create a shared database that
suits the needs of both portage and eix, then I would prefer to use a
general-purpose RDBMS such as sqlite (sqlightning looks interesting,
btw). Why should we go to the trouble of developing and maintaining a
special-purpose RDBMS when a general-purpose RDBMS suits our needs?
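
To show roughly what I have in mind, here is a minimal sketch using the
sqlite3 module from the Python standard library. The table layout, the
column names, and the create_db/search helpers are assumptions made up
for this example, not an agreed schema. The search is only loosely
analogous to emerge --search/--searchdesc: it is a plain substring match
via LIKE, so '%' and '_' in the search term act as wildcards.

```python
import sqlite3


def create_db(rows):
    """Build an in-memory database from (cpv, description, homepage) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE packages ("
        "cpv TEXT PRIMARY KEY, description TEXT, homepage TEXT)"
    )
    conn.executemany("INSERT INTO packages VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, term, include_description=False):
    """Substring search on the package name, optionally on the description.

    SQLite's LIKE is case-insensitive for ASCII characters by default.
    """
    pattern = f"%{term}%"
    sql = "SELECT cpv FROM packages WHERE cpv LIKE ?"
    params = [pattern]
    if include_description:
        sql += " OR description LIKE ?"
        params.append(pattern)
    return [row[0] for row in conn.execute(sql + " ORDER BY cpv", params)]
```

The rows could come from $repository/metadata/md5-cache on the client
side, so nothing binary would need to be distributed in the repository.
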
--
Thanks,
Zac