1 |
On Sun, 23 Nov 2008 07:17:40 -0500 |
2 |
"Emma Strubell" <emma.strubell@×××××.com> wrote: |
3 |
|
4 |
> However, I've started looking at the code, and I must admit I'm pretty |
5 |
> overwhelmed! I don't know where to start. I was wondering if anyone |
6 |
> on here could give me a quick overview of how the search function |
7 |
> currently works, an idea as to what could be modified or implemented |
8 |
> in order to improve the running time of this code, or any tip really |
9 |
> as to where I should start or what I should start looking at. I'd |
10 |
> really appreciate any help or advice!! |
11 |
|
12 |
Well, it depends how much effort you want to put into this. The current |
13 |
interface doesn't actually provide a "search" interface, but merely |
14 |
functions to |
15 |
1) list all package names - dbapi.cp_all() |
16 |
2) list all package names and versions - dbapi.cpv_all() |
17 |
3) list all versions for a given package name - dbapi.cp_list() |
18 |
4) read metadata (like DESCRIPTION) for a given package name and |
19 |
version - dbapi.aux_get() |
20 |
|
21 |
One of the main performance problems of --search is that there is no |
22 |
persistent cache for functions 1, 2 and 3, so if you're "just" |
23 |
interested in performance aspects you might want to look into that. |
24 |
The issue with implementing a persistent cache is that you have to |
25 |
consider both cold and hot filesystem cache cases: Loading an index |
26 |
file with package names and versions might improve the cold-cache case, |
27 |
but slow things down when the filesystem cache is populated. |
28 |
As has been mentioned, keeping the index updated is the other major |
29 |
issue, especially as it has to be portable and should require little or |
30 |
no configuration/setup for the user (so no extra daemons or special |
31 |
filesystems running permanently in the background). The obvious |
32 |
solution would be to generate the cache after `emerge --sync` (and other |
33 |
sync implementations) and hope that people don't modify their tree and |
34 |
search for the changes in between (that's what all the external tools |
35 |
do). I don't know if there is actually a way to do online updates while |
36 |
still improving performance and not relying on custom system daemons |
37 |
running in the background. |
38 |
|
39 |
As for --searchdesc, one problem is that dbapi.aux_get() can only |
40 |
operate on a single package-version on each call (though it can read |
41 |
multiple metadata variables). So for description searches the control |
42 |
flow is like this (obviously simplified): |
43 |
|
44 |
result = [] |
45 |
# iterate over all packages |
46 |
for package in dbapi.cp_all(): |
47 |
# determine the current version of each package, this is |
48 |
# another performance issue. |
49 |
version = get_current_version(package) |
50 |
# read package description from metadata cache |
51 |
description = dbapi.aux_get(version, ["DESCRIPTION"])[0] |
52 |
# check if the description matches |
53 |
if matches(description, searchkey): |
54 |
result.append(package) |
55 |
|
56 |
There you see the three bottlenecks: the lack of a pregenerated package |
57 |
list, the version lookup for *each* package and the actual metadata |
58 |
read. I've already talked about the first, so lets look at the other |
59 |
two. The core problem there is that DESCRIPTION (like all standard |
60 |
metadata variables) is version specific, so to access it you need to |
61 |
determine a version to use, even though in almost all cases the |
62 |
description is the same (or very similar) for all versions. So the |
63 |
proper solution would be to make the description a property of the |
64 |
package name instead of the package version, but that's a _huge_ task |
65 |
you're probably not interested in. What _might_ work here is to add |
66 |
support for an optional package-name->description cache that can be |
67 |
generated offline and includes those packages where all versions have |
68 |
the same description, and fall back to the current method if the |
69 |
package is not included in the cache. (Don't think about caching the |
70 |
version lookup, that's system dependent and therefore not suitable for |
71 |
caching). |
72 |
|
73 |
Hope it has become clear that while the actual search algorithm might |
74 |
be simple and not very efficient, the real problem lies in getting the |
75 |
data to operate on. |
76 |
|
77 |
That and the somewhat limited dbapi interface. |
78 |
|
79 |
Disclaimer: The stuff below involves extending and redesigning some |
80 |
core portage APIs. This isn't something you can do on a weekend, only |
81 |
work on this if you want to commit yourself to portage development |
82 |
for a long time. |
83 |
|
84 |
The functions listed above are the bare minimum to |
85 |
perform queries on the package repositories, but they're very |
86 |
low-level. That means that whenever you want to select packages by |
87 |
name, description, license, dependencies or other variables you need |
88 |
quite a bit of custom code, more if you want to combine multiple |
89 |
searches, and much more if you want to do it efficient and flexible. |
90 |
See http://dev.gentoo.org/~genone/scripts/metalib.py and |
91 |
http://dev.gentoo.org/~genone/scripts/metascan for a somewhat flexible, |
92 |
but very inefficient search tool (might not work anymore due to old |
93 |
age). |
94 |
|
95 |
Ideally repository searches could be done without writing any |
96 |
application code using some kind of query language, similar to how SQL |
97 |
works for generic database searches (obviously not that complex). |
98 |
But before thinking about that we'd need a query API that actually |
99 |
a) allows tools to assemble queries without having to worry about |
100 |
implementation details |
101 |
b) run them efficiently without bothering the API user |
102 |
|
103 |
Simple example: Find all package-versions in the sys-apps category that |
104 |
are BSD-licensed. |
105 |
|
106 |
Currently that would involve something like: |
107 |
|
108 |
result = [] |
109 |
for package is dbapi.cp_all(): |
110 |
if not package.startswith("sys-apps/"): |
111 |
continue |
112 |
for version in dbapi.cp_list(package): |
113 |
license = dbapi.aux_get(version, ["LICENSE"])[0] |
114 |
# for simplicity perform a equivalence check, in reality you'd |
115 |
# have to account for complex license definitions |
116 |
if license == "BSD": |
117 |
result.append(version) |
118 |
|
119 |
Not very friendly to maintain, and not very efficient (we'd only need |
120 |
to iterate over packages in the 'sys-apps' category, but the interface |
121 |
doesn't allow that). |
122 |
And now how it might look with a extensive query interface: |
123 |
|
124 |
query = AndQuery() |
125 |
query.add(CategoryQuery("sys-apps", FullStringMatch())) |
126 |
query.add(MetadataQuery("BSD", FullStringMatch())) |
127 |
result = repository.selectPackages(query) |
128 |
|
129 |
Much nicer, don't you think? |
130 |
|
131 |
As said, implementing such a thing would be a huge amount of work, even |
132 |
if just implemented as wrappers on top of the current interface (which |
133 |
would prevent many efficiency improvements), but if you (or anyone else |
134 |
for that matter) are truly interested in this contact me off-list, |
135 |
maybe I can find some of my old design ideas and (incomplete) |
136 |
prototypes to give you a start. |
137 |
|
138 |
Marius |