Gentoo Archives: gentoo-portage-dev

From: Marius Mauch <genone@g.o>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] search functionality in emerge
Date: Mon, 24 Nov 2008 03:13:33
Message-Id: 20081124041257.755679eb.genone@gentoo.org
In Reply to: [gentoo-portage-dev] search functionality in emerge by Emma Strubell
1 On Sun, 23 Nov 2008 07:17:40 -0500
2 "Emma Strubell" <emma.strubell@×××××.com> wrote:
3
4 > However, I've started looking at the code, and I must admit I'm pretty
5 > overwhelmed! I don't know where to start. I was wondering if anyone
6 > on here could give me a quick overview of how the search function
7 > currently works, an idea as to what could be modified or implemented
8 > in order to improve the running time of this code, or any tip really
9 > as to where I should start or what I should start looking at. I'd
10 > really appreciate any help or advice!!
11
12 Well, it depends how much effort you want to put into this. The current
13 interface doesn't actually provide a "search" interface, but merely
14 functions to
15 1) list all package names - dbapi.cp_all()
16 2) list all package names and versions - dbapi.cpv_all()
17 3) list all versions for a given package name - dbapi.cp_list()
18 4) read metadata (like DESCRIPTION) for a given package name and
19 version - dbapi.aux_get()
20
21 One of the main performance problems of --search is that there is no
22 persistent cache for functions 1, 2 and 3, so if you're "just"
23 interested in performance aspects you might want to look into that.
24 The issue with implementing a persistent cache is that you have to
25 consider both cold and hot filesystem cache cases: Loading an index
26 file with package names and versions might improve the cold-cache case,
27 but slow things down when the filesystem cache is populated.
28 As has been mentioned, keeping the index updated is the other major
29 issue, especially as it has to be portable and should require little or
30 no configuration/setup for the user (so no extra daemons or special
31 filesystems running permanently in the background). The obvious
32 solution would be to generate the cache after `emerge --sync` (and other
33 sync implementations) and hope that people don't modify their tree and
34 search for the changes in between (that's what all the external tools
35 do). I don't know if there is actually a way to do online updates while
36 still improving performance and not relying on custom system daemons
37 running in the background.
38
39 As for --searchdesc, one problem is that dbapi.aux_get() can only
40 operate on a single package-version on each call (though it can read
41 multiple metadata variables). So for description searches the control
42 flow is like this (obviously simplified):
43
44 result = []
45 # iterate over all packages
46 for package in dbapi.cp_all():
47 # determine the current version of each package, this is
48 # another performance issue.
49 version = get_current_version(package)
50 # read package description from metadata cache
51 description = dbapi.aux_get(version, ["DESCRIPTION"])[0]
52 # check if the description matches
53 if matches(description, searchkey):
54 result.append(package)
55
56 There you see the three bottlenecks: the lack of a pregenerated package
57 list, the version lookup for *each* package and the actual metadata
58 read. I've already talked about the first, so lets look at the other
59 two. The core problem there is that DESCRIPTION (like all standard
60 metadata variables) is version specific, so to access it you need to
61 determine a version to use, even though in almost all cases the
62 description is the same (or very similar) for all versions. So the
63 proper solution would be to make the description a property of the
64 package name instead of the package version, but that's a _huge_ task
65 you're probably not interested in. What _might_ work here is to add
66 support for an optional package-name->description cache that can be
67 generated offline and includes those packages where all versions have
68 the same description, and fall back to the current method if the
69 package is not included in the cache. (Don't think about caching the
70 version lookup, that's system dependent and therefore not suitable for
71 caching).
72
73 Hope it has become clear that while the actual search algorithm might
74 be simple and not very efficient, the real problem lies in getting the
75 data to operate on.
76
77 That and the somewhat limited dbapi interface.
78
79 Disclaimer: The stuff below involves extending and redesigning some
80 core portage APIs. This isn't something you can do on a weekend, only
81 work on this if you want to commit yourself to portage development
82 for a long time.
83
84 The functions listed above are the bare minimum to
85 perform queries on the package repositories, but they're very
86 low-level. That means that whenever you want to select packages by
87 name, description, license, dependencies or other variables you need
88 quite a bit of custom code, more if you want to combine multiple
89 searches, and much more if you want to do it efficient and flexible.
90 See http://dev.gentoo.org/~genone/scripts/metalib.py and
91 http://dev.gentoo.org/~genone/scripts/metascan for a somewhat flexible,
92 but very inefficient search tool (might not work anymore due to old
93 age).
94
95 Ideally repository searches could be done without writing any
96 application code using some kind of query language, similar to how SQL
97 works for generic database searches (obviously not that complex).
98 But before thinking about that we'd need a query API that actually
99 a) allows tools to assemble queries without having to worry about
100 implementation details
101 b) run them efficiently without bothering the API user
102
103 Simple example: Find all package-versions in the sys-apps category that
104 are BSD-licensed.
105
106 Currently that would involve something like:
107
108 result = []
109 for package is dbapi.cp_all():
110 if not package.startswith("sys-apps/"):
111 continue
112 for version in dbapi.cp_list(package):
113 license = dbapi.aux_get(version, ["LICENSE"])[0]
114 # for simplicity perform a equivalence check, in reality you'd
115 # have to account for complex license definitions
116 if license == "BSD":
117 result.append(version)
118
119 Not very friendly to maintain, and not very efficient (we'd only need
120 to iterate over packages in the 'sys-apps' category, but the interface
121 doesn't allow that).
122 And now how it might look with a extensive query interface:
123
124 query = AndQuery()
125 query.add(CategoryQuery("sys-apps", FullStringMatch()))
126 query.add(MetadataQuery("BSD", FullStringMatch()))
127 result = repository.selectPackages(query)
128
129 Much nicer, don't you think?
130
131 As said, implementing such a thing would be a huge amount of work, even
132 if just implemented as wrappers on top of the current interface (which
133 would prevent many efficiency improvements), but if you (or anyone else
134 for that matter) are truly interested in this contact me off-list,
135 maybe I can find some of my old design ideas and (incomplete)
136 prototypes to give you a start.
137
138 Marius

Replies

Subject Author
Re: [gentoo-portage-dev] search functionality in emerge devsk <funtoos@×××××.com>