Gentoo Archives: gentoo-portage-dev

From: devsk <funtoos@×××××.com>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] search functionality in emerge
Date: Mon, 24 Nov 2008 05:01:43
In Reply to: Re: [gentoo-portage-dev] search functionality in emerge by Marius Mauch
> not relying on custom system daemons running in the background.

Why is a portage daemon such a bad thing? Or hard to do? I would very much like a daemon running on my system which I can configure to sync the portage tree once a week (or month if I am lazy), give me a summary of hot fixes and security fixes in a nice email, push important announcements and, of course, sync caches on detecting changes (which should be trivial with notify daemons all over the place) etc. Why is it such a bad thing?

It's crazy to think that security updates need to be pulled in Linux.

-devsk

----- Original Message ----
From: Marius Mauch <genone@g.o>
To: gentoo-portage-dev@l.g.o
Sent: Sunday, November 23, 2008 7:12:57 PM
Subject: Re: [gentoo-portage-dev] search functionality in emerge

On Sun, 23 Nov 2008 07:17:40 -0500
"Emma Strubell" <emma.strubell@×××××.com> wrote:

> However, I've started looking at the code, and I must admit I'm pretty
> overwhelmed! I don't know where to start. I was wondering if anyone
> on here could give me a quick overview of how the search function
> currently works, an idea as to what could be modified or implemented
> in order to improve the running time of this code, or any tip really
> as to where I should start or what I should start looking at. I'd
> really appreciate any help or advice!!

Well, it depends how much effort you want to put into this. The current
interface doesn't actually provide a "search" interface, but merely
functions to
1) list all package names - dbapi.cp_all()
2) list all package names and versions - dbapi.cpv_all()
3) list all versions for a given package name - dbapi.cp_list()
4) read metadata (like DESCRIPTION) for a given package name and
   version - dbapi.aux_get()

One of the main performance problems of --search is that there is no
persistent cache for functions 1, 2 and 3, so if you're "just"
interested in performance aspects you might want to look into that.
The issue with implementing a persistent cache is that you have to
consider both cold and hot filesystem cache cases: Loading an index
file with package names and versions might improve the cold-cache case,
but slow things down when the filesystem cache is populated.
As has been mentioned, keeping the index updated is the other major
issue, especially as it has to be portable and should require little or
no configuration/setup for the user (so no extra daemons or special
filesystems running permanently in the background). The obvious
solution would be to generate the cache after `emerge --sync` (and other
sync implementations) and hope that people don't modify their tree and
search for the changes in between (that's what all the external tools
do). I don't know if there is actually a way to do online updates while
still improving performance and not relying on custom system daemons
running in the background.

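To make the "generate the cache after sync" idea concrete, here is a minimal
sketch. It is not portage code: FakeDbapi is a hypothetical in-memory
stand-in that only mimics dbapi.cp_all() and dbapi.cp_list(), and the JSON
file layout and the function names write_index, load_index and
all_package_names are assumptions made for illustration.

```python
import json
import os


class FakeDbapi:
    """Hypothetical stand-in for the portage dbapi (assumption, not real API)."""

    def __init__(self, tree):
        # tree maps "category/package" -> list of "category/package-version"
        self._tree = tree

    def cp_all(self):
        return sorted(self._tree)

    def cp_list(self, cp):
        return list(self._tree.get(cp, []))


def write_index(dbapi, path):
    """Dump package names and versions to a file; meant to run after a sync."""
    index = {cp: dbapi.cp_list(cp) for cp in dbapi.cp_all()}
    # Write to a temporary file and rename it, so a concurrent reader
    # never sees a half-written index.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)
    os.replace(tmp_path, path)


def load_index(path):
    """Return the stored index, or None if it has not been generated yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None


def all_package_names(dbapi, path):
    """Use the index when present, otherwise fall back to walking the tree."""
    index = load_index(path)
    if index is None:
        return dbapi.cp_all()
    return sorted(index)
```

Note that all_package_names() trusts the index blindly once it exists, so
changes made to the tree after the last write_index() call stay invisible
until the next sync. That is exactly the staleness trade-off described above.
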
As for --searchdesc, one problem is that dbapi.aux_get() can only
operate on a single package-version on each call (though it can read
multiple metadata variables). So for description searches the control
flow is like this (obviously simplified):

result = []
# iterate over all packages
for package in dbapi.cp_all():
    # determine the current version of each package, this is
    # another performance issue.
    version = get_current_version(package)
    # read package description from metadata cache
    description = dbapi.aux_get(version, ["DESCRIPTION"])[0]
    # check if the description matches
    if matches(description, searchkey):
        result.append(package)

There you see the three bottlenecks: the lack of a pregenerated package
list, the version lookup for *each* package and the actual metadata
read. I've already talked about the first, so let's look at the other
two. The core problem there is that DESCRIPTION (like all standard
metadata variables) is version specific, so to access it you need to
determine a version to use, even though in almost all cases the
description is the same (or very similar) for all versions. So the
proper solution would be to make the description a property of the
package name instead of the package version, but that's a _huge_ task
you're probably not interested in. What _might_ work here is to add
support for an optional package-name->description cache that can be
generated offline and includes those packages where all versions have
the same description, and fall back to the current method if the
package is not included in the cache. (Don't think about caching the
version lookup, that's system dependent and therefore not suitable for
caching).

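A rough sketch of that optional package-name->description cache, under
stated assumptions: FakeDbapi is a hypothetical stand-in that only mimics the
list-returning behaviour of dbapi.aux_get(), and build_description_cache,
get_description and the get_current_version callback are illustrative names,
not existing portage functions.

```python
class FakeDbapi:
    """Hypothetical stand-in for the portage dbapi (assumption, not real API)."""

    def __init__(self, descriptions):
        # descriptions maps "category/package-version" -> DESCRIPTION string
        self._descriptions = descriptions

    def aux_get(self, cpv, keys):
        # Like dbapi.aux_get(): one value per requested key, in order.
        return [self._descriptions[cpv] if key == "DESCRIPTION" else ""
                for key in keys]


def build_description_cache(dbapi, versions_by_package):
    """Offline step: keep packages whose versions all share one description."""
    cache = {}
    for package, versions in versions_by_package.items():
        seen = {dbapi.aux_get(cpv, ["DESCRIPTION"])[0] for cpv in versions}
        if len(seen) == 1:
            cache[package] = seen.pop()
    return cache


def get_description(dbapi, cache, package, get_current_version):
    """Search-time lookup: cache hit if possible, else the per-version path."""
    if package in cache:
        return cache[package]
    # Fallback: the version lookup is system dependent, so it is never cached.
    version = get_current_version(package)
    return dbapi.aux_get(version, ["DESCRIPTION"])[0]
```

Packages with differing descriptions are simply left out of the cache, so
they cost the same as today, while every cached package skips both the
version lookup and the metadata read.
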
Hope it has become clear that while the actual search algorithm might
be simple and not very efficient, the real problem lies in getting the
data to operate on.

That and the somewhat limited dbapi interface.

Disclaimer: The stuff below involves extending and redesigning some
core portage APIs. This isn't something you can do on a weekend, only
work on this if you want to commit yourself to portage development
for a long time.

The functions listed above are the bare minimum to
perform queries on the package repositories, but they're very
low-level. That means that whenever you want to select packages by
name, description, license, dependencies or other variables you need
quite a bit of custom code, more if you want to combine multiple
searches, and much more if you want to do it efficiently and flexibly.
See and
for a somewhat flexible,
but very inefficient search tool (might not work anymore due to old
age).

Ideally repository searches could be done without writing any
application code using some kind of query language, similar to how SQL
works for generic database searches (obviously not that complex).
But before thinking about that we'd need a query API that actually
a) allows tools to assemble queries without having to worry about
   implementation details
b) runs them efficiently without bothering the API user

Simple example: Find all package-versions in the sys-apps category that
are BSD-licensed.

Currently that would involve something like:

result = []
for package in dbapi.cp_all():
    if not package.startswith("sys-apps/"):
        continue
    for version in dbapi.cp_list(package):
        license = dbapi.aux_get(version, ["LICENSE"])[0]
        # for simplicity perform an equivalence check, in reality you'd
        # have to account for complex license definitions
        if license == "BSD":
            result.append(version)

Not very friendly to maintain, and not very efficient (we'd only need
to iterate over packages in the 'sys-apps' category, but the interface
doesn't allow that).
And now how it might look with an extensive query interface:

query = AndQuery()
query.add(CategoryQuery("sys-apps", FullStringMatch()))
query.add(MetadataQuery("BSD", FullStringMatch()))
result = repository.selectPackages(query)

Much nicer, don't you think?

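For illustration only, the query classes named above could be sketched as
thin wrappers over the low-level calls. None of these classes exist in
portage; FakeDbapi and Repository are hypothetical, and the extra "LICENSE"
key argument to MetadataQuery is an assumption added so the sketch knows
which metadata variable to compare.

```python
class FakeDbapi:
    """Hypothetical stand-in for the portage dbapi (assumption, not real API)."""

    def __init__(self, metadata):
        # metadata maps "category/package-version" -> {variable: value}
        self._metadata = metadata

    def cpv_all(self):
        return sorted(self._metadata)

    def aux_get(self, cpv, keys):
        return [self._metadata[cpv].get(key, "") for key in keys]


class FullStringMatch:
    def matches(self, pattern, value):
        return pattern == value


class CategoryQuery:
    def __init__(self, category, matcher):
        self.category, self.matcher = category, matcher

    def accepts(self, dbapi, cpv):
        # The category is everything before the first slash.
        return self.matcher.matches(self.category, cpv.split("/", 1)[0])


class MetadataQuery:
    def __init__(self, key, pattern, matcher):
        self.key, self.pattern, self.matcher = key, pattern, matcher

    def accepts(self, dbapi, cpv):
        value = dbapi.aux_get(cpv, [self.key])[0]
        return self.matcher.matches(self.pattern, value)


class AndQuery:
    def __init__(self):
        self.parts = []

    def add(self, query):
        self.parts.append(query)

    def accepts(self, dbapi, cpv):
        # all() stops at the first failing part, so cheap checks go first.
        return all(part.accepts(dbapi, cpv) for part in self.parts)


class Repository:
    def __init__(self, dbapi):
        self.dbapi = dbapi

    def selectPackages(self, query):
        # Naive wrapper: still scans every package-version.
        return [cpv for cpv in self.dbapi.cpv_all()
                if query.accepts(self.dbapi, cpv)]
```

Because this version still walks every package-version, it only improves the
calling code, not the running time. A real implementation would let the
repository inspect the query and skip whole categories.
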
As said, implementing such a thing would be a huge amount of work, even
if just implemented as wrappers on top of the current interface (which
would prevent many efficiency improvements), but if you (or anyone else
for that matter) are truly interested in this contact me off-list,
maybe I can find some of my old design ideas and (incomplete)
prototypes to give you a start.

Marius


Subject Author
Re: [gentoo-portage-dev] search functionality in emerge Marius Mauch <genone@g.o>
[gentoo-portage-dev] Re: search functionality in emerge Duncan <1i5t5.duncan@×××.net>