Gentoo Archives: gentoo-portage-dev

From: Brian Harring <ferringb@g.o>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] [PATCH] metascan-core
Date: Wed, 14 Dec 2005 10:41:35
Message-Id: 20051214104040.GF10957@nightcrawler.e-centre.net
In Reply to: [gentoo-portage-dev] [PATCH] metascan-core by Marius Mauch
Not intending to be harsh in this email; keep in mind
this is _just_ the technical angle here. What you're attempting
I think is needed, I just see major issues with the form presented.


On Sat, Dec 10, 2005 at 05:11:07AM +0100, Marius Mauch wrote:
> Attached patch adds a new function metascan() in portage.py and aliases
> it to portdbapi.metascan() and vardbapi.metascan(). It's basically the
> same code as in metalib.py (in private/genone/scripts), just mildly
> tweaked to work inside portage.py.
> It provides the core functionality for my metascan script which is now
> basically just a cli wrapper for the function.

I'd rather not see this go into portage.py- I don't think the
interface is up to snuff atm (design reasons at the tail end of the
email, implementation issues following inline).


> --- pym/portage.py 2005-11-13 14:55:16.000000000 +0100
> +++ pym/portage.py 2005-12-09 18:39:44.000000000 +0100
> @@ -4387,6 +4387,110 @@
>     return myslot
>
>
> +def __metascan_resolve_dep_strings(cpv, keys, values, mysettings):
<snip>
> +def __metascan_strip_atoms(keys, values, mysettings):

We're (mostly) all consenting adults; namespace mangling probably
isn't required here (yes, I'm guilty of it occasionally also, I
just don't think it's needed in this case).

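Fex, a single leading underscore marks them as internal just fine
(same functions, just renamed):

    def _metascan_resolve_dep_strings(cpv, keys, values, mysettings):
        ...

    def _metascan_strip_atoms(keys, values, mysettings):
        ...
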
> +def metascan(db, keys, values, negated, mysettings, scanlist=None, operator="or", resolveDepStrings=False, stripAtoms=False, partial=False, regex=False, catlimit=None, debug=0, verbose=False):
> +    """
> +    Function to find all packages matching certain metadata criteria.
> +    """

Doc string is pretty sparse. No explanation of what
keys/values/negated are supposed to be (list? tuple? dict?)- that
should be knowable from the docstring if this is going to be api.
Same with operator, although I really dislike operator's string-based
args.

Basically... without reading over the function and your script for
examples of the calls, this func's prototype/docstring really isn't
usable without doing a lot of digging.

No comment on returns, either- basically, the prototype seems
overgrown, like it needs to be chunked up and simplified, plus given
actual docstrings.
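Fex, something along these lines at minimum (rough sketch; the
wording and the trimmed-down prototype are just illustration):

    def metascan(db, keys, values, negated, mysettings,
            operator="or", verbose=False):
        """find all packages matching metadata criteria.

        db       -- dbapi instance to scan (portdbapi/vardbapi)
        keys     -- sequence of metadata key names, fex ["DEPEND"]
        values   -- sequence of match values, parallel to keys
        negated  -- sequence of bools, parallel to keys; inverts
                    that individual test
        operator -- "or"/"and"; how per-key results are combined

        returns a list of matching cp strings, or cpv strings if
        verbose is set.
        """
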
General code comments, trailing comments on design...

> +    if len(keys) != len(values) or len(keys) != len(negated):
> +        raise IndexError("argument length mismatch")
> +
> +    if scanlist == None:
> +        scanlist = {}
> +        plist = db.cp_all()
> +        for p in plist:
> +            scanlist[p] = []
> +            for pv in db.cp_list(p):
> +                scanlist[p].append(pv)

This is pretty icky; you're scanning the entire tree in a single run,
building it *all* up in memory prior to even doing the actual
matching.

Fex, pass no keys and this still forces a full tree scan for nothing.
Use a generator.
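Fex (untested sketch; _iter_tree is just a name for the example):

    def _iter_tree(db):
        # walk the tree lazily, instead of snapshotting all of it
        # into a dict before any matching happens
        for cp in db.cp_all():
            for cpv in db.cp_list(cp):
                yield cp, cpv

Nothing gets pulled until the caller actually iterates.
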
72
73 > +
74 > + resultlist = []
75 > +
76 > + for p in scanlist:
77 > + if debug > 1:
78 > + sys.stderr.write("Scanning package %s\n" % p)
79 > + # check if we have a category restriction and if that's the case, skip this package if we don't have a match
80 > + if catlimit != None and catsplit(p)[0] not in catlimit:
81 > + if debug > 1:
82 > + sys.stderr.write("Skipping package %s from category %s due to category limit (%s)\n" % (p, catsplit(p)[0], str(catlimit)))
83 > + continue
84 > + for pv in scanlist[p]:
85 > + try:
86 > + result = []
87 > + # this is the slow part, also generates noise if portage cache out of date
88 > + pvalues = db.aux_get(pv, keys)
89 > +
90 > + except KeyError:
91 > + sys.stderr.write("Error while scanning %s\n" % pv)
92
No lib code should *ever* be dumping to stderr on its own for
non-fatal errors.

Yes, portage violates that, but portage code also sucks.

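If the caller needs to hear about the failure, hand it back- fex via
the warnings module (sketch of the same try/except, still inside the
cpv loop):

    import warnings

    try:
        pvalues = db.aux_get(pv, keys)
    except KeyError:
        # caller decides whether this is worth showing the user
        warnings.warn("failed scanning %s, likely stale cache" % pv)
        continue
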
> +                continue
> +
> +            # save original values for debug
> +            if debug > 1:
> +                keys_uniq = []
> +                pvalues_orig = []
> +                for i in range(0, len(keys)):
> +                    if not keys[i] in keys_uniq:
> +                        keys_uniq.append(keys[i])
> +                        pvalues_orig.append(pvalues[i])
Totally unnecessary here. Do a *single* unique pass over keys rather
than doing it 20,000+ times, once for each cpv. Same thing for the
db.aux_get call above.

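Fex, hoisted out of the loops (sketch):

    # build the unique key list a single time, up front
    keys_uniq = []
    for k in keys:
        if k not in keys_uniq:
            keys_uniq.append(k)

    # then per cpv: one aux_get over the unique keys, mapped back
    kv = dict(zip(keys_uniq, db.aux_get(pv, keys_uniq)))
    pvalues = [kv[k] for k in keys]
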
> +
> +            # remove conditional deps whose conditions aren't met
> +            if resolveDepStrings:
> +                pvalues = __metascan_resolve_dep_strings(pv, keys, pvalues, settings)
> +
> +            # we're only interested in the cat/pkg stuff from an atom here, so strip the rest
> +            if stripAtoms:
> +                pvalues = __metascan_strip_atoms(keys, pvalues, settings)
> +
> +            # report also partial matches, e.g. "foo" in "foomatic"
> +            if partial or regex:
> +                for i in range(0, len(pvalues)):

The 0 initial arg isn't needed; range(len(pvalues)) does the same.

> +                    result.append((partial and pvalues[i].find(values[i]) >= 0) \
> +                        or (regex and bool(re.match(values[i], pvalues[i]))))

Should be caching the regex compilations instead of continually
recompiling. Speedup there...

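Fex, compile the patterns once before the package loops (sketch;
values/regex/partial are the metascan args from the patch):

    # up front, outside the loops
    compiled = [re.compile(v) for v in values]

    # inside the loop, match against the precompiled objects
    result.append((partial and pvalues[i].find(values[i]) >= 0)
        or (regex and bool(compiled[i].match(pvalues[i]))))
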
> +
> +            # we're only interested in full matches in general
> +            else:
> +                result = [values[i] in pvalues[i].split() for i in range(0, len(pvalues))]
> +
> +            # some funky logic operations to invert the match if negations were requested
> +            result = [(negated[i] and not result[i]) or (not negated[i] and result[i]) for i in range(0, len(result))]
xor baby... xor.
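I.e.:

    result = [negated[i] ^ result[i] for i in range(len(result))]
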
> +
> +            # more logic stuff for conjunction or disjunction
> +            if (operator == "or" and True in result) or (operator == "and" and not False in result):
> +                if debug > 0:
> +                    sys.stderr.write("Match found: %s\n" % pv)
> +                if debug > 1:
> +                    for i in range(0, len(keys_uniq)):
> +                        sys.stderr.write("%s from %s: %s\n" % (keys_uniq[i], pv, pvalues_orig[i]))
> +                if verbose:
> +                    resultlist.append(pv)
> +                else:
> +                    if not p in resultlist:
> +                        resultlist.append(p)

No no no no no... use a set or a dict. While it's easy/obvious
code-wise, don't do linear searches like that unless you know the set
is going to be _small_ (sub 10, although timeit runs would bear out a
more accurate figure). Little stuff like this, when removed, provides
a _lot_ of speedup- little stuff like this is why people piss and moan
that python is slow, when it's usually bad algos that are at fault...

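Fex (sketch):

    resultlist = set()    # instead of []

    # later- add() already handles the dedup, via a hash lookup
    # rather than a linear scan
    if verbose:
        resultlist.add(pv)
    else:
        resultlist.add(p)
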
> +    return resultlist

This should be a generator, yielding as it goes. Unless I've missed
something obvious from above, this code doesn't have any reason to
build up a list of returns like this instead of just yielding as it
determines matches. Plus side: decreased interim mem usage, and
calling code can display to the user immediately on returns.
Con? You have to change all your appends, and your boolean logic
implementations are going to get more complex, since you'll have to
maintain a stack of potential matches.

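Rough shape of the yielding version (sketch; seen is a new name here,
standing in for the duplicate tracking):

    # up top, in place of resultlist
    seen = set()

    # in place of the appends, down in the match branch
    if verbose:
        yield pv
    elif p not in seen:
        seen.add(p)
        yield p
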
General commentary... this isn't flexible enough. Extension of the
matching code is dependent on extension of the metascan func- adding
an xor (fex) isn't viable without modifying the code.

Further, you can't do subclass matching without abusing scanlist- this
is an indication (imo) that the interface needs an overhaul.

To pull off arbitrary matching, fex
( ( category = blah or package = blah ) and ( description = dar ) )
requires 3 separate calls, reading from the resultlist and handing
it back in as scanlist. This I view as a general design failure;
what's needed is a way to arbitrarily combine restrictions/search
criteria.

This base functionality, wrapped in some serious voodoo classes,
could pull it off, although maintaining it would be ugly. You'd still
have all of the matching logic embedded in a central func, though,
which isn't extensible.

I think you're coming at this from the wrong angle- frankly, I'd
advocate the restriction subsystem design that is sitting in saviour.
Encapsulate the individual restrictions, and encapsulate the boolean
matching logic, within classes.

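Roughly the shape of it (class and attribute names here are made up
for the example- not saviour's actual api):

    class StrMatch(object):
        """does the metadata key's value contain our value?"""
        def __init__(self, key, value, negate=False):
            self.key, self.value, self.negate = key, value, negate
        def match(self, pkg):
            # assumes pkg.data is a metadata dict, for the example
            return (self.value in pkg.data.get(self.key, "").split()) \
                != self.negate

    class AndRestriction(object):
        def __init__(self, *children):
            self.children = children
        def match(self, pkg):
            for c in self.children:
                if not c.match(pkg):
                    return False
            return True

    class OrRestriction(AndRestriction):
        def match(self, pkg):
            for c in self.children:
                if c.match(pkg):
                    return True
            return False

Your example above then nests naturally, and an xor is just one more
class rather than a rewrite of metascan:

    restrict = AndRestriction(
        OrRestriction(StrMatch("CATEGORY", "blah"),
            StrMatch("PN", "blah")),
        StrMatch("DESCRIPTION", "dar"))
    matches = [pkg for pkg in all_packages if restrict.match(pkg)]

(all_packages being whatever iterable of package objects the repo
hands back.)
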
Go the route the restriction subsystem inherently allows and you've
got extensibility up the yin-yang- you can use either the base
classes provided, or provide your own nonstandard
matching/restrictions and just slap them in. The restriction
subsystem _is_ pretty damn close to the pquery language you talked of
a long while back; the only chunk missing from actually having your
pquery language is a tokenizer that converts sql/pquery-like strings
into restriction elements that can be used for matching.

Pros of the saviour restriction subsystem?
1) it covers *all* matching, atom lookups included. IOW, we don't
have two methods of searching/lookup, we have a single point to
optimize the hell out of.
2) arbitrary depths of boolean terms are built in via the design; no
dance is required of callers to filter down lists, managing the
boolean logic themselves.

Cons?
1) relies on an actual package-agnostic class- iow, whatever format
the 'package' is, it would still work with it. That said, the format
classes have to have a basic common api/set of attributes.
2) query optimization can be a bitch. You've got a tree of
restrictions; an optimizer that is capable of identifying subsets of
categories/packages, instead of just trying all nodes, would be
useful (speed things up by reducing the set of potential matches)-
rough sketch below.
3) bound to repo design, which also binds config crap per package.
This is actually a pro from a design/maintenance angle, but is a con
since it's not something easily retrofitted into stable (note I'm
reversing my stance here; I think it's doable although it's going to
require a helluva lot of work).
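For (2), the rough idea, building on the sketch classes above:

    def category_hints(restrict):
        # pull an obvious category limit out of the tree, so the
        # walk can skip whole categories instead of testing every
        # node; conservatively returns None when it can't narrow
        if isinstance(restrict, StrMatch):
            if restrict.key == "CATEGORY" and not restrict.negate:
                return set([restrict.value])
            return None
        if isinstance(restrict, AndRestriction) \
            and not isinstance(restrict, OrRestriction):
            # any child of an AND that pins the category pins the
            # whole conjunction
            for child in restrict.children:
                hints = category_hints(child)
                if hints is not None:
                    return hints
        return None
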
So... why am I whoring the restriction subsystem? Well, it's the
culmination of our talks. It *is* effectively a pquery implementation
at its core. It's what we originally talked about as being desirable,
the best route to go- hell, even the string output is in line with
your original pquery syntax (sql-like).

Matching code being pushed into stable I'm for, but I'd rather see an
api/framework pushed in that is designed for long-term (ab)use. The
patch provided is usable for your script, but any code that wanted to
do similar complex scans is forced to duplicate the boolean crap-
that's a really bad thing, something portage already suffers horribly
from.

Any matching functionality pushed into portage is going to be used by
porthole and other portage consumers- the api _does_ need to be pretty
sane (even if we may redesign everything else down the line). Complex
search query capabilities in the consumers will invariably be bound
to the underlying portage implementation, although hopefully at least
somewhat decoupled- hence my belief this particular area _needs_ to
be right from the get-go.

So... my 2 cents, which admittedly can be construed as biased since
I've already taken the plunge with the restriction subsystem.

Just think it's the proper way to go; the centralized design route
you're going here I think is going to lead to unmaintainable code the
more power that's jimmied into it.
~harring