Not intending to be harsh in this email; keep in mind this is
_just_ the technical angle here. What you're attempting I think is
needed, I just see major issues with the form presented.
|
On Sat, Dec 10, 2005 at 05:11:07AM +0100, Marius Mauch wrote:
> Attached patch adds a new function metascan() in portage.py and aliases
> it to portdbapi.metascan() and vardbapi.metascan(). It's basically the
> same code as in metalib.py (in private/genone/scripts), just mildly
> tweaked to work inside portage.py.
> It provides the core functionality for my metascan script which is now
> basically just a cli wrapper for the function.
|
I'd rather not see this go into portage.py; I don't think the
interface is up to snuff at the moment (design reasons are at the
tail end of the email, implementation issues follow inline).
|
> --- pym/portage.py 2005-11-13 14:55:16.000000000 +0100
> +++ pym/portage.py 2005-12-09 18:39:44.000000000 +0100
> @@ -4387,6 +4387,110 @@
> return myslot
> 
> 
> +def __metascan_resolve_dep_strings(cpv, keys, values, mysettings):
<snip> |
> +def __metascan_strip_atoms(keys, values, mysettings):
|
We're (mostly) all consenting adults, so namespace mangling probably
isn't required here (yes, I'm guilty of it occasionally also, I just
don't think it's required here).
|
> +def metascan(db, keys, values, negated, mysettings, scanlist=None, operator="or", resolveDepStrings=False, stripAtoms=False, partial=False, regex=False, catlimit=None, debug=0, verbose=False):
> + """
> + Function to find all packages matching certain metadata criteria.
> + """
|
The docstring is pretty sparse. No explanation of what
keys/values/negated are supposed to be (list? tuple? dict?); that
should be knowable from the docstring if this is going to be API.
Same with operator, although I really dislike operator's
string-based args.

Basically... without reading over the function and your script for
examples of the calls, this func's prototype/docstring really isn't
usable without doing a lot of digging.
|
No comment on returns, either; basically, the prototype seems like
it's overgrown, like it needs to be chunked up and simplified, plus
given actual docstrings.
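Fex, even a minimal docstring along these lines would help (a sketch
only; the parameter types here are my guesses from reading the code,
not anything the patch actually documents):

```python
def metascan(db, keys, values, negated, operator="or"):
    """Find all packages matching the given metadata criteria.

    keys     -- list of metadata key names, fex ["DEPEND", "SLOT"]
    values   -- list of match values, parallel to keys
    negated  -- list of bools, parallel to keys; inverts that term
    operator -- "or" or "and"; how the per-key results are combined
    Returns a list of matching cp strings (cpv strings if verbose).
    """
```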
|
General code comments, trailing comments on design...
|
> + if len(keys) != len(values) or len(keys) != len(negated):
> + raise IndexError("argument length mismatch")
> +
> + if scanlist == None:
> + scanlist = {}
> + plist = db.cp_all()
> + for p in plist:
> + scanlist[p] = []
> + for pv in db.cp_list(p):
> + scanlist[p].append(pv)
|
This is pretty icky; you're scanning the entire tree in a single
run, building it *all* up in memory prior to even doing the actual
matching.

Fex, no keys, and this will force a full scan for nothing.
Use a generator.
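Something along these lines (a sketch; FakeDB here is a made-up
stand-in for the dbapi, just enough to show the shape):

```python
class FakeDB:
    """Stand-in for portage's dbapi; just enough for the sketch."""
    def cp_all(self):
        return ["app-misc/foo", "app-misc/bar"]
    def cp_list(self, p):
        return ["%s-1.0" % p]

def iter_scanlist(db):
    # Lazily yield (package, versions) pairs; nothing is built up
    # in memory until the caller actually asks for the next item.
    for p in db.cp_all():
        yield p, db.cp_list(p)
```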
|
> +
> + resultlist = []
> +
> + for p in scanlist:
> + if debug > 1:
> + sys.stderr.write("Scanning package %s\n" % p)
> + # check if we have a category restriction and if that's the case, skip this package if we don't have a match
> + if catlimit != None and catsplit(p)[0] not in catlimit:
> + if debug > 1:
> + sys.stderr.write("Skipping package %s from category %s due to category limit (%s)\n" % (p, catsplit(p)[0], str(catlimit)))
> + continue
> + for pv in scanlist[p]:
> + try:
> + result = []
> + # this is the slow part, also generates noise if portage cache out of date
> + pvalues = db.aux_get(pv, keys)
> +
> + except KeyError:
> + sys.stderr.write("Error while scanning %s\n" % pv)
|
No lib code should *ever* be dumping to stderr on its own for
non-fatal errors.

Yes, portage violates it, but portage code also sucks.
|
> + continue
> +
> + # save original values for debug
> + if debug > 1:
> + keys_uniq = []
> + pvalues_orig = []
> + for i in range(0, len(keys)):
> + if not keys[i] in keys_uniq:
> + keys_uniq.append(keys[i])
> + pvalues_orig.append(pvalues[i])
Totally unnecessary here. Uniquify keys a *single* time rather
than doing it 20,000+ times, once per cpv. Same thing for the
db.aux_get call above.
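i.e. hoist it out of the loop entirely (a sketch; the helper name is
mine, not from the patch):

```python
def unique_keys(keys):
    # Uniquify once, before the per-cpv loop, preserving the
    # first-seen order the debug output relies on.
    seen = set()
    uniq = []
    for k in keys:
        if k not in seen:
            seen.add(k)
            uniq.append(k)
    return uniq
```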
|
> +
> + # remove conditional deps whose coditions aren't met
> + if resolveDepStrings:
> + pvalues = __metascan_resolve_dep_strings(pv, keys, pvalues, settings)
> +
> + # we're only interested in the cat/pkg stuff from an atom here, so strip the rest
> + if stripAtoms:
> + pvalues = __metascan_strip_atoms(keys, pvalues, settings)
> +
> + # report also partial matches, e.g. "foo" in "foomatic"
> + if partial or regex:
> + for i in range(0, len(pvalues)):
|
The 0 initial arg isn't needed.
|
> + result.append((partial and pvalues[i].find(values[i]) >= 0) \
> + or (regex and bool(re.match(values[i], pvalues[i]))))
|
Should be caching regex compilations instead of continually
recompiling. Speedup there...
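Fex, compile once outside the loop (a sketch; the sample patterns
are made up):

```python
import re

# Pre-compile each user-supplied pattern a single time; calling
# re.match(string_pattern, ...) per cpv pays a cache lookup (or a
# full recompile) on every call, a compiled object skips that.
values = ["foo.*", "bar"]
compiled = [re.compile(v) for v in values]

def matches(pvalues):
    # One compiled pattern per metadata value, applied positionally.
    return [bool(c.match(pv)) for c, pv in zip(compiled, pvalues)]
```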
|
> +
> + # we're only interested in full matches in general
> + else:
> + result = [values[i] in pvalues[i].split() for i in range(0, len(pvalues))]
> +
> + # some funky logic operations to invert the adjust the match if negations were requested
> + result = [(negated[i] and not result[i]) or (not negated[i] and result[i]) for i in range(0, len(result))]
xor baby... xor.
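For booleans, `a != b` *is* xor, so the whole and/or dance collapses
to one comparison (sketch with made-up sample data):

```python
# (negated and not result) or (not negated and result) == negated xor result
negated = [False, True, False]
result  = [True, True, False]
adjusted = [n != r for n, r in zip(negated, result)]
```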
|
> +
> + # more logic stuff for conjunction or disjunction
> + if (operator == "or" and True in result) or (operator == "and" and not False in result):
> + if debug > 0:
> + sys.stderr.write("Match found: %s\n" % pv)
> + if debug > 1:
> + for i in range(0, len(keys_uniq)):
> + sys.stderr.write("%s from %s: %s\n" % (keys_uniq[i], pv, pvalues_orig[i]))
> + if verbose:
> + resultlist.append(pv)
> + else:
> + if not p in resultlist:
> + resultlist.append(p)
|
No no no no no... use a set or a dict. While it's easy/obvious
code-wise, don't do linear searches like that unless you know the
collection is going to be _small_ (sub 10, although timeit runs
would bear out a more accurate figure). Little stuff like this, when
removed, provides a _lot_ of speedups; little stuff like this is why
people piss and moan that python is slow when it's usually bad algos
that are at fault...
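i.e. track membership in a set alongside the list (a sketch with
made-up package names):

```python
# O(1) membership test via the set, instead of scanning resultlist
# linearly on every single append.
resultlist = []
seen = set()
for p in ["app-misc/foo", "app-misc/bar", "app-misc/foo"]:
    if p not in seen:
        seen.add(p)
        resultlist.append(p)
```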
|
> + return resultlist
|
This should be a generator, yielding. Unless I've missed something
obvious from above, this code doesn't have any reason to build up a
list of returns like this instead of just yielding as it determines
matches. Plus side, decreased interim mem usage, and calling code
can display to the user immediately on returns.
Con? Have to change all your appends. Your boolean logic
implementations are going to get complex also, since you'll have to
maintain a stack of potential matches.
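The basic shape of what I mean (a sketch; `match` stands in for the
real per-cpv matching logic, which is where the extra bookkeeping
would live):

```python
def metascan_iter(packages, match):
    # Yield each match as soon as it's determined, instead of
    # accumulating a resultlist and returning it at the end.
    for pv in packages:
        if match(pv):
            yield pv

# Caller can print/process results immediately, or collect them:
hits = list(metascan_iter(["foo-1.0", "bar-2.0"],
                          lambda pv: pv.startswith("foo")))
```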
|
General commentary... this isn't flexible enough. Extension of the
matching code is dependent on extension of the metascan func; adding
an xor (fex) isn't viable without modifying the code.
|
Further, you can't do subclass matching without abusing scanlist;
this is an indication (imo) that the interface needs an overhaul.
|
To pull off arbitrary matching, fex
( ( category = blah or package = blah ) and ( description = dar ) )
requires 3 separate calls, raiding from the returnlist and handing
it back in as scanlist. This I view as a general design failure;
what's needed is a way to arbitrarily combine restrictions/search
criteria.
|
This base functionality, wrapped in some serious voodoo classes,
could pull it off, although maintaining it would be ugly. It would
still have all of the matching logic embedded in a central func
though, which isn't extensible.
|
I think you're coming at this from the wrong angle; frankly, I'd
advocate the restriction subsystem design that is sitting in
saviour. Encapsulate the individual restrictions, and encapsulate
the boolean matching logic within classes.
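Roughly the shape I mean (an illustrative sketch only; the class
names and dict-based packages here are made up, not the actual
saviour API):

```python
class Restriction:
    """Base: one encapsulated match criterion."""
    def match(self, pkg):
        raise NotImplementedError

class AttrRestriction(Restriction):
    # Matches when a named attribute of the package equals a value.
    def __init__(self, attr, value):
        self.attr, self.value = attr, value
    def match(self, pkg):
        return pkg.get(self.attr) == self.value

class AndRestriction(Restriction):
    # Boolean combination is itself a Restriction, so terms nest to
    # arbitrary depth without touching any central matching func.
    def __init__(self, *children):
        self.children = children
    def match(self, pkg):
        return all(c.match(pkg) for c in self.children)

class OrRestriction(Restriction):
    def __init__(self, *children):
        self.children = children
    def match(self, pkg):
        return any(c.match(pkg) for c in self.children)

# ((category = blah or package = blah) and (description = dar)),
# expressed as one composable query instead of 3 chained scans:
query = AndRestriction(
    OrRestriction(AttrRestriction("category", "blah"),
                  AttrRestriction("package", "blah")),
    AttrRestriction("description", "dar"))
```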
|
Go the route the restriction subsystem inherently allows and you've
got extensibility up the ying yang: you can use the base classes
provided, or provide your own nonstandard matching/restrictions and
just slap them in. The restriction subsystem _is_ pretty damn close
to the pquery language you talked of a long while back; the only
chunk missing from actually having your pquery language is a
tokenizer that converts sql/pquery-akin strings into restriction
elements that can be used for matching.
|
Pros of the saviour restriction subsystem?
1) It covers *all* matching, atom lookups included. IOW, we don't
have two methods of searching/lookup, we have a single point to
optimize the hell out of.
2) Arbitrary depths of boolean terms are built in via the design; no
dance required by callers to filter down lists, managing the boolean
logic itself.
|
Cons?
1) Relies on an actual package-agnostic class; iow, whatever the
format of the 'package' is, it would still work with it. That said,
format classes have to have a basic common api/set of attributes.
2) Query optimization can be a bitch. You've got a tree of
restrictions; an optimizer capable of identifying a subset of
categories/packages instead of just trying all nodes would be useful
(speed things up by reducing the set of potential matches).
3) Bound to the repo design, which also binds config crap per
package. This is actually a pro from a design/maintenance angle, but
is a con since it's not something easily retrofitted into stable
(note I'm reversing my stance here; I think it's doable although
it's going to require a helluva lot of work).
|
So... why am I whoring the restriction subsystem? Well, it's the
culmination of our talks. It *is* effectively a pquery
implementation at its core. It's what we originally talked about as
being desirable, the best route to go; hell, even the string output
is in line with your original pquery syntax (sql-like).
|
Matching code being pushed into stable I'm for, but I'd rather see
an api/framework pushed in that is designed for long-term (ab)use.
The patch provided is usable for your script, but any code that
wanted to do similar complex scans is forced to duplicate the
boolean crap; that's a really bad thing, something portage already
suffers horribly from.
|
Any matching functionality pushed into portage is going to be used
by porthole and other portage consumers; the api _does_ need to be
pretty sane (even if we may redesign everything else down the line).
Complex search query capabilities in the consumer will invariably be
bound to the underlying portage implementation, although hopefully
at least somewhat decoupled; hence my belief this particular area
_needs_ to be right from the get-go.
|
So... my 2 cents, which admittedly can be construed as biased since
I've already taken the plunge with the restriction subsystem.
|
I just think it's the proper way to go; the centralized design route
you're going here is, I think, going to lead to unmaintainable code
the more power that's jimmied into it.
~harring |