Gentoo Archives: gentoo-portage-dev

From: Brian Harring <ferringb@g.o>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] [PATCH] metascan-core
Date: Wed, 14 Dec 2005 10:41:35
Message-Id: 20051214104040.GF10957@nightcrawler.e-centre.net
In Reply to: [gentoo-portage-dev] [PATCH] metascan-core by Marius Mauch
Not intending to be harsh in this email; keep in mind
this is _just_ the technical angle here. What you're attempting
I think is needed, I just see major issues with the form presented.


On Sat, Dec 10, 2005 at 05:11:07AM +0100, Marius Mauch wrote:
> Attached patch adds a new function metascan() in portage.py and aliases
> it to portdbapi.metascan() and vardbapi.metascan(). It's basically the
> same code as in metalib.py (in private/genone/scripts), just mildly
> tweaked to work inside portage.py.
> It provides the core functionality for my metascan script which is now
> basically just a cli wrapper for the function.

I'd rather not see this go into portage.py- I don't think the
interface is up to snuff atm (design reasons at the tail end of the
email, implementation issues following inline).


> --- pym/portage.py 2005-11-13 14:55:16.000000000 +0100
> +++ pym/portage.py 2005-12-09 18:39:44.000000000 +0100
> @@ -4387,6 +4387,110 @@
>     return myslot
>
>
> +def __metascan_resolve_dep_strings(cpv, keys, values, mysettings):
<snip>
> +def __metascan_strip_atoms(keys, values, mysettings):

We're (mostly) all consenting adults; namespace mangling probably
isn't required here (yes, I'm guilty of it occasionally also, I
just don't think it's needed in this case).

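Fex, a single leading underscore marks them as internal just fine
(same functions, just renamed):

    def _metascan_resolve_dep_strings(cpv, keys, values, mysettings):
        ...

    def _metascan_strip_atoms(keys, values, mysettings):
        ...
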
> +def metascan(db, keys, values, negated, mysettings, scanlist=None, operator="or", resolveDepStrings=False, stripAtoms=False, partial=False, regex=False, catlimit=None, debug=0, verbose=False):
> +    """
> +    Function to find all packages matching certain metadata criteria.
> +    """

Doc string is pretty sparse. No explanation of what
keys/values/negated are supposed to be (list? tuple? dict?)- that
should be knowable from the docstring if this is going to be api.
Same with operator, although I really dislike operator's string-based
args.

Basically... without reading over the function and your script for
examples of the calls, this func's prototype/docstring really isn't
usable without doing a lot of digging.

No comment on returns, either- basically, the prototype seems
overgrown, like it needs to be chunked up and simplified, plus given
actual docstrings.
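Fex, something along these lines at minimum (rough sketch; the
wording and the trimmed-down prototype are just illustration):

    def metascan(db, keys, values, negated, mysettings,
            operator="or", verbose=False):
        """find all packages matching metadata criteria.

        db       -- dbapi instance to scan (portdbapi/vardbapi)
        keys     -- sequence of metadata key names, fex ["DEPEND"]
        values   -- sequence of match values, parallel to keys
        negated  -- sequence of bools, parallel to keys; inverts
                    that individual test
        operator -- "or"/"and"; how per-key results are combined

        returns a list of matching cp strings, or cpv strings if
        verbose is set.
        """
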
General code comments, trailing comments on design...

> +    if len(keys) != len(values) or len(keys) != len(negated):
> +        raise IndexError("argument length mismatch")
> +
> +    if scanlist == None:
> +        scanlist = {}
> +        plist = db.cp_all()
> +        for p in plist:
> +            scanlist[p] = []
> +            for pv in db.cp_list(p):
> +                scanlist[p].append(pv)

This is pretty icky; you're scanning the entire tree in a single run,
building it *all* up in memory prior to even doing the actual
matching.

Fex, pass no keys and this still forces a full tree scan for nothing.
Use a generator.
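Fex (untested sketch; _iter_tree is just a name for the example):

    def _iter_tree(db):
        # walk the tree lazily, instead of snapshotting all of it
        # into a dict before any matching happens
        for cp in db.cp_all():
            for cpv in db.cp_list(cp):
                yield cp, cpv

Nothing gets pulled until the caller actually iterates.
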
72
73 > +
74 > + resultlist = []
75 > +
76 > + for p in scanlist:
77 > + if debug > 1:
78 > + sys.stderr.write("Scanning package %s\n" % p)
79 > + # check if we have a category restriction and if that's the case, skip this package if we don't have a match
80 > + if catlimit != None and catsplit(p)[0] not in catlimit:
81 > + if debug > 1:
82 > + sys.stderr.write("Skipping package %s from category %s due to category limit (%s)\n" % (p, catsplit(p)[0], str(catlimit)))
83 > + continue
84 > + for pv in scanlist[p]:
85 > + try:
86 > + result = []
87 > + # this is the slow part, also generates noise if portage cache out of date
88 > + pvalues = db.aux_get(pv, keys)
89 > +
90 > + except KeyError:
91 > + sys.stderr.write("Error while scanning %s\n" % pv)
92
No lib code should *ever* be dumping to stderr on its own for
non-fatal errors.

Yes, portage violates that, but portage code also sucks.

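If the caller needs to hear about the failure, hand it back- fex via
the warnings module (sketch of the same try/except, still inside the
cpv loop):

    import warnings

    try:
        pvalues = db.aux_get(pv, keys)
    except KeyError:
        # caller decides whether this is worth showing the user
        warnings.warn("failed scanning %s, likely stale cache" % pv)
        continue
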
> +                continue
> +
> +            # save original values for debug
> +            if debug > 1:
> +                keys_uniq = []
> +                pvalues_orig = []
> +                for i in range(0, len(keys)):
> +                    if not keys[i] in keys_uniq:
> +                        keys_uniq.append(keys[i])
> +                        pvalues_orig.append(pvalues[i])
Totally unnecessary here. Do a *single* unique pass over keys rather
than doing it 20,000+ times, once for each cpv. Same thing for the
db.aux_get call above.

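Fex, hoisted out of the loops (sketch):

    # build the unique key list a single time, up front
    keys_uniq = []
    for k in keys:
        if k not in keys_uniq:
            keys_uniq.append(k)

    # then per cpv: one aux_get over the unique keys, mapped back
    kv = dict(zip(keys_uniq, db.aux_get(pv, keys_uniq)))
    pvalues = [kv[k] for k in keys]
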
> +
> +            # remove conditional deps whose conditions aren't met
> +            if resolveDepStrings:
> +                pvalues = __metascan_resolve_dep_strings(pv, keys, pvalues, settings)
> +
> +            # we're only interested in the cat/pkg stuff from an atom here, so strip the rest
> +            if stripAtoms:
> +                pvalues = __metascan_strip_atoms(keys, pvalues, settings)
> +
> +            # report also partial matches, e.g. "foo" in "foomatic"
> +            if partial or regex:
> +                for i in range(0, len(pvalues)):

The 0 initial arg isn't needed; range(len(pvalues)) does the same.

> +                    result.append((partial and pvalues[i].find(values[i]) >= 0) \
> +                        or (regex and bool(re.match(values[i], pvalues[i]))))

Should be caching the regex compilations instead of continually
recompiling. Speedup there...

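Fex, compile the patterns once before the package loops (sketch;
values/regex/partial are the metascan args from the patch):

    # up front, outside the loops
    compiled = [re.compile(v) for v in values]

    # inside the loop, match against the precompiled objects
    result.append((partial and pvalues[i].find(values[i]) >= 0)
        or (regex and bool(compiled[i].match(pvalues[i]))))
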
> +
> +            # we're only interested in full matches in general
> +            else:
> +                result = [values[i] in pvalues[i].split() for i in range(0, len(pvalues))]
> +
> +            # some funky logic operations to invert the match if negations were requested
> +            result = [(negated[i] and not result[i]) or (not negated[i] and result[i]) for i in range(0, len(result))]
xor baby... xor.
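I.e.:

    result = [negated[i] ^ result[i] for i in range(len(result))]
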
> +
> +            # more logic stuff for conjunction or disjunction
> +            if (operator == "or" and True in result) or (operator == "and" and not False in result):
> +                if debug > 0:
> +                    sys.stderr.write("Match found: %s\n" % pv)
> +                if debug > 1:
> +                    for i in range(0, len(keys_uniq)):
> +                        sys.stderr.write("%s from %s: %s\n" % (keys_uniq[i], pv, pvalues_orig[i]))
> +                if verbose:
> +                    resultlist.append(pv)
> +                else:
> +                    if not p in resultlist:
> +                        resultlist.append(p)

No no no no no... use a set or a dict. While it's easy/obvious
code-wise, don't do linear searches like that unless you know the set
is going to be _small_ (sub 10, although timeit runs would bear out a
more accurate figure). Little stuff like this, when removed, provides
a _lot_ of speedup- little stuff like this is why people piss and moan
that python is slow, when it's usually bad algos that are at fault...

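Fex (sketch):

    resultlist = set()    # instead of []

    # later- add() already handles the dedup, via a hash lookup
    # rather than a linear scan
    if verbose:
        resultlist.add(pv)
    else:
        resultlist.add(p)
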
> +    return resultlist

This should be a generator, yielding as it goes. Unless I've missed
something obvious from above, this code doesn't have any reason to
build up a list of returns like this instead of just yielding as it
determines matches. Plus side: decreased interim mem usage, and
calling code can display to the user immediately on returns.
Con? You have to change all your appends, and your boolean logic
implementations are going to get more complex, since you'll have to
maintain a stack of potential matches.

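Rough shape of the yielding version (sketch; seen is a new name here,
standing in for the duplicate tracking):

    # up top, in place of resultlist
    seen = set()

    # in place of the appends, down in the match branch
    if verbose:
        yield pv
    elif p not in seen:
        seen.add(p)
        yield p
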
General commentary... this isn't flexible enough. Extension of the
matching code is dependent on extension of the metascan func- adding
an xor (fex) isn't viable without modifying the code.

Further, you can't do subclass matching without abusing scanlist- this
is an indication (imo) that the interface needs an overhaul.

To pull off arbitrary matching, fex
( ( category = blah or package = blah ) and ( description = dar ) )
requires 3 separate calls, reading from the resultlist and handing
it back in as scanlist. This I view as a general design failure;
what's needed is a way to arbitrarily combine restrictions/search
criteria.

This base functionality, wrapped in some serious voodoo classes,
could pull it off, although maintaining it would be ugly. You'd still
have all of the matching logic embedded in a central func, though,
which isn't extensible.

I think you're coming at this from the wrong angle- frankly, I'd
advocate the restriction subsystem design that is sitting in saviour.
Encapsulate the individual restrictions, and encapsulate the boolean
matching logic, within classes.

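Roughly the shape of it (class and attribute names here are made up
for the example- not saviour's actual api):

    class StrMatch(object):
        """does the metadata key's value contain our value?"""
        def __init__(self, key, value, negate=False):
            self.key, self.value, self.negate = key, value, negate
        def match(self, pkg):
            # assumes pkg.data is a metadata dict, for the example
            return (self.value in pkg.data.get(self.key, "").split()) \
                != self.negate

    class AndRestriction(object):
        def __init__(self, *children):
            self.children = children
        def match(self, pkg):
            for c in self.children:
                if not c.match(pkg):
                    return False
            return True

    class OrRestriction(AndRestriction):
        def match(self, pkg):
            for c in self.children:
                if c.match(pkg):
                    return True
            return False

Your example above then nests naturally, and an xor is just one more
class rather than a rewrite of metascan:

    restrict = AndRestriction(
        OrRestriction(StrMatch("CATEGORY", "blah"),
            StrMatch("PN", "blah")),
        StrMatch("DESCRIPTION", "dar"))
    matches = [pkg for pkg in all_packages if restrict.match(pkg)]

(all_packages being whatever iterable of package objects the repo
hands back.)
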
Go the route the restriction subsystem inherently allows and you've
got extensibility up the yin-yang- you can use either the base
classes provided, or provide your own nonstandard
matching/restrictions and just slap them in. The restriction
subsystem _is_ pretty damn close to the pquery language you talked of
a long while back; the only chunk missing from actually having your
pquery language is a tokenizer that converts sql/pquery-like strings
into restriction elements that can be used for matching.

Pros of the saviour restriction subsystem?
1) it covers *all* matching, atom lookups included. IOW, we don't
have two methods of searching/lookup, we have a single point to
optimize the hell out of.
2) arbitrary depths of boolean terms are built in via the design; no
dance is required of callers to filter down lists, managing the
boolean logic themselves.

Cons?
1) relies on an actual package-agnostic class- iow, whatever format
the 'package' is, it would still work with it. That said, the format
classes have to have a basic common api/set of attributes.
2) query optimization can be a bitch. You've got a tree of
restrictions; an optimizer that is capable of identifying subsets of
categories/packages, instead of just trying all nodes, would be
useful (speed things up by reducing the set of potential matches)-
rough sketch below.
3) bound to repo design, which also binds config crap per package.
This is actually a pro from a design/maintenance angle, but is a con
since it's not something easily retrofitted into stable (note I'm
reversing my stance here; I think it's doable although it's going to
require a helluva lot of work).
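For (2), the rough idea, building on the sketch classes above:

    def category_hints(restrict):
        # pull an obvious category limit out of the tree, so the
        # walk can skip whole categories instead of testing every
        # node; conservatively returns None when it can't narrow
        if isinstance(restrict, StrMatch):
            if restrict.key == "CATEGORY" and not restrict.negate:
                return set([restrict.value])
            return None
        if isinstance(restrict, AndRestriction) \
            and not isinstance(restrict, OrRestriction):
            # any child of an AND that pins the category pins the
            # whole conjunction
            for child in restrict.children:
                hints = category_hints(child)
                if hints is not None:
                    return hints
        return None
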
So... why am I whoring the restriction subsystem? Well, it's the
culmination of our talks. It *is* effectively a pquery implementation
at its core. It's what we originally talked about as being desirable,
the best route to go- hell, even the string output is in line with
your original pquery syntax (sql-like).

Matching code being pushed into stable I'm for, but I'd rather see an
api/framework pushed in that is designed for long-term (ab)use. The
patch provided is usable for your script, but any code that wanted to
do similar complex scans is forced to duplicate the boolean crap-
that's a really bad thing, something portage already suffers horribly
from.

Any matching functionality pushed into portage is going to be used by
porthole and other portage consumers- the api _does_ need to be pretty
sane (even if we may redesign everything else down the line). Complex
search query capabilities in the consumers will invariably be bound
to the underlying portage implementation, although hopefully at least
somewhat decoupled- hence my belief this particular area _needs_ to
be right from the get-go.

So... my 2 cents, which admittedly can be construed as biased since
I've already taken the plunge with the restriction subsystem.

Just think it's the proper way to go; the centralized design route
you're going here I think is going to lead to unmaintainable code the
more power that's jimmied into it.
~harring