It might be that your hard drive is not much slower than memory - but I
really doubt that. Or it could mean that reading the gzip binary from disk is
much slower than reading cat, which is highly probable given gzip's file
size.

Actually, simple logic says that reading a compressed file and decompressing
it can be faster than just loading the uncompressed original. What I
personally used was *much* faster than going without compression, but that
was a C++ application that kept the zip in memory at all times, and the data
was stored highly inefficiently to begin with.

To see what I mean, try this test: create a file and write the character "a"
into it about 10,000,000 times to make it that big, then run the same
comparison on that file. I think it's likely that the resulting file will be
really fast to decompress.
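A minimal sketch of that experiment (file names are arbitrary; exact timings will vary with the machine and with whether the data is already in the disk cache):

```python
import gzip
import os
import time

# Create a highly redundant ~10 MB file: the character "a" repeated 10,000,000 times.
with open("test_a.txt", "wb") as f:
    f.write(b"a" * 10_000_000)

# Compress it; a run of identical bytes compresses extremely well.
with open("test_a.txt", "rb") as src, gzip.open("test_a.txt.gz", "wb") as dst:
    dst.write(src.read())

print("plain size:     ", os.path.getsize("test_a.txt"))
print("compressed size:", os.path.getsize("test_a.txt.gz"))

# Time a plain read versus decompressing the gzipped copy.
start = time.time()
with open("test_a.txt", "rb") as f:
    plain = f.read()
print("plain read: %.1f ms" % ((time.time() - start) * 1000))

start = time.time()
with gzip.open("test_a.txt.gz", "rb") as f:
    unzipped = f.read()
print("decompress: %.1f ms" % ((time.time() - start) * 1000))

assert plain == unzipped
```

On pathological input like this, the compressed file is a tiny fraction of the original, which is the most favorable possible case for the decompress-instead-of-read argument.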

Anyway, you have still made me think that, at least at first, no compression
should be used :) - if only because your tests pulled in several new
variables, like the time to read the decompression utility itself from disk.

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.


2008/12/2 Alec Warner <antarus@g.o>

> On Tue, Dec 2, 2008 at 4:42 AM, Tambet <qtvali@×××××.com> wrote:
> > About zipping.. Default settings might not really be a good idea - I
> > think that "fastest" might be even better. Considering that the portage
> > tree contains the same words again and again (like "applications"), it
> > needs a pretty small dictionary to make it much smaller. Decompressing
> > will not be reading from disc, decompressing and writing back to disc as
> > in your case, probably - try decompression to a memory drive and you
> > might get better numbers.
>
> I ran gzip -d -c file.gz > /dev/null, which should not write to disk.
>
> I tried again with gzip -1 and it still takes 29ms to decompress (even
> with gzip -1) where a bare read takes 26ms. (I have a 2.6GHz X2, which
> is probably relevant to gzip decompression speed.)
>
> >
> > I have personally used compression in one C++ application, and with
> > optimum settings it made things much faster - those were files where I
> > had, for example, 65536 16-byte integers, which could be zeros and mostly
> > were; I didn't care about creating a better file format, but just
> > compressed the whole thing.
>
> I'm not saying compression won't make the index smaller. I'm saying
> making the index smaller does not improve performance. If you have a
> 10 meg file and you make it 1 meg, you do not increase performance,
> because (on average) you are not saving enough time reading the
> smaller file; you pay it back decompressing the smaller file
> later.
>
> >
> > I suggest you compress the esearch db, then decompress it to a memory
> > drive and give us those numbers - it might be considerably faster.
>
> gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1) takes
> on average (6 trials, dropped caches between trials) 35.1666ms.
>
> cat esearchdb.py > /dev/null (uncompressed) takes an average, over 6
> trials, of 24ms.
>
> The point is that you use compression when you need to save space (sending
> data over the network, or storing large amounts of data or a lot of
> something). The index isn't going to be big (if it is bigger than 20
> or 30 meg I'll be surprised), the index isn't going over the network,
> and there is only 1 index, not say a million indexes (where
> compression might actually be useful for some kind of LRU subset of
> indexes to meet disk requirements).
>
> Anyway this is all moot since, as you stated so well earlier,
> optimization comes last, so stop trying to do it now ;)
>
> -Alec
>
> >
> > http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> > support. Try an open with that and a normal open on the esearch db; also
> > compress with the same lib to get the right kind of file.
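The comparison suggested above can be sketched with the standard-library gzip module. A small sample file stands in for the esearch db here, and it is compressed with the same lib so gzip.open() gets "the right kind of file":

```python
import gzip

# Sample data with lots of repetition, like a package index would have.
data = b"app-editors/vim A highly configurable text editor\n" * 1000

# Write it both uncompressed and compressed with the same library.
with open("db.txt", "wb") as f:
    f.write(data)
with gzip.open("db.txt.gz", "wb") as f:
    f.write(data)

# Normal open versus gzip open - the calling code reads bytes either way,
# so swapping one for the other is transparent to the reader.
with open("db.txt", "rb") as f:
    plain = f.read()
with gzip.open("db.txt.gz", "rb") as f:
    unzipped = f.read()

assert plain == unzipped
```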
> >
> > Anyway - maybe this compression should be added later, and be optional.
> >
> > Tambet - technique evolves to art, art evolves to magic, magic evolves to
> > just doing.
> >
> >
> > 2008/12/2 Alec Warner <antarus@g.o>
> >>
> >> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
> >> > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
> >> >>
> >> >> True, true. Like I said, I don't really use overlays, so excuse my
> >> >> ignorance.
> >> >
> >> > Do you know the order of doing things?
> >> >
> >> > Rules of Optimization:
> >> >
> >> > Rule 1: Don't do it.
> >> > Rule 2 (for experts only): Don't do it yet.
> >> >
> >> > What this actually means: functionality comes first. Readability
> >> > comes next. Optimization comes last. Unless you are creating a fancy
> >> > 3D engine for a kung fu game.
> >> >
> >> > If you are going to exclude overlays, you are removing functionality -
> >> > and, indeed, absolutely has-to-be-there functionality, because no one
> >> > would intuitively expect the search function to search only one
> >> > subset of packages, however reasonable this subset might be. So you
> >> > can't, just can't, add this package into the portage base - you could
> >> > write just another external search package for portage.
> >> >
> >> > I looked at this code a bit:
> >> > Portage's "__init__.py" contains the comment "# search functionality".
> >> > After this comment, there is a nice and simple search class.
> >> > It also contains the method "def action_sync(...)", which contains
> >> > synchronization stuff.
> >> >
> >> > Now, the search class is initialized by setting up 3 databases -
> >> > porttree, bintree and vartree, whatever those are. Those will be in
> >> > the self._dbs array, and porttree will be in self._portdb.
> >> >
> >> > It contains some more methods:
> >> > _findname(...) will return the result of self._portdb.findname(...)
> >> > with the same parameters, or None if it does not exist.
> >> > Other methods do similar things - each maps to one method or another.
> >> > execute will do the real search...
> >> > Now - "for package in self.portdb.cp_all()" is important here ...it
> >> > currently loops over the whole portage tree. All kinds of matching
> >> > will be done inside.
> >> > self.portdb obviously points to porttree.py (unless it points to a
> >> > fake tree).
> >> > cp_all will take all porttrees and do a simple file search inside.
> >> > This method should contain an optional index search.
> >> >
> >> > self.porttrees = [self.porttree_root] + \
> >> >     [os.path.realpath(t) for t in
> >> >      self.mysettings["PORTDIR_OVERLAY"].split()]
> >> >
> >> > So, self.porttrees contains a list of trees - the first of them is
> >> > the root, the others are overlays.
> >> >
> >> > Now, what you have to do will not be any harder just because it
> >> > includes overlay search, too.
> >> >
> >> > You have to create a method def cp_index(self), which will return a
> >> > dictionary containing package names as keys. For oroot... it will be
> >> > "self.porttrees[1:]", not "self.porttrees" - this will search only
> >> > the overlays. d = {} will be replaced with d = self.cp_index(). If
> >> > the index is not there, the old version will be used (thus, you have
> >> > to make an internal porttrees variable, which contains all, or all
> >> > except the first).
> >> >
> >> > The other methods used by search are xmatch and aux_get - the first
> >> > is used several times and the last one is used to get the
> >> > description. You have to cache the results of those specific queries
> >> > and make them use your cache - as you can see, those parts of portage
> >> > are already able to use overlays. Thus, you have to put your code
> >> > again at the beginning of those functions - create index_xmatch and
> >> > index_aux_get methods, then make xmatch and aux_get use them and
> >> > return their results unless those are None (or something else, in
> >> > case None is already a legal result) - if they return None, the old
> >> > code will run and do its job.
> >> > If the index is not created, the result is None. In the index_**
> >> > methods, just check whether the query is one you can answer, and if
> >> > it is, answer it.
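The fall-through pattern described above could look roughly like this. This is a sketch only - the class name, index layout and the `_slow_xmatch` stub are illustrative, not actual portage code; only the xmatch/index_xmatch naming comes from the suggestion itself:

```python
class IndexedPortdb:
    """Sketch of the fall-through idea: answer from the index when
    possible, otherwise return None so the original slow path runs."""

    def __init__(self, index=None):
        # index maps "xmatch" -> {(level, atom): result}; None means
        # "no index has been built", so every query falls through.
        self.index = index

    def index_xmatch(self, level, atom):
        # Answer only queries the index can handle; None falls through.
        if self.index is None:
            return None
        return self.index.get("xmatch", {}).get((level, atom))

    def xmatch(self, level, atom):
        cached = self.index_xmatch(level, atom)
        if cached is not None:
            return cached
        # The original, slower lookup runs only on a cache miss.
        return self._slow_xmatch(level, atom)

    def _slow_xmatch(self, level, atom):
        # Placeholder for the real tree walk.
        return ["%s-1.0" % atom]


# Hypothetical usage: one query served from the index, one falling through.
db = IndexedPortdb({"xmatch": {("match-all", "app-editors/vim"):
                               ["app-editors/vim-7.2"]}})
print(db.xmatch("match-all", "app-editors/vim"))
print(db.xmatch("match-all", "dev-lang/python"))
```

index_aux_get would follow the same shape with descriptions as values. Note the caveat from the text: if None is a legal result for some query, a different sentinel is needed to distinguish "not indexed" from "indexed as None".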
> >> >
> >> > Obviously, the simplest way to create your index is to delete the
> >> > index, then use those same methods to query for all necessary
> >> > information - and the fastest way would be to add updating of the
> >> > index directly into sync, which you could do later.
> >> >
> >> > Please also make commands to turn the index on and off (the latter
> >> > should also delete it, to save disk space). The default should be off
> >> > until it's fast, small and reliable. Also notice that if the index is
> >> > kept on a hard drive, it might be faster if it's compressed (gz, for
> >> > example) - decompressing takes less time and more processing power
> >> > than reading it fully out.
> >>
> >> I'm pretty sure you're mistaken here, unless your index is stored on a
> >> floppy or something really slow.
> >>
> >> A disk read has 2 primary costs.
> >>
> >> Seek Time: Time for the head to seek to the sector of disk you want.
> >> Spin Time: Time for the platter to spin around such that the sector
> >> you want is under the read head.
> >>
> >> Spin Time is based on rpm, so 7200 rpm / 60 seconds = 120
> >> rotations per second, so worst case (you just passed the sector you
> >> need) you have to wait 1/120th of a second (about 8ms).
> >>
> >> Seek Time is per hard drive, but most drives provide average seek
> >> times under 10ms.
> >>
> >> So it takes on average 18ms to get to your data, then you start
> >> reading. The index will not be that large (my esearchdb is 2 megs,
> >> but let's assume 10MB for this compressed index).
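The arithmetic above can be checked directly (the 7200 rpm and 10 ms seek figures are the rough averages Alec assumes, not measurements):

```python
rpm = 7200
rotations_per_second = rpm / 60                # 120 rotations per second
worst_case_spin_ms = 1000 / rotations_per_second  # full rotation: ~8.3 ms
avg_seek_ms = 10                               # typical quoted average seek

# Worst-case rotational delay plus an average seek, per disk access.
access_ms = worst_case_spin_ms + avg_seek_ms
print("worst-case rotational delay: %.1f ms" % worst_case_spin_ms)
print("approximate access cost:     %.1f ms" % access_ms)
```

This fixed per-access cost is why compression only pays off once the saved transfer time exceeds the decompression time - the ~18 ms to reach the data is spent either way.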
> >>
> >> I took a 10MB sqlite database and compressed it with gzip (default
> >> settings) down to 5 megs.
> >> gzip -d on the database takes 300ms; catting the decompressed database
> >> takes 88ms (average of 5 runs, dropping disk caches between runs).
> >>
> >> I then tried my vdb_metadata.pickle from
> >> /var/cache/edb/vdb_metadata.pickle
> >>
> >> 1.3 megs compresses to 390k.
> >>
> >> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
> >> from disk.
> >>
> >> Your index would have to be very large or very fragmented on disk
> >> (requiring more than one seek) to see a significant gain from file
> >> compression (gzip scales linearly).
> >>
> >> In short, don't compress the index ;p
> >>
> >> >
> >> > Good luck!
> >> >
> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> Hash: SHA1
> >> >>>
> >> >>> Emma Strubell wrote:
> >> >>> > 2) does anyone really need to search an overlay anyway?
> >> >>>
> >> >>> Of course. Take large (semi-)official overlays like sunrise. They
> >> >>> can easily be seen as a second portage tree.
> >> >>> -----BEGIN PGP SIGNATURE-----
> >> >>> Version: GnuPG v2.0.9 (GNU/Linux)
> >> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >> >>>
> >> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
> >> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
> >> >>> =+lCO
> >> >>> -----END PGP SIGNATURE-----
> >> >>>
> >> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
> >> >> wrote:
> >> >>
> >> >
> >> >
> >
> >
>