On Tue, Dec 2, 2008 at 4:42 AM, Tambet <qtvali@×××××.com> wrote:
> About zipping.. Default settings might not really be a good idea - I think
> that "fastest" might be even better. Considering that the portage tree contains
> the same word again and again (like "applications") it needs a pretty small
> dictionary to make it much smaller. Decompressing will not be reading from
> disc, decompressing and writing back to disc as in your case probably - try
> decompression to a memory drive and you might get better numbers.

I ran gzip -d -c file.gz > /dev/null, which should not write to disk.

I tried again with gzip -1 and it still takes 29ms to decompress,
where a bare read takes 26ms. (I have a 2.6GHz X2, which is probably
relevant to gzip decompression speed.)

>
> I have personally used compression in one C++ application and with optimum
> settings, it made things much faster - those were files where I had, for
> example, 65536 16-byte integers, which could be zeros and mostly were; I
> didn't care about creating a better file format, but just compressed the
> whole thing.

I'm not saying compression won't make the index smaller. I'm saying
making the index smaller does not improve performance. If you have a
10 meg file and you make it 1 meg, you do not increase performance,
because on average the time you save reading the smaller file is paid
back decompressing it later.
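
That trade-off can be put in a back-of-the-envelope model. Both throughput figures below are assumptions for illustration only, picked to be roughly consistent with the timings quoted elsewhere in this thread:

```python
# Rough model of "read compressed + decompress" vs "read uncompressed".
# DISK_MB_S and GUNZIP_MB_S are assumed numbers, not benchmarks.

DISK_MB_S = 50.0     # assumed sequential disk read throughput
GUNZIP_MB_S = 40.0   # assumed decompression speed (MB of *output* per second)

def plain_read_ms(size_mb):
    return size_mb / DISK_MB_S * 1000

def compressed_read_ms(size_mb, ratio):
    # read the smaller file, then decompress it back to size_mb
    return (size_mb * ratio) / DISK_MB_S * 1000 + size_mb / GUNZIP_MB_S * 1000

# 10 MB file compressed 10:1 -- the read is cheaper, but decompression
# costs more than the read time saved:
print(plain_read_ms(10))            # 200.0 (ms)
print(compressed_read_ms(10, 0.1))  # 270.0 (ms): 20.0 read + 250.0 gunzip
```

With these (assumed) numbers the compressed path loses; it only wins if decompression is faster than the disk, which the measurements below suggest it isn't here.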

>
> I suggest you compress the esearch db, then decompress it to a memory drive
> and give us those numbers - might be considerably faster.

gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1) takes
35.17ms on average (6 trials, caches dropped between trials).

cat esearchdb.py > /dev/null (uncompressed) takes 24ms on average over 6 trials.
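
For anyone wanting to repeat the measurement, a minimal sketch (file names are placeholders; unlike the numbers above, this does not drop the page cache between trials - run `echo 3 > /proc/sys/vm/drop_caches` as root between runs for cold-cache figures):

```python
import gzip
import time

def avg_ms(fn, trials=6):
    """Average wall-clock milliseconds over several trials."""
    total = 0.0
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / trials * 1000

def read_plain(path):
    with open(path, "rb") as f:
        return f.read()

def read_gzipped(path):
    # gzip.open decompresses transparently on read
    with gzip.open(path, "rb") as f:
        return f.read()

# e.g.:
# print(avg_ms(lambda: read_gzipped("esearchdb.py.gz")))
# print(avg_ms(lambda: read_plain("esearchdb.py")))
```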

The point is you use compression when you need to save space (sending
data over the network, or storing large amounts of data or a lot of
something). The index isn't going to be big (if it is bigger than 20
or 30 meg I'll be surprised), the index isn't going over the network,
and there is only 1 index, not say a million indexes (where
compression might actually be useful for some kind of LRU subset of
indexes to meet disk requirements).

Anyway this is all moot since, as you stated so well earlier,
optimization comes last, so stop trying to do it now ;)

-Alec

>
> http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip support.
> Try open() from that module and a normal open() on the esearch db; also
> compress with the same lib to get the right kind of file.
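
For example, the stdlib gzip module referenced above can be used like this (the file name and content are placeholders standing in for the real esearch db; `compresslevel=1` matches the `gzip -1` tried earlier):

```python
import gzip
import os
import tempfile

# Placeholder content standing in for the esearch db.
data = b"app-editors/vim Vim, an improved vi-style text editor\n" * 1000

path = os.path.join(tempfile.mkdtemp(), "esearchdb.gz")

# Compress with the same library that will read it back.
with gzip.open(path, "wb", compresslevel=1) as f:
    f.write(data)

# gzip.open() decompresses transparently on read.
with gzip.open(path, "rb") as f:
    assert f.read() == data

# Highly repetitive text compresses well, as Tambet notes.
assert os.path.getsize(path) < len(data)
```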
>
> Anyway - maybe this compression should be added later and be optional.
>
> Tambet - technique evolves to art, art evolves to magic, magic evolves to
> just doing.
>
>
> 2008/12/2 Alec Warner <antarus@g.o>
>>
>> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
>> > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
>> >>
>> >> True, true. Like I said, I don't really use overlays, so excuse my
>> >> ignorance.
>> >
>> > Do you know the order of doing things:
>> >
>> > Rules of Optimization:
>> >
>> > Rule 1: Don't do it.
>> > Rule 2 (for experts only): Don't do it yet.
>> >
>> > What this actually means - functionality comes first. Readability comes
>> > next. Optimization comes last. Unless you are creating a fancy 3D engine
>> > for a kung fu game.
>> >
>> > If you are going to exclude overlays, you are removing functionality -
>> > and, indeed, absolutely has-to-be-there functionality, because no one
>> > would intuitively expect the search function to search only one subset
>> > of packages, however reasonable this subset would be. So, you can't,
>> > just can't, add this package into the portage base - you could write
>> > just another external search package for portage.
>> >
>> > I looked at this code a bit and:
>> > Portage's "__init__.py" contains the comment "# search functionality".
>> > After this comment, there is a nice and simple search class.
>> > It also contains the method "def action_sync(...)", which contains
>> > synchronization stuff.
>> >
>> > Now, the search class will be initialized by setting up 3 databases -
>> > porttree, bintree and vartree, whatever those are. Those will be in the
>> > self._dbs array, and porttree will be in self._portdb.
103 |
>> > |
104 |
>> > It contains some more methods: |
105 |
>> > _findname(...) will return result of self._portdb.findname(...) with |
106 |
>> > same |
107 |
>> > parameters or None if it does not exist. |
108 |
>> > Other methods will do similar things - map one or another method. |
109 |
>> > execute will do the real search... |
110 |
>> > Now - "for package in self.portdb.cp_all()" is important here ...it |
111 |
>> > currently loops over whole portage tree. All kinds of matching will be |
112 |
>> > done |
113 |
>> > inside. |
114 |
>> > self.portdb obviously points to porttree.py (unless it points to fake |
115 |
>> > tree). |
116 |
>> > cp_all will take all porttrees and do simple file search inside. This |
117 |
>> > method |
118 |
>> > should contain optional index search. |
>> >
>> > self.porttrees = [self.porttree_root] + \
>> >     [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()]
>> >
>> > So, self.porttrees contains a list of trees - the first of them is the
>> > root, the others are overlays.
>> >
>> > Now, what you have to do will not be harder just because of having
>> > overlay search, too.
>> >
>> > You have to create a method def cp_index(self), which will return a
>> > dictionary containing package names as keys. For oroot... it will be
>> > "self.porttrees[1:]", not "self.porttrees" - this will only search
>> > overlays. d = {} will be replaced with d = self.cp_index(). If the index
>> > is not there, the old version will be used (thus, you have to make an
>> > internal porttrees variable, which contains all, or all except the first).
>> >
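
The cp_index()/fallback shape described above might be sketched like this. Every name here except `porttrees` is hypothetical, and portage's real porttree db is far more involved; this only illustrates "use the index dict when present, otherwise fall back to scanning the trees":

```python
import os
import pickle

class FakePortTree:
    """Hypothetical stand-in for portage's porttree db (illustration only)."""

    def __init__(self, porttrees, index_path=None):
        self.porttrees = porttrees    # [root, overlay1, overlay2, ...]
        self.index_path = index_path  # assumed on-disk index location, or None

    def cp_index(self):
        """Return {category/package: tree_path} from the on-disk index,
        or None if no index has been built."""
        if self.index_path is None or not os.path.exists(self.index_path):
            return None
        with open(self.index_path, "rb") as f:
            return pickle.load(f)

    def cp_all(self):
        d = self.cp_index()
        if d is None:
            # Old behaviour: walk every tree on disk (root + overlays).
            d = {}
            for tree in self.porttrees:
                for cat in sorted(os.listdir(tree)):
                    catdir = os.path.join(tree, cat)
                    if not os.path.isdir(catdir):
                        continue
                    for pkg in os.listdir(catdir):
                        d[cat + "/" + pkg] = tree
        return sorted(d)
```

Restricting the scan to overlays only, as suggested, would mean iterating `self.porttrees[1:]` in the fallback branch.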
>> > Other methods used by search are xmatch and aux_get - the first used
>> > several times and the last one used to get the description. You have to
>> > cache results of those specific queries and make them use your cache -
>> > as you can see, those parts of portage are already able to use overlays.
>> > Thus, you have to put your code again at the beginning of those
>> > functions - create index_xmatch and index_aux_get methods, then make
>> > those methods use them and return their results unless those are None
>> > (or something else in case None is already a legal result) - if they
>> > return None, the old code will be run and do its job. If the index is
>> > not created, the result is None. In the index_** methods, just check if
>> > the query is what you can answer and if it is, then answer it.
>> >
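
That "answer from the index if you can, otherwise return None and let the old code run" pattern might look like this. All names here are hypothetical sketches; portage's real xmatch/aux_get signatures are richer:

```python
class IndexedSearch:
    """Illustration of the index-first/fallback dispatch described above."""

    def __init__(self, index=None):
        # index maps (query_level, atom) -> cached result; None = no index built
        self._index = index

    def index_xmatch(self, level, atom):
        """Answer from the index, or None if this query can't be answered."""
        if self._index is None:
            return None
        return self._index.get((level, atom))

    def xmatch(self, level, atom):
        result = self.index_xmatch(level, atom)
        if result is not None:
            return result
        return self._old_xmatch(level, atom)  # old code path does its job

    def _old_xmatch(self, level, atom):
        # Stand-in for portage's real (slow, tree-walking) implementation.
        return [atom + "-1.0"]
```

Note the caveat from the text: if None can ever be a legal cached result, a distinct sentinel is needed instead of None to signal "not in index".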
>> > Obviously, the simplest way to create your index is to delete the index,
>> > then use those same methods to query for all necessary information - and
>> > the fastest way would be to add updating of the index directly into
>> > sync, which you could do later.
>> >
>> > Please, also, make commands to turn the index on and off (the latter
>> > should also delete it to save disk space). The default should be off
>> > until it's fast, small and reliable. Also notice that if the index is
>> > kept on a hard drive, it might be faster if it's compressed (gz, for
>> > example) - decompressing takes less time and more processing power than
>> > reading it fully out.
>>
>> I'm pretty sure you're mistaken here, unless your index is stored on a
>> floppy or something really slow.
>>
>> A disk read has 2 primary costs.
>>
>> Seek Time: Time for the head to seek to the sector of disk you want.
>> Spin Time: Time for the platter to spin around such that the sector
>> you want is under the read head.
>>
>> Spin Time is based on rpm: for a 7200 rpm drive, 7200 / 60 seconds = 120
>> rotations per second, so worst case (you just passed the sector you
>> need) you wait 1/120th of a second (about 8ms).
>>
>> Seek Time is per hard drive, but most drives provide average seek
>> times under 10ms.
>>
>> So it takes on average 18ms to get to your data, then you start
>> reading. The index will not be that large (my esearchdb is 2 megs,
>> but let's assume 10MB for this compressed index).
>>
>> I took a 10MB sqlite database and compressed it with gzip (default
>> settings) down to 5 megs.
>> gzip -d on the database takes 300ms; catting the decompressed database
>> takes 88ms (average of 5 runs, disk caches dropped between runs).
>>
>> I then tried my vdb_metadata.pickle from
>> /var/cache/edb/vdb_metadata.pickle
>>
>> 1.3 megs compresses to 390k.
>>
>> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
>> from disk.
>>
>> Your index would have to be very large or very fragmented on disk
>> (requiring more than one seek) to see a significant gain from
>> compression (gzip scales linearly).
>>
>> In short, don't compress the index ;p
>>
>> >
>> > Have luck!
>> >
>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> Hash: SHA1
>> >>>
>> >>> Emma Strubell wrote:
>> >>> > 2) does anyone really need to search an overlay anyway?
>> >>>
>> >>> Of course. Take large (semi-)official overlays like sunrise. They can
>> >>> easily be seen as a second portage tree.
>> >>> -----BEGIN PGP SIGNATURE-----
>> >>> Version: GnuPG v2.0.9 (GNU/Linux)
>> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>> >>>
>> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
>> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
>> >>> =+lCO
>> >>> -----END PGP SIGNATURE-----
>> >>>
>> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
>> >> wrote:
>> >>
>> >
>> >
>
>