Gentoo Archives: gentoo-portage-dev

From: Tambet <qtvali@×××××.com>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] Re: search functionality in emerge
Date: Tue, 02 Dec 2008 13:51:59
Message-Id: cea53e3c0812020551k135b0c22nbfe0e4aced586c38@mail.gmail.com
In Reply to: Re: [gentoo-portage-dev] Re: search functionality in emerge by Tambet
Btw, one way of handling index updates would be:

emerge --sync:

- Deletes the index if it exists

emerge --searchindexon

- Turns the index on

emerge --searchindexoff

- Turns the index off
- Deletes the index if it exists

emerge -s or emerge -S

- Does a normal search if the index is off
- If the index is on and does not exist, creates the index on the fly
- If the index is on and does exist, uses it

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.

2008/12/2 Tambet <qtvali@×××××.com>

> About zipping... The default settings might not really be a good idea - I
> think "fastest" might be even better. Considering that the portage tree
> contains the same words again and again (like "applications"), it needs a
> pretty small dictionary to compress well. Decompressing will not mean
> reading from disk, decompressing, and writing back to disk, as it probably
> did in your case - try decompressing to a memory drive and you might get
> better numbers.
>
> I have personally used compression in one C++ application, and with
> optimum settings it made things much faster - those were files where I
> had, for example, 65536 16-byte integers, which could be zeros and mostly
> were; I didn't care about creating a better file format, but just
> compressed the whole thing.
>
> I suggest you compress the esearch db, then decompress it to a memory
> drive and give us those numbers - it might be considerably faster.
>
> http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> support. Try open() from that module and a normal open() on the esearch
> db; also compress with the same lib to get the right kind of file.
>
> Anyway - maybe this compression should be added later and be optional.
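The gzip approach suggested above is only a few lines with Python's gzip module. A minimal sketch (the sample file is a stand-in for the real esearch db; the repeated words mimic the portage tree's repetitive text, which is what makes the dictionary small):

```python
import gzip
import os
import tempfile

# Stand-in data for the esearch db.
sample = b"app-editors/vim applications applications applications\n" * 1000

workdir = tempfile.mkdtemp()
db = os.path.join(workdir, "esearchdb.sample")

with open(db, "wb") as f:
    f.write(sample)

# Compress with the same lib that will read it back; compresslevel=1 is "fastest".
with open(db, "rb") as src:
    with gzip.open(db + ".gz", "wb", compresslevel=1) as dst:
        dst.write(src.read())

# gzip.open returns a file-like object, so existing reading code keeps its shape.
with gzip.open(db + ".gz", "rb") as f:
    assert f.read() == sample
```

Because gzip.open hands back a file-like object, swapping it in for a normal open() is the only change the reading side needs.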
>
> Tambet - technique evolves to art, art evolves to magic, magic evolves to
> just doing.
>
>
> 2008/12/2 Alec Warner <antarus@g.o>
>
> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
>> > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
>> >>
>> >> True, true. Like I said, I don't really use overlays, so excuse my
>> >> ignorance.
>> >
>> > Do you know the order of doing things?
>> >
>> > Rules of Optimization:
>> >
>> > Rule 1: Don't do it.
>> > Rule 2 (for experts only): Don't do it yet.
>> >
>> > What this actually means: functionality comes first. Readability comes
>> > next. Optimization comes last. Unless you are creating a fancy 3D
>> > engine for a kung fu game.
>> >
>> > If you are going to exclude overlays, you are removing functionality -
>> > and, indeed, absolutely has-to-be-there functionality, because no one
>> > would intuitively expect the search function to search only one subset
>> > of packages, however reasonable this subset would be. So you can't,
>> > just can't, put this into the portage base - you could write just
>> > another external search package for portage.
>> >
>> > I looked at this code a bit:
>> > Portage's "__init__.py" contains the comment "# search functionality".
>> > After this comment, there is a nice and simple search class.
>> > It also contains the method "def action_sync(...)", which contains the
>> > synchronization stuff.
>> >
>> > Now, the search class is initialized by setting up 3 databases -
>> > porttree, bintree and vartree. Those end up in the self._dbs array,
>> > and porttree also in self._portdb.
>> >
>> > It contains some more methods:
>> > _findname(...) returns the result of self._portdb.findname(...) with
>> > the same parameters, or None if it does not exist.
>> > The other methods do similar things - each maps to one method or
>> > another.
>> > execute does the real search...
>> > Now - "for package in self.portdb.cp_all()" is important here... it
>> > currently loops over the whole portage tree. All kinds of matching are
>> > done inside.
>> > self.portdb obviously points to porttree.py (unless it points to a
>> > fake tree).
>> > cp_all takes all porttrees and does a simple file search inside. This
>> > method should contain the optional index search.
>> >
>> > self.porttrees = [self.porttree_root] + \
>> >     [os.path.realpath(t) for t in
>> >      self.mysettings["PORTDIR_OVERLAY"].split()]
>> >
>> > So, self.porttrees contains the list of trees - the first of them is
>> > the root, the others are overlays.
>> >
>> > Now, what you have to do will not be any harder just because you
>> > handle overlay search, too.
>> >
>> > You have to create a method def cp_index(self), which returns a
>> > dictionary containing package names as keys. For oroot... it will be
>> > "self.porttrees[1:]", not "self.porttrees" - this will only search
>> > overlays. d = {} will be replaced with d = self.cp_index(). If the
>> > index is not there, the old version will be used (thus, you have to
>> > make an internal porttrees variable, which contains all trees, or all
>> > except the first).
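The cp_index idea above could be sketched as two standalone functions (the function names, the pickle format, and the index path are all hypothetical; the real code would live as methods on the porttree db class):

```python
import os
import pickle

def build_cp_index(porttrees, index_path):
    """Walk every tree root (overlays would be porttrees[1:]) and record
    each category/package directory in one flat dict keyed "category/pkg"."""
    d = {}
    for root in porttrees:
        for category in sorted(os.listdir(root)):
            catdir = os.path.join(root, category)
            if not os.path.isdir(catdir):
                continue
            for pkg in sorted(os.listdir(catdir)):
                if os.path.isdir(os.path.join(catdir, pkg)):
                    d.setdefault(category + "/" + pkg, []).append(root)
    with open(index_path, "wb") as f:
        pickle.dump(d, f)
    return d

def load_cp_index(index_path):
    """Return the index dict, or None when the index file is absent so the
    caller can fall back to the old directory walk."""
    if not os.path.exists(index_path):
        return None
    with open(index_path, "rb") as f:
        return pickle.load(f)
```

Returning None on a missing index matches the fallback behavior described above: cp_all would call load_cp_index first and only do the file walk when it gets None back.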
>> >
>> > The other methods used by search are xmatch and aux_get - the first is
>> > used several times and the last is used to get the description. You
>> > have to cache the results of those specific queries and make them use
>> > your cache - as you can see, those parts of portage are already able
>> > to use overlays. Thus, you have to put your code at the beginning of
>> > those functions - create index_xmatch and index_aux_get methods, then
>> > make the original methods use them and return their results unless
>> > those are None (or some other sentinel, in case None is already a
>> > legal result) - if they return None, the old code runs and does its
>> > job. If the index has not been created, the result is None. In the
>> > index_* methods, just check whether the query is one you can answer,
>> > and if it is, answer it.
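The index-first-with-fallback pattern described above can be expressed as a small wrapper (make_indexed is a made-up name; the real change would edit xmatch and aux_get in place):

```python
def make_indexed(original, index_lookup):
    """Wrap a portage query method (e.g. xmatch or aux_get): ask the index
    first, and fall back to the original method when the index returns None
    (use another sentinel if None is a legal result for that query)."""
    def wrapped(*args):
        result = index_lookup(*args)
        if result is None:
            # Index missing, or query not answerable from it: run the old code.
            result = original(*args)
        return result
    return wrapped
```

Hypothetical wiring would then look like `portdb.xmatch = make_indexed(portdb.xmatch, index_xmatch)`, leaving the old code path untouched for queries the index cannot answer.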
>> >
>> > Obviously, the simplest way to create your index is to delete the old
>> > index, then use those same methods to query all the necessary
>> > information - and the fastest way would be to update the index
>> > directly during sync, which you could add later.
>> >
>> > Please also add commands to turn the index on and off (the latter
>> > should also delete it to save disk space). The default should be off
>> > until it's fast, small and reliable. Also notice that if the index is
>> > kept on the hard drive, it might be faster if it's compressed (gz,
>> > for example) - decompressing takes less time and more processing
>> > power than reading it out in full.
>>
>> I'm pretty sure you're mistaken here, unless your index is stored on a
>> floppy or something really slow.
>>
>> A disk read has 2 primary costs.
>>
>> Seek Time: Time for the head to seek to the sector of disk you want.
>> Spin Time: Time for the platter to spin around such that the sector
>> you want is under the read head.
>>
>> Spin Time is based on rpm, so on average 7200 rpm / 60 seconds = 120
>> rotations per second; worst case (you just passed the sector you
>> need) you wait a full rotation, 1/120th of a second (about 8ms).
>>
>> Seek Time varies per hard drive, but most drives quote average seek
>> times under 10ms.
>>
>> So it takes on average 18ms to get to your data, then you start
>> reading. The index will not be that large (my esearchdb is 2 megs,
>> but let's assume 10MB for this compressed index).
>>
>> I took a 10MB sqlite database and compressed it with gzip (default
>> settings) down to 5 megs.
>> gzip -d on the database takes 300ms; catting the decompressed
>> database takes 88ms (average of 5 runs, dropping disk caches between
>> runs).
>>
>> I then tried my vdb_metadata.pickle from
>> /var/cache/edb/vdb_metadata.pickle
>>
>> 1.3 megs compresses to 390k.
>>
>> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
>> from disk.
>>
>> Your index would have to be very large or very fragmented on disk
>> (requiring more than one seek) to see a significant gain from file
>> compression (gzip scales linearly).
>>
>> In short, don't compress the index ;p
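Alec's comparison can be reproduced with a few lines of Python (timings vary by machine, and unlike his test this does not drop the kernel page cache, so it understates the cost of the plain read):

```python
import gzip
import os
import time

def time_reads(path):
    """Time a plain read of `path` against decompressing a gzipped copy
    of it; returns (plain_seconds, gunzip_seconds)."""
    with open(path, "rb") as f:
        data = f.read()
    with gzip.open(path + ".gz", "wb") as f:
        f.write(data)

    t0 = time.time()
    with open(path, "rb") as f:
        f.read()
    plain = time.time() - t0

    t0 = time.time()
    with gzip.open(path + ".gz", "rb") as f:
        f.read()
    gunzip = time.time() - t0
    return plain, gunzip
```

To match the original methodology, run each measurement several times and drop the disk caches between runs (e.g. via /proc/sys/vm/drop_caches as root).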
>>
>> >
>> > Good luck!
>> >
>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> Hash: SHA1
>> >>>
>> >>> Emma Strubell schrieb:
>> >>> > 2) does anyone really need to search an overlay anyway?
>> >>>
>> >>> Of course. Take large (semi-)official overlays like sunrise. They can
>> >>> easily be seen as a second portage tree.
>> >>> -----BEGIN PGP SIGNATURE-----
>> >>> Version: GnuPG v2.0.9 (GNU/Linux)
>> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>> >>>
>> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
>> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
>> >>> =+lCO
>> >>> -----END PGP SIGNATURE-----
>> >>>
>> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
>> >> wrote:
>> >>
>> >
>> >
>>
>
>