Gentoo Archives: gentoo-portage-dev

From: Tambet <qtvali@×××××.com>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] Re: search functionality in emerge
Date: Tue, 02 Dec 2008 17:56:08
Message-Id: cea53e3c0812020942n2b8da1d4v370e9b4f8cf7b1a5@mail.gmail.com
In Reply to: Re: [gentoo-portage-dev] Re: search functionality in emerge by Alec Warner
To remove this fuzz, some USE flags might help - for example, a flag that
makes Portage use SQLite. Its database could then be used for indexing, and
Portage searches could be implemented as SQL queries. SQLite does what it
can to make them fast, and the disk cache keeps the data in memory across
simultaneous searches.
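A minimal sketch of what such an SQLite-backed search could look like (the
table layout and names here are invented for illustration, not Portage's
actual schema):

```python
import sqlite3

# Hypothetical index schema; Portage does not ship this table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (cp TEXT PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [
        ("app-portage/eix", "Search and query ebuilds"),
        ("sys-apps/portage", "Portage package manager"),
    ],
)

def search(term):
    """Let SQLite do the matching instead of looping over the whole tree."""
    cur = conn.execute(
        "SELECT cp FROM packages WHERE cp LIKE ? OR description LIKE ?",
        ("%" + term + "%", "%" + term + "%"),
    )
    return [row[0] for row in cur.fetchall()]

print(search("eix"))
```

SQLite then decides on its own how to execute the query, and repeated
searches hit the page cache rather than the disk.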


Just some thoughts on compression...

http://en.wikipedia.org/wiki/Hard_disk 1.6GBit/s
http://www.dewassoc.com/performance/memory/memory_speeds.htm 1.6GB/s
So the *fastest* hard drives are about 8 times slower than the *fastest*
memory devices.

Anyway, we live in the real world...
http://www.amazon.com/Maxtor-L01P200-Internal-7200-Drive/dp/B00007KVHZ up to
133MB/s
For comparison, the peak bandwidth of PC-266 DDR is 266 MHz x 8 bytes = 2.1GB/s

Now, be fair - if you can read 133MB/s from a hard drive, a 2MB file takes
about 15ms, ignoring seek times and fragmentation. In that same 15ms, you can
read/write about 31.5MB of data in memory, doing different operations on it.
If your compression algorithm has a dictionary of repeated words that occur
often enough to make the actual file 0.5MB smaller, you win about 3.8ms and
lose about 1ms handling it. Processor usage might be somewhat higher, but you
really don't care about processor usage when searching packages, and it's not
*that* big.
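Checking the arithmetic at the quoted throughput figures:

```python
DISK_MBPS = 133.0      # quoted Maxtor throughput
MEM_MBPS = 2100.0      # quoted PC-266 DDR peak bandwidth

def read_time_ms(size_mb, mbps):
    """Transfer time in milliseconds, ignoring seeks."""
    return size_mb / mbps * 1000

two_mb = read_time_ms(2.0, DISK_MBPS)        # ~15 ms for a 2MB file
mem_in_that_time = MEM_MBPS * two_mb / 1000  # ~31.5MB moved in memory meanwhile
saved = read_time_ms(0.5, DISK_MBPS)         # ~3.8 ms saved per 0.5MB shaved off

print(round(two_mb, 1), round(mem_in_that_time, 1), round(saved, 1))
```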

What it really buys us is the ability to make our file format highly
inefficient in terms of disk space - for example, what I mentioned before
was 65536 64-bit integers (I made a mistake earlier saying they were 16
bytes; they were 8 bytes long). Most of them started with at least 32 zero
bits, which were wiped out by the compression lib, but which I needed for
fast operations. It doesn't matter much how the data is stored, as long as
there are repeated patterns, if you want to make it fast.
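The effect is easy to reproduce: 64-bit integers whose high 32 bits are zero
compress very well (a sketch with struct and zlib; the values are made up):

```python
import struct
import zlib

# 65536 64-bit integers, all small enough that the high 32 bits are zero.
values = [i % 1000 for i in range(65536)]
raw = struct.pack("<65536Q", *values)   # 8 bytes each, 512KiB total

packed = zlib.compress(raw)
print(len(raw), len(packed))            # the zero-heavy layout shrinks a lot
```

The wide fixed-size layout keeps in-memory operations simple and fast, while
the compressor strips out the redundancy on disk.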

I really don't know whether pickle has a good file format or not. For common
pickle files, I think the gain from compression is noticeable. Anyway, using
a database, for example, would be a clever idea, and making keyword search a
pure SQL query makes it clearly better - though it removes the chance to
compress.
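For a typical pickle full of repetitive package metadata, the gain is easy to
measure (a sketch; the dictionary below just imitates metadata-like
redundancy, it is not a real Portage cache):

```python
import gzip
import pickle

# Fake metadata with the kind of repetition real package caches have.
metadata = {
    "app-misc/pkg-%d" % i: {"DESCRIPTION": "A package", "SLOT": "0",
                            "KEYWORDS": "amd64 x86"}
    for i in range(2000)
}

raw = pickle.dumps(metadata)
packed = gzip.compress(raw)
print(len(raw), len(packed))   # compression is noticeable on pickles like this
```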

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.


2008/12/2 Alec Warner <antarus@g.o>

> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
> > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
> >>
> >> True, true. Like I said, I don't really use overlays, so excuse my
> >> ignorance.
> >
> > Do you know the order of doing things?
> >
> > Rules of Optimization:
> >
> > Rule 1: Don't do it.
> > Rule 2 (for experts only): Don't do it yet.
> >
> > What this actually means: functionality comes first. Readability comes
> > next. Optimization comes last. Unless you are creating a fancy 3D engine
> > for a kung fu game.
> >
> > If you are going to exclude overlays, you are removing functionality -
> > and, indeed, absolutely has-to-be-there functionality, because no one
> > would intuitively expect the search function to search only a subset of
> > packages, however reasonable that subset might be. So you can't, just
> > can't, put this into the Portage base - you could write just another
> > external search package for Portage.
> >
> > I looked at this code a bit:
> > Portage's "__init__.py" contains the comment "# search functionality".
> > After this comment, there is a nice and simple search class.
> > It also contains the method "def action_sync(...)", which contains
> > synchronization stuff.
> >
> > Now, the search class is initialized with 3 databases - porttree,
> > bintree and vartree, whatever those are. Those end up in the self._dbs
> > array, and porttree also in self._portdb.
> >
> > It contains some more methods:
> > _findname(...) returns the result of self._portdb.findname(...) with
> > the same parameters, or None if it does not exist.
> > The other methods do similar things - each maps to one method or another.
> > execute does the real search...
> > Now, "for package in self.portdb.cp_all()" is important here ... it
> > currently loops over the whole Portage tree. All kinds of matching are
> > done inside.
> > self.portdb obviously points to porttree.py (unless it points to a fake
> > tree).
> > cp_all takes all porttrees and does a simple file search inside. This
> > method should contain the optional index search.
> >
> > self.porttrees = [self.porttree_root] + \
> >     [os.path.realpath(t) for t in
> >      self.mysettings["PORTDIR_OVERLAY"].split()]
> >
> > So, self.porttrees contains the list of trees - the first of them is the
> > root, the others are overlays.
> >
> > Now, what you have to do will not be any harder just because overlays
> > are searched too.
> >
> > You have to create a method def cp_index(self), which returns a
> > dictionary with package names as keys. For oroot... it will be
> > "self.porttrees[1:]", not "self.porttrees" - this searches only the
> > overlays. d = {} will be replaced with d = self.cp_index(). If the
> > index is not there, the old version is used (thus, you have to make an
> > internal porttrees variable, which contains all trees, or all except
> > the first).
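The cp_index idea above can be sketched roughly as follows - this is not
actual Portage code, and everything except the cp_index/porttrees names is
invented for illustration:

```python
class PortageTreeSearch:
    def __init__(self, porttrees, index=None):
        self.porttrees = porttrees  # [root] + overlays
        self._index = index         # {category/package: tree_path} or None

    def cp_all_slow(self):
        """Stand-in for the real file-system walk over every tree."""
        return {"app-misc/demo": self.porttrees[0]}

    def cp_index(self):
        """Return the package-name dictionary, falling back to the old path."""
        if self._index is not None:
            return self._index      # fast path: prebuilt index
        return self.cp_all_slow()   # old behavior when the index is off

search = PortageTreeSearch(["/usr/portage", "/var/overlay"],
                           index={"app-misc/demo": "/var/overlay"})
print(search.cp_index())
```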
> >
> > The other methods used by search are xmatch and aux_get - the first is
> > used several times, and the last one is used to get descriptions. You
> > have to cache the results of those specific queries and make them use
> > your cache - as you can see, those parts of Portage are already able to
> > use overlays. Thus, you again put your code at the beginning of those
> > functions - create index_xmatch and index_aux_get methods, then make
> > the original methods use them and return their results unless those are
> > None (or something else, in case None is already a legal result) - if
> > they return None, the old code runs and does its job. If the index has
> > not been created, the result is None. In the index_** methods, just
> > check whether the query is one you can answer, and if it is, answer it.
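The wrapper pattern being described - try the index first, fall back to the
old code when it returns None - looks roughly like this (index_aux_get and
the cache here are assumptions for the sketch, not existing Portage methods):

```python
class IndexedDb:
    def __init__(self, cache=None):
        # cache maps (cpv, key) -> value; None means the index is disabled
        self._cache = cache

    def index_aux_get(self, cpv, key):
        """Answer from the index if we can, else return None."""
        if self._cache is None:
            return None
        return self._cache.get((cpv, key))

    def aux_get_slow(self, cpv, key):
        """Stand-in for the original, disk-walking implementation."""
        return "slow:%s for %s" % (key, cpv)

    def aux_get(self, cpv, key):
        result = self.index_aux_get(cpv, key)
        if result is not None:
            return result                    # answered from the index
        return self.aux_get_slow(cpv, key)   # old code still does its job

db = IndexedDb(cache={("app-misc/demo-1.0", "DESCRIPTION"): "A demo package"})
print(db.aux_get("app-misc/demo-1.0", "DESCRIPTION"))
print(db.aux_get("app-misc/demo-1.0", "HOMEPAGE"))
```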
> >
> > Obviously, the simplest way to create your index is to delete the
> > index, then use those same methods to query all the necessary
> > information - and the fastest way would be to add index updating
> > directly into sync, which you could do later.
> >
> > Please also add commands to turn the index on and off (the latter
> > should also delete it to save disk space). The default should be off
> > until it's fast, small and reliable. Also notice that if the index is
> > kept on the hard drive, it might be faster if it's compressed (gz, for
> > example) - decompressing takes less time and more processing power than
> > reading it fully out.
>
> I'm pretty sure you're mistaken here, unless your index is stored on a
> floppy or something really slow.
>
> A disk read has 2 primary costs.
>
> Seek Time: Time for the head to seek to the sector of disk you want.
> Spin Time: Time for the platter to spin around such that the sector
> you want is under the read head.
>
> Spin Time is based on rpm, so on average 7200 rpm / 60 seconds = 120
> rotations per second; worst case (you just passed the sector you
> need) you wait a full rotation, 1/120th of a second (about 8ms).
>
> Seek Time is per hard drive, but most drives provide average seek
> times under 10ms.
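Working through those figures (7200 rpm and a 10ms average seek are the
assumptions stated above):

```python
RPM = 7200
rotations_per_sec = RPM / 60               # 120 rotations per second
worst_spin_ms = 1000 / rotations_per_sec   # ~8.3 ms for a full rotation
avg_seek_ms = 10                           # typical quoted average seek time

total_ms = avg_seek_ms + worst_spin_ms     # ~18 ms before reading starts
print(round(worst_spin_ms, 1), round(total_ms, 1))
```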
>
> So it takes on average 18ms to get to your data, then you start
> reading. The index will not be that large (my esearchdb is 2 megs,
> but let's assume 10MB for this compressed index).
>
> I took a 10MB sqlite database and compressed it with gzip (default
> settings) down to 5 megs.
> gzip -d on the database takes 300ms; catting the decompressed
> database takes 88ms (average of 5 runs, dropping disk caches between
> runs).
>
> I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle
>
> 1.3 megs compresses to 390k.
>
> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
> from disk.
>
> Your index would have to be very large or very fragmented on disk
> (requiring more than one seek) to see a significant gain from file
> compression (gzip scales linearly).
>
> In short, don't compress the index ;p
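The same decompress-versus-read comparison can be reproduced in Python (a
sketch with generated placeholder data; on a warm cache the absolute numbers
will differ from the cold-cache figures above):

```python
import gzip
import os
import tempfile
import time

# Build a throwaway data file standing in for the index.
data = b"category/package metadata line\n" * 200000
plain = tempfile.NamedTemporaryFile(delete=False, suffix=".db")
plain.write(data)
plain.close()
packed_path = plain.name + ".gz"
with gzip.open(packed_path, "wb") as f:
    f.write(data)

start = time.perf_counter()
with open(plain.name, "rb") as f:
    raw = f.read()                  # plain read: pure transfer cost
read_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
with gzip.open(packed_path, "rb") as f:
    unpacked = f.read()             # smaller read, plus CPU to decompress
decompress_ms = (time.perf_counter() - start) * 1000

print(len(raw) == len(unpacked), round(read_ms, 1), round(decompress_ms, 1))

os.unlink(plain.name)
os.unlink(packed_path)
```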
>
> >
> > Good luck!
> >
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA1
> >>>
> >>> Emma Strubell schrieb:
> >>> > 2) does anyone really need to search an overlay anyway?
> >>>
> >>> Of course. Take large (semi-)official overlays like sunrise. They can
> >>> easily be seen as a second portage tree.
> >>> -----BEGIN PGP SIGNATURE-----
> >>> Version: GnuPG v2.0.9 (GNU/Linux)
> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >>>
> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
> >>> =+lCO
> >>> -----END PGP SIGNATURE-----
> >>>
> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
> >> wrote:
> >>
> >
> >
>