Gentoo Archives: gentoo-portage-dev

From: Tambet <qtvali@×××××.com>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] Re: search functionality in emerge
Date: Tue, 02 Dec 2008 13:51:59
Message-Id: cea53e3c0812020551k135b0c22nbfe0e4aced586c38@mail.gmail.com
In Reply to: Re: [gentoo-portage-dev] Re: search functionality in emerge by Tambet
Btw, one way of handling index updates would be:

emerge --sync:

- Deletes the index if it exists

emerge --searchindexon

- Turns the index on

emerge --searchindexoff

- Turns the index off
- Deletes the index if it exists

emerge -s or emerge -S

- Does a normal search if the index is off
- If the index is on and does not exist, creates the index on the fly
- If the index is on and does exist, uses it

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.

2008/12/2 Tambet <qtvali@×××××.com>

> About zipping... The default settings might not really be a good idea - I
> think "fastest" might be even better. Considering that the portage tree
> contains the same words again and again (like "applications"), it needs a
> pretty small dictionary to compress well. Decompressing will not mean
> reading from disk, decompressing, and writing back to disk, as it probably
> did in your case - try decompressing to a memory drive and you might get
> better numbers.
>
> I have personally used compression in one C++ application, and with
> optimum settings it made things much faster - those were files where I
> had, for example, 65536 16-byte integers, which could be zeros and mostly
> were; I didn't care about creating a better file format, but just
> compressed the whole thing.
>
> I suggest you compress the esearch db, then decompress it to a memory
> drive and give us those numbers - it might be considerably faster.
>
> http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> support. Try open() from that module and a normal open() on the esearch
> db; also compress with the same lib to get the right kind of file.
>
> Anyway - maybe this compression should be added later and be optional.
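The gzip approach suggested above is only a few lines with Python's gzip module. A minimal sketch (the sample file is a stand-in for the real esearch db; the repeated words mimic the portage tree's repetitive text, which is what makes the dictionary small):

```python
import gzip
import os
import tempfile

# Stand-in data for the esearch db.
sample = b"app-editors/vim applications applications applications\n" * 1000

workdir = tempfile.mkdtemp()
db = os.path.join(workdir, "esearchdb.sample")

with open(db, "wb") as f:
    f.write(sample)

# Compress with the same lib that will read it back; compresslevel=1 is "fastest".
with open(db, "rb") as src:
    with gzip.open(db + ".gz", "wb", compresslevel=1) as dst:
        dst.write(src.read())

# gzip.open returns a file-like object, so existing reading code keeps its shape.
with gzip.open(db + ".gz", "rb") as f:
    assert f.read() == sample
```

Because gzip.open hands back a file-like object, swapping it in for a normal open() is the only change the reading side needs.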
>
> Tambet - technique evolves to art, art evolves to magic, magic evolves to
> just doing.
>
>
> 2008/12/2 Alec Warner <antarus@g.o>
>
> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
>> > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
>> >>
>> >> True, true. Like I said, I don't really use overlays, so excuse my
>> >> ignorance.
>> >
>> > Do you know the order of doing things?
>> >
>> > Rules of Optimization:
>> >
>> > Rule 1: Don't do it.
>> > Rule 2 (for experts only): Don't do it yet.
>> >
>> > What this actually means: functionality comes first. Readability comes
>> > next. Optimization comes last. Unless you are creating a fancy 3D
>> > engine for a kung fu game.
>> >
>> > If you are going to exclude overlays, you are removing functionality -
>> > and, indeed, absolutely has-to-be-there functionality, because no one
>> > would intuitively expect the search function to search only one subset
>> > of packages, however reasonable this subset would be. So you can't,
>> > just can't, put this into the portage base - you could write just
>> > another external search package for portage.
>> >
>> > I looked at this code a bit:
>> > Portage's "__init__.py" contains the comment "# search functionality".
>> > After this comment, there is a nice and simple search class.
>> > It also contains the method "def action_sync(...)", which contains the
>> > synchronization stuff.
>> >
>> > Now, the search class is initialized by setting up 3 databases -
>> > porttree, bintree and vartree. Those end up in the self._dbs array,
>> > and porttree also in self._portdb.
>> >
>> > It contains some more methods:
>> > _findname(...) returns the result of self._portdb.findname(...) with
>> > the same parameters, or None if it does not exist.
>> > The other methods do similar things - each maps to one method or
>> > another.
>> > execute does the real search...
>> > Now - "for package in self.portdb.cp_all()" is important here... it
>> > currently loops over the whole portage tree. All kinds of matching are
>> > done inside.
>> > self.portdb obviously points to porttree.py (unless it points to a
>> > fake tree).
>> > cp_all takes all porttrees and does a simple file search inside. This
>> > method should contain the optional index search.
>> >
>> > self.porttrees = [self.porttree_root] + \
>> >     [os.path.realpath(t) for t in
>> >      self.mysettings["PORTDIR_OVERLAY"].split()]
>> >
>> > So, self.porttrees contains the list of trees - the first of them is
>> > the root, the others are overlays.
>> >
>> > Now, what you have to do will not be any harder just because you
>> > handle overlay search, too.
>> >
>> > You have to create a method def cp_index(self), which returns a
>> > dictionary containing package names as keys. For oroot... it will be
>> > "self.porttrees[1:]", not "self.porttrees" - this will only search
>> > overlays. d = {} will be replaced with d = self.cp_index(). If the
>> > index is not there, the old version will be used (thus, you have to
>> > make an internal porttrees variable, which contains all trees, or all
>> > except the first).
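The cp_index idea above could be sketched as two standalone functions (the function names, the pickle format, and the index path are all hypothetical; the real code would live as methods on the porttree db class):

```python
import os
import pickle

def build_cp_index(porttrees, index_path):
    """Walk every tree root (overlays would be porttrees[1:]) and record
    each category/package directory in one flat dict keyed "category/pkg"."""
    d = {}
    for root in porttrees:
        for category in sorted(os.listdir(root)):
            catdir = os.path.join(root, category)
            if not os.path.isdir(catdir):
                continue
            for pkg in sorted(os.listdir(catdir)):
                if os.path.isdir(os.path.join(catdir, pkg)):
                    d.setdefault(category + "/" + pkg, []).append(root)
    with open(index_path, "wb") as f:
        pickle.dump(d, f)
    return d

def load_cp_index(index_path):
    """Return the index dict, or None when the index file is absent so the
    caller can fall back to the old directory walk."""
    if not os.path.exists(index_path):
        return None
    with open(index_path, "rb") as f:
        return pickle.load(f)
```

Returning None on a missing index matches the fallback behavior described above: cp_all would call load_cp_index first and only do the file walk when it gets None back.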
>> >
>> > The other methods used by search are xmatch and aux_get - the first is
>> > used several times and the last is used to get the description. You
>> > have to cache the results of those specific queries and make them use
>> > your cache - as you can see, those parts of portage are already able
>> > to use overlays. Thus, you have to put your code at the beginning of
>> > those functions - create index_xmatch and index_aux_get methods, then
>> > make the original methods use them and return their results unless
>> > those are None (or some other sentinel, in case None is already a
>> > legal result) - if they return None, the old code runs and does its
>> > job. If the index has not been created, the result is None. In the
>> > index_* methods, just check whether the query is one you can answer,
>> > and if it is, answer it.
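The index-first-with-fallback pattern described above can be expressed as a small wrapper (make_indexed is a made-up name; the real change would edit xmatch and aux_get in place):

```python
def make_indexed(original, index_lookup):
    """Wrap a portage query method (e.g. xmatch or aux_get): ask the index
    first, and fall back to the original method when the index returns None
    (use another sentinel if None is a legal result for that query)."""
    def wrapped(*args):
        result = index_lookup(*args)
        if result is None:
            # Index missing, or query not answerable from it: run the old code.
            result = original(*args)
        return result
    return wrapped
```

Hypothetical wiring would then look like `portdb.xmatch = make_indexed(portdb.xmatch, index_xmatch)`, leaving the old code path untouched for queries the index cannot answer.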
>> >
>> > Obviously, the simplest way to create your index is to delete the old
>> > index, then use those same methods to query all the necessary
>> > information - and the fastest way would be to update the index
>> > directly during sync, which you could add later.
>> >
>> > Please also add commands to turn the index on and off (the latter
>> > should also delete it to save disk space). The default should be off
>> > until it's fast, small and reliable. Also notice that if the index is
>> > kept on the hard drive, it might be faster if it's compressed (gz,
>> > for example) - decompressing takes less time and more processing
>> > power than reading it out in full.
>>
>> I'm pretty sure you're mistaken here, unless your index is stored on a
>> floppy or something really slow.
>>
>> A disk read has 2 primary costs.
>>
>> Seek Time: Time for the head to seek to the sector of disk you want.
>> Spin Time: Time for the platter to spin around such that the sector
>> you want is under the read head.
>>
>> Spin Time is based on rpm, so on average 7200 rpm / 60 seconds = 120
>> rotations per second; worst case (you just passed the sector you
>> need) you wait a full rotation, 1/120th of a second (about 8ms).
>>
>> Seek Time varies per hard drive, but most drives quote average seek
>> times under 10ms.
>>
>> So it takes on average 18ms to get to your data, then you start
>> reading. The index will not be that large (my esearchdb is 2 megs,
>> but let's assume 10MB for this compressed index).
>>
>> I took a 10MB sqlite database and compressed it with gzip (default
>> settings) down to 5 megs.
>> gzip -d on the database takes 300ms; catting the decompressed
>> database takes 88ms (average of 5 runs, dropping disk caches between
>> runs).
>>
>> I then tried my vdb_metadata.pickle from
>> /var/cache/edb/vdb_metadata.pickle
>>
>> 1.3 megs compresses to 390k.
>>
>> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
>> from disk.
>>
>> Your index would have to be very large or very fragmented on disk
>> (requiring more than one seek) to see a significant gain from file
>> compression (gzip scales linearly).
>>
>> In short, don't compress the index ;p
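Alec's comparison can be reproduced with a few lines of Python (timings vary by machine, and unlike his test this does not drop the kernel page cache, so it understates the cost of the plain read):

```python
import gzip
import os
import time

def time_reads(path):
    """Time a plain read of `path` against decompressing a gzipped copy
    of it; returns (plain_seconds, gunzip_seconds)."""
    with open(path, "rb") as f:
        data = f.read()
    with gzip.open(path + ".gz", "wb") as f:
        f.write(data)

    t0 = time.time()
    with open(path, "rb") as f:
        f.read()
    plain = time.time() - t0

    t0 = time.time()
    with gzip.open(path + ".gz", "rb") as f:
        f.read()
    gunzip = time.time() - t0
    return plain, gunzip
```

To match the original methodology, run each measurement several times and drop the disk caches between runs (e.g. via /proc/sys/vm/drop_caches as root).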
>>
>> >
>> > Good luck!
>> >
>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> Hash: SHA1
>> >>>
>> >>> Emma Strubell schrieb:
>> >>> > 2) does anyone really need to search an overlay anyway?
>> >>>
>> >>> Of course. Take large (semi-)official overlays like sunrise. They can
>> >>> easily be seen as a second portage tree.
>> >>> -----BEGIN PGP SIGNATURE-----
>> >>> Version: GnuPG v2.0.9 (GNU/Linux)
>> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>> >>>
>> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
>> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
>> >>> =+lCO
>> >>> -----END PGP SIGNATURE-----
>> >>>
>> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
>> >> wrote:
>> >>
>> >
>> >
>>
>
>