Gentoo Archives: gentoo-portage-dev

From: Tambet <qtvali@×××××.com>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] Re: search functionality in emerge
Date: Tue, 02 Dec 2008 12:42:20
Message-Id: cea53e3c0812020442s6fcb36ao7e8fcee400a3370b@mail.gmail.com
In Reply to: Re: [gentoo-portage-dev] Re: search functionality in emerge by Alec Warner
1 About zipping.. Default settings might not really be good idea - i think
2 that "fastest" might be even better. Considering that portage tree contains
3 same word again and again (like "applications") it needs pretty small
4 dictionary to make it much smaller. Decompressing will not be reading from
5 disc, decompressing and writing back to disc as in your case probably - try
6 decompression to memory drive and you might get better numbers.
7
8 I have personally used compression in one c++ application and with optimum
9 settings, it made things much faster - those were files, where i had for
10 example 65536 16-byte integers, which could be zeros and mostly were; I
11 didnt care about creating better file format, but just compressed the whole
12 thing.
13
14 I suggest you to compress esearch db, then decompress it to memory drive and
15 give us those numbers - might be considerably faster.
16
17 http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip support.
18 Try open of that and normal open on esearch db; also compress with the same
19 lib to get right kind of file.
20
21 Anyway - maybe this compression should be later added and optional.
22
23 Tambet - technique evolves to art, art evolves to magic, magic evolves to
24 just doing.
25
26
27 2008/12/2 Alec Warner <antarus@g.o>
28
29 > On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
30 > > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
31 > >>
32 > >> True, true. Like I said, I don't really use overlays, so excuse my
33 > >> igonrance.
34 > >
35 > > Do you know an order of doing things:
36 > >
37 > > Rules of Optimization:
38 > >
39 > > Rule 1: Don't do it.
40 > > Rule 2 (for experts only): Don't do it yet.
41 > >
42 > > What this actually means - functionality comes first. Readability comes
43 > > next. Optimization comes last. Unless you are creating a fancy 3D engine
44 > for
45 > > kung fu game.
46 > >
47 > > If you are going to exclude overlays, you are removing functionality -
48 > and,
49 > > indeed, absolutely has-to-be-there functionality, because noone would
50 > > intuitively expect search function to search only one subset of packages,
51 > > however reasonable this subset would be. So, you can't, just can't, add
52 > this
53 > > package into portage base - you could write just another external search
54 > > package for portage.
55 > >
56 > > I looked this code a bit and:
57 > > Portage's "__init__.py" contains comment "# search functionality". After
58 > > this comment, there is a nice and simple search class.
59 > > It also contains method "def action_sync(...)", which contains
60 > > synchronization stuff.
61 > >
62 > > Now, search class will be initialized by setting up 3 databases -
63 > porttree,
64 > > bintree and vartree, whatever those are. Those will be in self._dbs array
65 > > and porttree will be in self._portdb.
66 > >
67 > > It contains some more methods:
68 > > _findname(...) will return result of self._portdb.findname(...) with same
69 > > parameters or None if it does not exist.
70 > > Other methods will do similar things - map one or another method.
71 > > execute will do the real search...
72 > > Now - "for package in self.portdb.cp_all()" is important here ...it
73 > > currently loops over whole portage tree. All kinds of matching will be
74 > done
75 > > inside.
76 > > self.portdb obviously points to porttree.py (unless it points to fake
77 > tree).
78 > > cp_all will take all porttrees and do simple file search inside. This
79 > method
80 > > should contain optional index search.
81 > >
82 > > self.porttrees = [self.porttree_root] + \
83 > > [os.path.realpath(t) for t in
84 > self.mysettings["PORTDIR_OVERLAY"].split()]
85 > >
86 > > So, self.porttrees contains list of trees - first of them is root, others
87 > > are overlays.
88 > >
89 > > Now, what you have to do will not be harder just because of having
90 > overlay
91 > > search, too.
92 > >
93 > > You have to create method def cp_index(self), which will return
94 > dictionary
95 > > containing package names as keys. For oroot... will be
96 > "self.porttrees[1:]",
97 > > not "self.porttrees" - this will only search overlays. d = {} will be
98 > > replaced with d = self.cp_index(). If index is not there, old version
99 > will
100 > > be used (thus, you have to make internal porttrees variable, which
101 > contains
102 > > all or all except first).
103 > >
104 > > Other methods used by search are xmatch and aux_get - first used several
105 > > times and last one used to get description. You have to cache results of
106 > > those specific queries and make them use your cache - as you can see,
107 > those
108 > > parts of portage are already able to use overlays. Thus, you have to put
109 > > your code again in beginning of those functions - create index_xmatch and
110 > > index_aux_get methods, then make those methods use them and return their
111 > > results unless those are None (or something other in case none is already
112 > > legal result) - if they return None, old code will be run and do it's
113 > job.
114 > > If index is not created, result is None. In index_** methods, just check
115 > if
116 > > query is what you can answer and if it is, then answer it.
117 > >
118 > > Obviously, the simplest way to create your index is to delete index, then
119 > > use those same methods to query for all nessecary information - and
120 > fastest
121 > > way would be to add updating index directly into sync, which you could do
122 > > later.
123 > >
124 > > Please, also, make those commands to turn index on and off (last one
125 > should
126 > > also delete it to save disk space). Default should be off until it's
127 > fast,
128 > > small and reliable. Also notice that if index is kept on hard drive, it
129 > > might be faster if it's compressed (gz, for example) - decompressing
130 > takes
131 > > less time and more processing power than reading it fully out.
132 >
133 > I'm pretty sure your mistaken here, unless your index is stored on a
134 > floppy or something really slow.
135 >
136 > A disk read has 2 primary costs.
137 >
138 > Seek Time: Time for the head to seek to the sector of disk you want.
139 > Spin Time: Time for the platter to spin around such that the sector
140 > you want is under the read head.
141 >
142 > Spin Time is based on rpm, so average 7200 rpm / 60 seconds = 120
143 > rotations per second, so worst case (you just passed the sector you
144 > need) you need to wait 1/120th of a second (or 8ms).
145 >
146 > Seek Time is per hard drive, but most drives provide average seek
147 > times under 10ms.
148 >
149 > So it takes on average 18ms to get to your data, then you start
150 > reading. The index will not be that large (my esearchdb is 2 megs,
151 > but lets assume 10MB for this compressed index).
152 >
153 > I took a 10MB meg sqlite database and compressed it with gzip (default
154 > settings) down to 5 megs.
155 > gzip -d on the database takes 300ms, catting the decompressed data
156 > base takes 88ms (average of 5 runs, drop disk caches between runs).
157 >
158 > I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle
159 >
160 > 1.3megs compresses to 390k.
161 >
162 > 36ms to decompress the 390k file, but 26ms to read the 1.3meg file from
163 > disk.
164 >
165 > Your index would have to be very large or very fragmented on disk
166 > (requiring more than one seek) to see a significant gain in file
167 > compression (gzip scales linearly).
168 >
169 > In short, don't compress the index ;p
170 >
171 > >
172 > > Have luck!
173 > >
174 > >>> -----BEGIN PGP SIGNED MESSAGE-----
175 > >>> Hash: SHA1
176 > >>>
177 > >>> Emma Strubell schrieb:
178 > >>> > 2) does anyone really need to search an overlay anyway?
179 > >>>
180 > >>> Of course. Take large (semi-)official overlays like sunrise. They can
181 > >>> easily be seen as a second portage tree.
182 > >>> -----BEGIN PGP SIGNATURE-----
183 > >>> Version: GnuPG v2.0.9 (GNU/Linux)
184 > >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
185 > >>>
186 > >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
187 > >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
188 > >>> =+lCO
189 > >>> -----END PGP SIGNATURE-----
190 > >>>
191 > >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
192 > >> wrote:
193 > >>
194 > >
195 > >
196 >

Replies