Gentoo Archives: gentoo-portage-dev

From: Alec Warner <antarus@g.o>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] Re: search functionality in emerge
Date: Tue, 02 Dec 2008 10:21:10
Message-Id: b41005390812020221s3645d862t62866ab2d62ce85@mail.gmail.com
In Reply to: Re: [gentoo-portage-dev] Re: search functionality in emerge by Tambet
1 On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
2 > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
3 >>
4 >> True, true. Like I said, I don't really use overlays, so excuse my
5 >> igonrance.
6 >
7 > Do you know an order of doing things:
8 >
9 > Rules of Optimization:
10 >
11 > Rule 1: Don't do it.
12 > Rule 2 (for experts only): Don't do it yet.
13 >
14 > What this actually means - functionality comes first. Readability comes
15 > next. Optimization comes last. Unless you are creating a fancy 3D engine for
16 > kung fu game.
17 >
18 > If you are going to exclude overlays, you are removing functionality - and,
19 > indeed, absolutely has-to-be-there functionality, because noone would
20 > intuitively expect search function to search only one subset of packages,
21 > however reasonable this subset would be. So, you can't, just can't, add this
22 > package into portage base - you could write just another external search
23 > package for portage.
24 >
25 > I looked this code a bit and:
26 > Portage's "__init__.py" contains comment "# search functionality". After
27 > this comment, there is a nice and simple search class.
28 > It also contains method "def action_sync(...)", which contains
29 > synchronization stuff.
30 >
31 > Now, search class will be initialized by setting up 3 databases - porttree,
32 > bintree and vartree, whatever those are. Those will be in self._dbs array
33 > and porttree will be in self._portdb.
34 >
35 > It contains some more methods:
36 > _findname(...) will return result of self._portdb.findname(...) with same
37 > parameters or None if it does not exist.
38 > Other methods will do similar things - map one or another method.
39 > execute will do the real search...
40 > Now - "for package in self.portdb.cp_all()" is important here ...it
41 > currently loops over whole portage tree. All kinds of matching will be done
42 > inside.
43 > self.portdb obviously points to porttree.py (unless it points to fake tree).
44 > cp_all will take all porttrees and do simple file search inside. This method
45 > should contain optional index search.
46 >
47 > self.porttrees = [self.porttree_root] + \
48 > [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()]
49 >
50 > So, self.porttrees contains list of trees - first of them is root, others
51 > are overlays.
52 >
53 > Now, what you have to do will not be harder just because of having overlay
54 > search, too.
55 >
56 > You have to create method def cp_index(self), which will return dictionary
57 > containing package names as keys. For oroot... will be "self.porttrees[1:]",
58 > not "self.porttrees" - this will only search overlays. d = {} will be
59 > replaced with d = self.cp_index(). If index is not there, old version will
60 > be used (thus, you have to make internal porttrees variable, which contains
61 > all or all except first).
62 >
63 > Other methods used by search are xmatch and aux_get - first used several
64 > times and last one used to get description. You have to cache results of
65 > those specific queries and make them use your cache - as you can see, those
66 > parts of portage are already able to use overlays. Thus, you have to put
67 > your code again in beginning of those functions - create index_xmatch and
68 > index_aux_get methods, then make those methods use them and return their
69 > results unless those are None (or something other in case none is already
70 > legal result) - if they return None, old code will be run and do it's job.
71 > If index is not created, result is None. In index_** methods, just check if
72 > query is what you can answer and if it is, then answer it.
73 >
74 > Obviously, the simplest way to create your index is to delete index, then
75 > use those same methods to query for all nessecary information - and fastest
76 > way would be to add updating index directly into sync, which you could do
77 > later.
78 >
79 > Please, also, make those commands to turn index on and off (last one should
80 > also delete it to save disk space). Default should be off until it's fast,
81 > small and reliable. Also notice that if index is kept on hard drive, it
82 > might be faster if it's compressed (gz, for example) - decompressing takes
83 > less time and more processing power than reading it fully out.
84
85 I'm pretty sure your mistaken here, unless your index is stored on a
86 floppy or something really slow.
87
88 A disk read has 2 primary costs.
89
90 Seek Time: Time for the head to seek to the sector of disk you want.
91 Spin Time: Time for the platter to spin around such that the sector
92 you want is under the read head.
93
94 Spin Time is based on rpm, so average 7200 rpm / 60 seconds = 120
95 rotations per second, so worst case (you just passed the sector you
96 need) you need to wait 1/120th of a second (or 8ms).
97
98 Seek Time is per hard drive, but most drives provide average seek
99 times under 10ms.
100
101 So it takes on average 18ms to get to your data, then you start
102 reading. The index will not be that large (my esearchdb is 2 megs,
103 but lets assume 10MB for this compressed index).
104
105 I took a 10MB meg sqlite database and compressed it with gzip (default
106 settings) down to 5 megs.
107 gzip -d on the database takes 300ms, catting the decompressed data
108 base takes 88ms (average of 5 runs, drop disk caches between runs).
109
110 I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle
111
112 1.3megs compresses to 390k.
113
114 36ms to decompress the 390k file, but 26ms to read the 1.3meg file from disk.
115
116 Your index would have to be very large or very fragmented on disk
117 (requiring more than one seek) to see a significant gain in file
118 compression (gzip scales linearly).
119
120 In short, don't compress the index ;p
121
122 >
123 > Have luck!
124 >
125 >>> -----BEGIN PGP SIGNED MESSAGE-----
126 >>> Hash: SHA1
127 >>>
128 >>> Emma Strubell schrieb:
129 >>> > 2) does anyone really need to search an overlay anyway?
130 >>>
131 >>> Of course. Take large (semi-)official overlays like sunrise. They can
132 >>> easily be seen as a second portage tree.
133 >>> -----BEGIN PGP SIGNATURE-----
134 >>> Version: GnuPG v2.0.9 (GNU/Linux)
135 >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
136 >>>
137 >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
138 >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
139 >>> =+lCO
140 >>> -----END PGP SIGNATURE-----
141 >>>
142 >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
143 >> wrote:
144 >>
145 >
146 >

Replies

Subject Author
Re: [gentoo-portage-dev] Re: search functionality in emerge Tambet <qtvali@×××××.com>
Re: [gentoo-portage-dev] Re: search functionality in emerge Tambet <qtvali@×××××.com>