On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:

> 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
>>
>> True, true. Like I said, I don't really use overlays, so excuse my
>> ignorance.
>
> Do you know the order of doing things?
>
> Rules of Optimization:
>
> Rule 1: Don't do it.
> Rule 2 (for experts only): Don't do it yet.
>
> What this actually means: functionality comes first. Readability comes
> next. Optimization comes last - unless you are creating a fancy 3D
> engine for a kung-fu game.
>
> If you are going to exclude overlays, you are removing functionality -
> and, indeed, absolutely has-to-be-there functionality, because no one
> would intuitively expect the search function to search only one subset
> of packages, however reasonable this subset might be. So you can't, just
> can't, add this into the portage base - you could only write yet another
> external search package for portage.
>
> I looked at this code a bit:
> Portage's "__init__.py" contains the comment "# search functionality".
> After this comment, there is a nice and simple search class.
> It also contains the method "def action_sync(...)", which holds the
> synchronization stuff.
>
> Now, the search class is initialized by setting up three databases -
> porttree, bintree and vartree. Those end up in the self._dbs array, and
> porttree also in self._portdb.
>
> It contains some more methods:
> _findname(...) returns the result of self._portdb.findname(...) with the
> same parameters, or None if it does not exist.
> The other methods do similar things - each maps to one method or another.
> execute does the real search...
> Now, "for package in self.portdb.cp_all()" is the important part here -
> it currently loops over the whole portage tree, and all kinds of
> matching are done inside that loop.
> self.portdb obviously points to porttree.py (unless it points to a fake
> tree). cp_all takes all porttrees and does a simple file search inside
> them. This method is where the optional index lookup should go.
>
> self.porttrees = [self.porttree_root] + \
>     [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()]
>
> So, self.porttrees contains the list of trees - the first of them is the
> root, the others are overlays.
>
> Now, what you have to do will not get any harder just because overlay
> search is included, too.
>
> You have to create a method def cp_index(self), which returns a
> dictionary with package names as keys. For oroot... it will be
> "self.porttrees[1:]", not "self.porttrees" - this way only overlays are
> searched. d = {} will be replaced with d = self.cp_index(). If the index
> is not there, the old code path is used (thus, you need an internal
> porttrees variable that contains either all trees or all except the
> first).
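The idea above can be sketched as a standalone function. This is only an illustration, not portage's actual code: the skip list and the fake-overlay demo layout are assumptions, and in portage it would be a method on the search class called with self.porttrees[1:].

```python
import os
import tempfile

def cp_index(porttrees):
    """Build a {"category/package": [tree, ...]} mapping by walking each
    tree root once, instead of repeating a file search per query."""
    skip = {"profiles", "metadata", "eclass", "licenses", "distfiles"}
    d = {}
    for tree in porttrees:
        if not os.path.isdir(tree):
            continue
        for category in sorted(os.listdir(tree)):
            catdir = os.path.join(tree, category)
            if category in skip or category.startswith(".") \
                    or not os.path.isdir(catdir):
                continue  # not a package category
            for pkg in sorted(os.listdir(catdir)):
                if os.path.isdir(os.path.join(catdir, pkg)):
                    d.setdefault(category + "/" + pkg, []).append(tree)
    return d

# Tiny demo with a fake overlay layout.
overlay = tempfile.mkdtemp()
os.makedirs(os.path.join(overlay, "app-editors", "vim"))
os.makedirs(os.path.join(overlay, "profiles"))  # ignored: not a category
index = cp_index([overlay])
print(sorted(index))  # ['app-editors/vim']
```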
>
> The other methods used by search are xmatch and aux_get - the first is
> used several times, and the last one is used to get the description. You
> have to cache the results of those specific queries and make the methods
> use your cache - as you can see, those parts of portage are already able
> to use overlays. Thus, you again put your code at the beginning of those
> functions - create index_xmatch and index_aux_get methods, make xmatch
> and aux_get call them first, and return their results unless those are
> None (or some other sentinel, in case None is already a legal result) -
> if they return None, the old code runs and does its job. If the index
> has not been created, the result is None. In the index_* methods, just
> check whether the query is one you can answer, and if it is, answer it.
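The fall-through pattern being described looks roughly like this. The names index_xmatch and xmatch come from the mail; the _index dict and the _slow_xmatch stand-in are made up here for illustration:

```python
class IndexedSearch:
    """Sketch of an index with a fall-through to the old code path."""

    def __init__(self, index=None):
        # index maps (level, atom) -> precomputed xmatch result;
        # None means the index has not been created.
        self._index = index

    def index_xmatch(self, level, atom):
        """Answer from the index, or None when the index cannot help."""
        if self._index is None:
            return None
        return self._index.get((level, atom))

    def xmatch(self, level, atom):
        result = self.index_xmatch(level, atom)
        if result is not None:
            return result
        # Old code runs and does its job when the index has no answer.
        return self._slow_xmatch(level, atom)

    def _slow_xmatch(self, level, atom):
        return ["%s-9999" % atom]  # stand-in for the real tree walk

s = IndexedSearch(index={("match-all", "app-editors/vim"):
                         ["app-editors/vim-7.2"]})
print(s.xmatch("match-all", "app-editors/vim"))  # answered from the index
print(s.xmatch("match-all", "app-misc/screen"))  # falls through to old code
```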
>
> Obviously, the simplest way to create your index is to delete the old
> index and then use those same methods to query for all the necessary
> information - the fastest way would be to update the index directly
> during sync, which you could do later.
>
> Please also add commands to turn the index on and off (the latter should
> also delete it to save disk space). The default should be off until it's
> fast, small and reliable. Also note that if the index is kept on the
> hard drive, it might be faster if it's compressed (gz, for example) -
> decompressing trades less read time for more processing power compared
> to reading it out in full.

I'm pretty sure you're mistaken here, unless your index is stored on a
floppy or something really slow.

A disk read has two primary costs:

Seek Time: time for the head to seek to the track of the disk you want.
Spin Time: time for the platter to spin around until the sector you want
is under the read head.

Spin Time is based on RPM: 7200 rpm / 60 seconds = 120 rotations per
second, so in the worst case (you just passed the sector you need) you
wait 1/120th of a second (about 8 ms).

Seek Time varies per hard drive, but most drives quote average seek
times under 10 ms.

So it takes on average 18 ms to get to your data, and then you start
reading. The index will not be that large (my esearchdb is 2 MB, but
let's assume 10 MB for this compressed index).
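The arithmetic above can be checked back-of-the-envelope style. The seek and rotation figures come from this mail; the 60 MB/s sequential throughput is an assumed figure for a 2008-era 7200 rpm drive, and the 300 ms gunzip time is taken from the measurement below:

```python
# Access cost: one average seek plus worst-case rotational delay.
seek_ms = 10.0            # average seek time (assumed typical drive)
spin_ms = 1000.0 / 120    # worst-case rotational delay at 7200 rpm
access_ms = seek_ms + spin_ms

throughput_mb_s = 60.0    # assumed sequential read speed

def read_ms(size_mb):
    """Time to reach and sequentially read size_mb from disk."""
    return access_ms + size_mb / throughput_mb_s * 1000.0

plain = read_ms(10.0)            # read the uncompressed 10 MB index
gzipped = read_ms(5.0) + 300.0   # read 5 MB, then ~300 ms to gunzip

print("access ~%.1f ms" % access_ms)   # ~18.3 ms
print("plain  ~%.0f ms" % plain)       # ~185 ms
print("gzip   ~%.0f ms" % gzipped)     # ~402 ms
```

Under these assumptions the compressed index loses: the decompression time outweighs the halved transfer time by a wide margin.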

I took a 10 MB sqlite database and compressed it with gzip (default
settings) down to 5 MB.
gzip -d on the database takes 300 ms; catting the decompressed database
takes 88 ms (average of 5 runs, dropping disk caches between runs).

I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle.

1.3 MB compresses to 390 kB.

It takes 36 ms to decompress the 390 kB file, but only 26 ms to read the
1.3 MB file from disk.
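A rough way to reproduce this kind of comparison is sketched below. The payload is a made-up stand-in for the pickle; note that a real measurement must drop the OS page cache between runs (echo 3 > /proc/sys/vm/drop_caches as root), or the "read" mostly measures memory speed:

```python
import gzip
import os
import tempfile
import time

# Stand-in payload roughly the size of vdb_metadata.pickle (~1.3 MB).
payload = b"category/package DESCRIPTION ...\n" * 40000

tmp = tempfile.mkdtemp()
plain_path = os.path.join(tmp, "index")
gz_path = os.path.join(tmp, "index.gz")
with open(plain_path, "wb") as f:
    f.write(payload)
with gzip.open(gz_path, "wb") as f:
    f.write(payload)

t0 = time.time()
with open(plain_path, "rb") as f:
    plain = f.read()                 # plain sequential read
t1 = time.time()
with gzip.open(gz_path, "rb") as f:
    unzipped = f.read()              # smaller read + decompression
t2 = time.time()

assert plain == unzipped == payload
print("plain read: %.1f ms, gzip read: %.1f ms"
      % ((t1 - t0) * 1e3, (t2 - t1) * 1e3))
```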
115 |
|
116 |
Your index would have to be very large or very fragmented on disk |
117 |
(requiring more than one seek) to see a significant gain in file |
118 |
compression (gzip scales linearly). |
119 |
|
120 |
In short, don't compress the index ;p |

>
> Good luck!
>
>>> Emma Strubell schrieb:
>>> > 2) does anyone really need to search an overlay anyway?
>>>
>>> Of course. Take large (semi-)official overlays like sunrise. They can
>>> easily be seen as a second portage tree.
>>>
>> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
>> wrote:
>>
>
>