To remove this fuzz, maybe some useflags would be possible - for example, a
useflag which makes portage use SQLite. Then its db can be used for indexing,
and portage searches could be implemented as SQL queries. SQLite then does
what it can to make them fast, and the disk cache will make sure that for
simultaneous searches it's kept in memory.
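As a rough illustration of that idea - the table and column names here are
made up, nothing like this exists in portage today - a keyword search over an
SQLite index becomes a single query:

```python
import sqlite3

# Hypothetical schema - portage has no such table; this only sketches
# what a USE=sqlite search backend might look like.
conn = sqlite3.connect(":memory:")  # a real backend would use a file on disk
conn.execute("CREATE TABLE packages (cp TEXT PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [("app-editors/vim", "Vim, an improved vi-style text editor"),
     ("app-editors/emacs", "The extensible, customizable text editor")],
)

# The whole keyword search is one SQL query instead of a tree walk;
# SQLite and the OS disk cache then do the heavy lifting.
rows = conn.execute(
    "SELECT cp FROM packages WHERE description LIKE ?",
    ("%text editor%",),
).fetchall()
print([cp for (cp,) in rows])
```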


Just some thoughts on compression...

http://en.wikipedia.org/wiki/Hard_disk 1.6Gbit/s
http://www.dewassoc.com/performance/memory/memory_speeds.htm 1.6GB/s
So the *fastest* hard drives are 8 times slower than the *fastest* memory
devices.

Anyway, we live in the real world...
http://www.amazon.com/Maxtor-L01P200-Internal-7200-Drive/dp/B00007KVHZ up to
133MB/s
For comparison, the peak bandwidth for PC-266 DDR is 266 MHz x 8 Bytes = 2.1GB/s

Now, be fair - if you can read 133MB/s from a hard drive, that means 15 ms
for a 2MB file, without seek times and fragmentation. In that same time you
can read/write about 31.5MB of data in memory, doing different operations
with it. If your compression algorithm has a dictionary of repeated words,
which repeat enough times to make the actual file size 0.5MB smaller, you
win about 4ms and lose about 1ms for decompressing it. Processor usage might
be somewhat bigger, but you really don't care about processor usage when
searching packages, and it's not *that* big.
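A quick sanity check of those figures (each MB shaved off the file buys back
about 7.5ms of disk time at 133MB/s, so 0.5MB saved is worth roughly 4ms):

```python
# Back-of-the-envelope check of the numbers above
# (133MB/s disk read, 2.1GB/s PC-266 DDR peak).
DISK_MBPS = 133.0
MEM_MBPS = 2100.0

read_2mb_ms = 2.0 / DISK_MBPS * 1000              # time to read a 2MB file
mem_mb_meanwhile = MEM_MBPS * read_2mb_ms / 1000  # memory traffic in that time
saved_ms_per_mb = 1.0 / DISK_MBPS * 1000          # disk time saved per MB compressed away

print("%.1f ms, %.1f MB, %.1f ms/MB" %
      (read_2mb_ms, mem_mb_meanwhile, saved_ms_per_mb))
```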

What it really gives us is the ability to make our file format highly
inefficient in terms of disk space - for example, what I mentioned before
was 65536 64-bit integers (there was a mistake as I said they were 16 bytes;
they were 8 bytes long). Most of them started with at least 32 bits of
zeros, which were wiped out by the compression lib, but which I needed for
fast operations. It doesn't matter much how data is stored, but there are
repeated patterns if you want to make it fast.

I really don't know if pickle has some good file format included or not. For
common pickle files, I think that compression is noticeable. Anyway, using a
database, for example, would be a clever idea, and making keyword search
fully SQL-query specific makes it clearly better and removes the chance to
compress.

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.


2008/12/2 Alec Warner <antarus@g.o>

> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
> > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
> >>
> >> True, true. Like I said, I don't really use overlays, so excuse my
> >> ignorance.
> >
> > Do you know the order of doing things:
> >
> > Rules of Optimization:
> >
> > Rule 1: Don't do it.
> > Rule 2 (for experts only): Don't do it yet.
> >
> > What this actually means - functionality comes first. Readability comes
> > next. Optimization comes last. Unless you are creating a fancy 3D engine
> > for a kung fu game.
> >
> > If you are going to exclude overlays, you are removing functionality -
> > and, indeed, absolutely has-to-be-there functionality, because no one
> > would intuitively expect a search function to search only one subset of
> > packages, however reasonable this subset would be. So, you can't, just
> > can't, add this package into the portage base - you could write just
> > another external search package for portage.
> >
> > I looked at this code a bit, and:
> > Portage's "__init__.py" contains the comment "# search functionality".
> > After this comment, there is a nice and simple search class.
> > It also contains the method "def action_sync(...)", which contains
> > synchronization stuff.
> >
> > Now, the search class will be initialized by setting up 3 databases -
> > porttree, bintree and vartree, whatever those are. Those will be in the
> > self._dbs array, and porttree will be in self._portdb.
> >
> > It contains some more methods:
> > _findname(...) will return the result of self._portdb.findname(...) with
> > the same parameters, or None if it does not exist.
> > Other methods will do similar things - map one or another method.
> > execute will do the real search...
> > Now - "for package in self.portdb.cp_all()" is important here ...it
> > currently loops over the whole portage tree. All kinds of matching will
> > be done inside.
> > self.portdb obviously points to porttree.py (unless it points to a fake
> > tree).
> > cp_all will take all porttrees and do a simple file search inside. This
> > method should contain an optional index search.
> >
> > self.porttrees = [self.porttree_root] + \
> >     [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()]
> >
> > So, self.porttrees contains a list of trees - the first of them is the
> > root, the others are overlays.
> >
> > Now, what you have to do will not be harder just because of having
> > overlay search, too.
> >
> > You have to create a method def cp_index(self), which will return a
> > dictionary containing package names as keys. For oroot... it will be
> > "self.porttrees[1:]", not "self.porttrees" - this will only search
> > overlays. d = {} will be replaced with d = self.cp_index(). If the index
> > is not there, the old version will be used (thus, you have to make an
> > internal porttrees variable which contains all, or all except the first).
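A minimal sketch of that cp_index() fallback pattern (the class, the
pickle-based index file and its path are all invented for illustration -
nothing here is existing portage code):

```python
import os
import pickle

class IndexedTree:
    def __init__(self, porttrees, index_path):
        self.porttrees = porttrees    # first entry is the root tree, rest are overlays
        self.index_path = index_path  # hypothetical on-disk index location

    def cp_index(self):
        # Return the index as a dict with package names as keys,
        # or None when no index exists so callers fall back.
        if not os.path.exists(self.index_path):
            return None
        with open(self.index_path, "rb") as f:
            return pickle.load(f)     # e.g. {"app-editors/vim": ...}

    def cp_all(self):
        d = self.cp_index()
        if d is None:
            d = {}                    # old behaviour: walk self.porttrees
            # ... the original file search would fill d here ...
        return sorted(d)
```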
> >
> > Other methods used by search are xmatch and aux_get - the first is used
> > several times and the last one is used to get the description. You have
> > to cache the results of those specific queries and make them use your
> > cache - as you can see, those parts of portage are already able to use
> > overlays. Thus, you have to put your code again at the beginning of
> > those functions - create index_xmatch and index_aux_get methods, then
> > make those methods use them and return their results unless those are
> > None (or something else in case None is already a legal result) - if
> > they return None, the old code will be run and do its job. If the index
> > is not created, the result is None. In the index_** methods, just check
> > if the query is something you can answer, and if it is, then answer it.
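The "try the index first, fall back to the old code" shape being described
might look like this (Search, index_xmatch and the stand-in _old_xmatch are
illustrative only, not portage's real signatures):

```python
class Search:
    def __init__(self, index=None):
        self.index = index            # e.g. {"match-all:vim": [...]} or None

    def index_xmatch(self, level, atom):
        # Answer from the index if we can; None means "can't answer".
        if self.index is None:
            return None               # index not built
        return self.index.get("%s:%s" % (level, atom))

    def _old_xmatch(self, level, atom):
        # Stands in for portage's real (slow) xmatch.
        return ["app-editors/vim-7.2"] if atom == "vim" else []

    def xmatch(self, level, atom):
        result = self.index_xmatch(level, atom)
        if result is not None:
            return result             # answered from the index
        return self._old_xmatch(level, atom)  # old code does its job
```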
> >
> > Obviously, the simplest way to create your index is to delete the index,
> > then use those same methods to query for all necessary information - and
> > the fastest way would be to add updating the index directly into sync,
> > which you could do later.
> >
> > Please, also, make commands to turn the index on and off (the latter
> > should also delete it to save disk space). The default should be off
> > until it's fast, small and reliable. Also notice that if the index is
> > kept on the hard drive, it might be faster if it's compressed (gz, for
> > example) - decompressing takes less time and more processing power than
> > reading it fully out.
>
> I'm pretty sure you're mistaken here, unless your index is stored on a
> floppy or something really slow.
>
> A disk read has 2 primary costs.
>
> Seek Time: Time for the head to seek to the sector of disk you want.
> Spin Time: Time for the platter to spin around such that the sector
> you want is under the read head.
>
> Spin Time is based on rpm: 7200 rpm / 60 seconds = 120 rotations per
> second, so worst case (you just passed the sector you need) you wait
> 1/120th of a second (about 8ms).
>
> Seek Time is per hard drive, but most drives provide average seek
> times under 10ms.
>
> So it takes on average 18ms to get to your data, then you start
> reading. The index will not be that large (my esearchdb is 2 megs,
> but let's assume 10MB for this compressed index).
>
> I took a 10MB sqlite database and compressed it with gzip (default
> settings) down to 5 megs.
> gzip -d on the database takes 300ms; catting the decompressed
> database takes 88ms (average of 5 runs, dropping disk caches between
> runs).
>
> I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle
>
> 1.3 megs compresses to 390k.
>
> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
> from disk.
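For anyone wanting to repeat that comparison, a rough version of the
measurement (file paths and data are made up; this does not drop the kernel
disk cache between runs as the original test did - that needs root - so it
mostly measures decompression cost):

```python
import gzip
import os
import time

# ~1.3MB of data, partly compressible, standing in for the pickle file.
data = os.urandom(512 * 1024) + b"A" * (800 * 1024)

with open("/tmp/raw.bin", "wb") as f:
    f.write(data)
with gzip.open("/tmp/raw.bin.gz", "wb") as f:
    f.write(data)

start = time.time()
with open("/tmp/raw.bin", "rb") as f:
    raw = f.read()                    # plain read of the uncompressed file
raw_ms = (time.time() - start) * 1000

start = time.time()
with gzip.open("/tmp/raw.bin.gz", "rb") as f:
    unpacked = f.read()               # read + gzip decompression
gz_ms = (time.time() - start) * 1000

print("raw read: %.1fms, gzip read: %.1fms" % (raw_ms, gz_ms))
```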
>
> Your index would have to be very large or very fragmented on disk
> (requiring more than one seek) to see a significant gain from file
> compression (gzip scales linearly).
>
> In short, don't compress the index ;p
>
> >
> > Have luck!
> >
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA1
> >>>
> >>> Emma Strubell schrieb:
> >>> > 2) does anyone really need to search an overlay anyway?
> >>>
> >>> Of course. Take large (semi-)official overlays like sunrise. They can
> >>> easily be seen as a second portage tree.
> >>> -----BEGIN PGP SIGNATURE-----
> >>> Version: GnuPG v2.0.9 (GNU/Linux)
> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >>>
> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
> >>> =+lCO
> >>> -----END PGP SIGNATURE-----
> >>>
> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
> >> wrote:
> >>
> >
> >
>