On Tue, Dec 2, 2008 at 4:42 AM, Tambet <qtvali@×××××.com> wrote:
> About zipping.. Default settings might not really be a good idea - I think
> that "fastest" might be even better. Considering that the portage tree contains
> the same word again and again (like "applications") it needs a pretty small
> dictionary to make it much smaller. Decompressing will not be reading from
> disc, decompressing and writing back to disc as in your case probably - try
> decompression to a memory drive and you might get better numbers.

I ran gzip -d -c file.gz > /dev/null, which should not write to disk.

I tried again with gzip -1 and it still takes 29ms to decompress,
where a bare read takes 26ms. (I have a 2.6GHz X2, which is probably
relevant to gzip decompression speed.)

>
> I have personally used compression in one C++ application and with optimum
> settings, it made things much faster - those were files where I had, for
> example, 65536 16-byte integers, which could be zeros and mostly were; I
> didn't care about creating a better file format, but just compressed the
> whole thing.

I'm not saying compression won't make the index smaller. I'm saying
making the index smaller does not improve performance. If you have a
10 meg file and you make it 1 meg, you do not increase performance,
because on average the time you save reading the smaller file is paid
back decompressing it later.
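
That trade-off can be put in a back-of-the-envelope model. Both throughput figures below are assumptions for illustration only, picked to be roughly consistent with the timings quoted elsewhere in this thread:

```python
# Rough model of "read compressed + decompress" vs "read uncompressed".
# DISK_MB_S and GUNZIP_MB_S are assumed numbers, not benchmarks.

DISK_MB_S = 50.0     # assumed sequential disk read throughput
GUNZIP_MB_S = 40.0   # assumed decompression speed (MB of *output* per second)

def plain_read_ms(size_mb):
    return size_mb / DISK_MB_S * 1000

def compressed_read_ms(size_mb, ratio):
    # read the smaller file, then decompress it back to size_mb
    return (size_mb * ratio) / DISK_MB_S * 1000 + size_mb / GUNZIP_MB_S * 1000

# 10 MB file compressed 10:1 -- the read is cheaper, but decompression
# costs more than the read time saved:
print(plain_read_ms(10))            # 200.0 (ms)
print(compressed_read_ms(10, 0.1))  # 270.0 (ms): 20.0 read + 250.0 gunzip
```

With these (assumed) numbers the compressed path loses; it only wins if decompression is faster than the disk, which the measurements below suggest it isn't here.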

>
> I suggest you compress the esearch db, then decompress it to a memory drive
> and give us those numbers - might be considerably faster.

gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1) takes
35.17ms on average (6 trials, caches dropped between trials).

cat esearchdb.py > /dev/null (uncompressed) takes 24ms on average over 6 trials.
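
For anyone wanting to repeat the measurement, a minimal sketch (file names are placeholders; unlike the numbers above, this does not drop the page cache between trials - run `echo 3 > /proc/sys/vm/drop_caches` as root between runs for cold-cache figures):

```python
import gzip
import time

def avg_ms(fn, trials=6):
    """Average wall-clock milliseconds over several trials."""
    total = 0.0
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / trials * 1000

def read_plain(path):
    with open(path, "rb") as f:
        return f.read()

def read_gzipped(path):
    # gzip.open decompresses transparently on read
    with gzip.open(path, "rb") as f:
        return f.read()

# e.g.:
# print(avg_ms(lambda: read_gzipped("esearchdb.py.gz")))
# print(avg_ms(lambda: read_plain("esearchdb.py")))
```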

The point is you use compression when you need to save space (sending
data over the network, or storing large amounts of data or a lot of
something). The index isn't going to be big (if it is bigger than 20
or 30 meg I'll be surprised), the index isn't going over the network,
and there is only 1 index, not say a million indexes (where
compression might actually be useful for some kind of LRU subset of
indexes to meet disk requirements).

Anyway this is all moot since, as you stated so well earlier,
optimization comes last, so stop trying to do it now ;)

-Alec

>
> http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip support.
> Try open() from that module and a normal open() on the esearch db; also
> compress with the same lib to get the right kind of file.
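
For example, the stdlib gzip module referenced above can be used like this (the file name and content are placeholders standing in for the real esearch db; `compresslevel=1` matches the `gzip -1` tried earlier):

```python
import gzip
import os
import tempfile

# Placeholder content standing in for the esearch db.
data = b"app-editors/vim Vim, an improved vi-style text editor\n" * 1000

path = os.path.join(tempfile.mkdtemp(), "esearchdb.gz")

# Compress with the same library that will read it back.
with gzip.open(path, "wb", compresslevel=1) as f:
    f.write(data)

# gzip.open() decompresses transparently on read.
with gzip.open(path, "rb") as f:
    assert f.read() == data

# Highly repetitive text compresses well, as Tambet notes.
assert os.path.getsize(path) < len(data)
```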
>
> Anyway - maybe this compression should be added later and be optional.
>
> Tambet - technique evolves to art, art evolves to magic, magic evolves to
> just doing.
>
>
> 2008/12/2 Alec Warner <antarus@g.o>
>>
>> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
>> > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
>> >>
>> >> True, true. Like I said, I don't really use overlays, so excuse my
>> >> ignorance.
>> >
>> > Do you know the order of doing things:
>> >
>> > Rules of Optimization:
>> >
>> > Rule 1: Don't do it.
>> > Rule 2 (for experts only): Don't do it yet.
>> >
>> > What this actually means - functionality comes first. Readability comes
>> > next. Optimization comes last. Unless you are creating a fancy 3D engine
>> > for a kung fu game.
>> >
>> > If you are going to exclude overlays, you are removing functionality -
>> > and, indeed, absolutely has-to-be-there functionality, because no one
>> > would intuitively expect the search function to search only one subset
>> > of packages, however reasonable this subset would be. So, you can't,
>> > just can't, add this package into the portage base - you could write
>> > just another external search package for portage.
>> >
>> > I looked at this code a bit and:
>> > Portage's "__init__.py" contains the comment "# search functionality".
>> > After this comment, there is a nice and simple search class.
>> > It also contains the method "def action_sync(...)", which contains
>> > synchronization stuff.
>> >
>> > Now, the search class will be initialized by setting up 3 databases -
>> > porttree, bintree and vartree, whatever those are. Those will be in the
>> > self._dbs array, and porttree will be in self._portdb.
103 |
>> > |
104 |
>> > It contains some more methods: |
105 |
>> > _findname(...) will return result of self._portdb.findname(...) with |
106 |
>> > same |
107 |
>> > parameters or None if it does not exist. |
108 |
>> > Other methods will do similar things - map one or another method. |
109 |
>> > execute will do the real search... |
110 |
>> > Now - "for package in self.portdb.cp_all()" is important here ...it |
111 |
>> > currently loops over whole portage tree. All kinds of matching will be |
112 |
>> > done |
113 |
>> > inside. |
114 |
>> > self.portdb obviously points to porttree.py (unless it points to fake |
115 |
>> > tree). |
116 |
>> > cp_all will take all porttrees and do simple file search inside. This |
117 |
>> > method |
118 |
>> > should contain optional index search. |
>> >
>> > self.porttrees = [self.porttree_root] + \
>> >     [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()]
>> >
>> > So, self.porttrees contains a list of trees - the first of them is the
>> > root, the others are overlays.
>> >
>> > Now, what you have to do will not be harder just because of having
>> > overlay search, too.
>> >
>> > You have to create a method def cp_index(self), which will return a
>> > dictionary containing package names as keys. For oroot... it will be
>> > "self.porttrees[1:]", not "self.porttrees" - this will only search
>> > overlays. d = {} will be replaced with d = self.cp_index(). If the index
>> > is not there, the old version will be used (thus, you have to make an
>> > internal porttrees variable, which contains all, or all except the first).
>> >
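
The cp_index()/fallback shape described above might be sketched like this. Every name here except `porttrees` is hypothetical, and portage's real porttree db is far more involved; this only illustrates "use the index dict when present, otherwise fall back to scanning the trees":

```python
import os
import pickle

class FakePortTree:
    """Hypothetical stand-in for portage's porttree db (illustration only)."""

    def __init__(self, porttrees, index_path=None):
        self.porttrees = porttrees    # [root, overlay1, overlay2, ...]
        self.index_path = index_path  # assumed on-disk index location, or None

    def cp_index(self):
        """Return {category/package: tree_path} from the on-disk index,
        or None if no index has been built."""
        if self.index_path is None or not os.path.exists(self.index_path):
            return None
        with open(self.index_path, "rb") as f:
            return pickle.load(f)

    def cp_all(self):
        d = self.cp_index()
        if d is None:
            # Old behaviour: walk every tree on disk (root + overlays).
            d = {}
            for tree in self.porttrees:
                for cat in sorted(os.listdir(tree)):
                    catdir = os.path.join(tree, cat)
                    if not os.path.isdir(catdir):
                        continue
                    for pkg in os.listdir(catdir):
                        d[cat + "/" + pkg] = tree
        return sorted(d)
```

Restricting the scan to overlays only, as suggested, would mean iterating `self.porttrees[1:]` in the fallback branch.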
>> > Other methods used by search are xmatch and aux_get - the first used
>> > several times and the last one used to get the description. You have to
>> > cache results of those specific queries and make them use your cache -
>> > as you can see, those parts of portage are already able to use overlays.
>> > Thus, you have to put your code again at the beginning of those
>> > functions - create index_xmatch and index_aux_get methods, then make
>> > those methods use them and return their results unless those are None
>> > (or something else in case None is already a legal result) - if they
>> > return None, the old code will be run and do its job. If the index is
>> > not created, the result is None. In the index_** methods, just check if
>> > the query is what you can answer and if it is, then answer it.
>> >
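
That "answer from the index if you can, otherwise return None and let the old code run" pattern might look like this. All names here are hypothetical sketches; portage's real xmatch/aux_get signatures are richer:

```python
class IndexedSearch:
    """Illustration of the index-first/fallback dispatch described above."""

    def __init__(self, index=None):
        # index maps (query_level, atom) -> cached result; None = no index built
        self._index = index

    def index_xmatch(self, level, atom):
        """Answer from the index, or None if this query can't be answered."""
        if self._index is None:
            return None
        return self._index.get((level, atom))

    def xmatch(self, level, atom):
        result = self.index_xmatch(level, atom)
        if result is not None:
            return result
        return self._old_xmatch(level, atom)  # old code path does its job

    def _old_xmatch(self, level, atom):
        # Stand-in for portage's real (slow, tree-walking) implementation.
        return [atom + "-1.0"]
```

Note the caveat from the text: if None can ever be a legal cached result, a distinct sentinel is needed instead of None to signal "not in index".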
>> > Obviously, the simplest way to create your index is to delete the index,
>> > then use those same methods to query for all necessary information - and
>> > the fastest way would be to add updating of the index directly into
>> > sync, which you could do later.
>> >
>> > Please, also, make commands to turn the index on and off (the latter
>> > should also delete it to save disk space). The default should be off
>> > until it's fast, small and reliable. Also notice that if the index is
>> > kept on a hard drive, it might be faster if it's compressed (gz, for
>> > example) - decompressing takes less time and more processing power than
>> > reading it fully out.
>>
>> I'm pretty sure you're mistaken here, unless your index is stored on a
>> floppy or something really slow.
>>
>> A disk read has 2 primary costs.
>>
>> Seek Time: Time for the head to seek to the sector of disk you want.
>> Spin Time: Time for the platter to spin around such that the sector
>> you want is under the read head.
>>
>> Spin Time is based on rpm: for a 7200 rpm drive, 7200 / 60 seconds = 120
>> rotations per second, so worst case (you just passed the sector you
>> need) you wait 1/120th of a second (about 8ms).
>>
>> Seek Time is per hard drive, but most drives provide average seek
>> times under 10ms.
>>
>> So it takes on average 18ms to get to your data, then you start
>> reading. The index will not be that large (my esearchdb is 2 megs,
>> but let's assume 10MB for this compressed index).
>>
>> I took a 10MB sqlite database and compressed it with gzip (default
>> settings) down to 5 megs.
>> gzip -d on the database takes 300ms; catting the decompressed database
>> takes 88ms (average of 5 runs, disk caches dropped between runs).
>>
>> I then tried my vdb_metadata.pickle from
>> /var/cache/edb/vdb_metadata.pickle
>>
>> 1.3 megs compresses to 390k.
>>
>> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
>> from disk.
>>
>> Your index would have to be very large or very fragmented on disk
>> (requiring more than one seek) to see a significant gain from
>> compression (gzip scales linearly).
>>
>> In short, don't compress the index ;p
>>
>> >
>> > Have luck!
>> >
>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> Hash: SHA1
>> >>>
>> >>> Emma Strubell wrote:
>> >>> > 2) does anyone really need to search an overlay anyway?
>> >>>
>> >>> Of course. Take large (semi-)official overlays like sunrise. They can
>> >>> easily be seen as a second portage tree.
>> >>> -----BEGIN PGP SIGNATURE-----
>> >>> Version: GnuPG v2.0.9 (GNU/Linux)
>> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>> >>>
>> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
>> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
>> >>> =+lCO
>> >>> -----END PGP SIGNATURE-----
>> >>>
>> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
>> >> wrote:
>> >>
>> >
>> >
>
>