It might be that your hard drive is not much slower than memory - but I
really doubt that. Or it could mean that reading the gzip binary from disk is
much slower than reading cat, which is highly probable given gzip's file
size.

Actually, simple logic says that reading a compressed file and decompressing
it can be faster than just loading the uncompressed original. What I
personally used was *much* faster than going without compression, but that
was a C++ application that kept the zip in memory at all times, and the data
was stored highly inefficiently to begin with.

To see what I mean, try this test: create a file and write the character "a"
into it about 10,000,000 times to make it that big, then run the same
comparison on that file. I think it's likely that the resulting file will be
really fast to decompress.
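A minimal sketch of that experiment (file names are arbitrary; exact timings will vary with the machine and with whether the data is already in the disk cache):

```python
import gzip
import os
import time

# Create a highly redundant ~10 MB file: the character "a" repeated 10,000,000 times.
with open("test_a.txt", "wb") as f:
    f.write(b"a" * 10_000_000)

# Compress it; a run of identical bytes compresses extremely well.
with open("test_a.txt", "rb") as src, gzip.open("test_a.txt.gz", "wb") as dst:
    dst.write(src.read())

print("plain size:     ", os.path.getsize("test_a.txt"))
print("compressed size:", os.path.getsize("test_a.txt.gz"))

# Time a plain read versus decompressing the gzipped copy.
start = time.time()
with open("test_a.txt", "rb") as f:
    plain = f.read()
print("plain read: %.1f ms" % ((time.time() - start) * 1000))

start = time.time()
with gzip.open("test_a.txt.gz", "rb") as f:
    unzipped = f.read()
print("decompress: %.1f ms" % ((time.time() - start) * 1000))

assert plain == unzipped
```

On pathological input like this, the compressed file is a tiny fraction of the original, which is the most favorable possible case for the decompress-instead-of-read argument.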

Anyway, you have still made me think that, at least at first, no compression
should be used :) - if only because your tests pulled in several new
variables, like the time to read the decompression utility itself from disk.

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.


2008/12/2 Alec Warner <antarus@g.o>

> On Tue, Dec 2, 2008 at 4:42 AM, Tambet <qtvali@×××××.com> wrote:
> > About zipping.. Default settings might not really be a good idea - I
> > think that "fastest" might be even better. Considering that the portage
> > tree contains the same words again and again (like "applications"), it
> > needs a pretty small dictionary to make it much smaller. Decompressing
> > will not be reading from disc, decompressing and writing back to disc as
> > in your case, probably - try decompression to a memory drive and you
> > might get better numbers.
>
> I ran gzip -d -c file.gz > /dev/null, which should not write to disk.
>
> I tried again with gzip -1 and it still takes 29ms to decompress (even
> with gzip -1) where a bare read takes 26ms. (I have a 2.6GHz X2, which
> is probably relevant to gzip decompression speed.)
>
> >
> > I have personally used compression in one C++ application, and with
> > optimum settings it made things much faster - those were files where I
> > had, for example, 65536 16-byte integers, which could be zeros and mostly
> > were; I didn't care about creating a better file format, but just
> > compressed the whole thing.
>
> I'm not saying compression won't make the index smaller. I'm saying
> making the index smaller does not improve performance. If you have a
> 10 meg file and you make it 1 meg, you do not increase performance,
> because (on average) you are not saving enough time reading the
> smaller file; you pay it back decompressing the smaller file
> later.
>
> >
> > I suggest you compress the esearch db, then decompress it to a memory
> > drive and give us those numbers - it might be considerably faster.
>
> gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1) takes
> on average (6 trials, dropped caches between trials) 35.1666ms.
>
> cat esearchdb.py > /dev/null (uncompressed) takes an average, over 6
> trials, of 24ms.
>
> The point is that you use compression when you need to save space (sending
> data over the network, or storing large amounts of data or a lot of
> something). The index isn't going to be big (if it is bigger than 20
> or 30 meg I'll be surprised), the index isn't going over the network,
> and there is only 1 index, not say a million indexes (where
> compression might actually be useful for some kind of LRU subset of
> indexes to meet disk requirements).
>
> Anyway this is all moot since, as you stated so well earlier,
> optimization comes last, so stop trying to do it now ;)
>
> -Alec
>
> >
> > http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> > support. Try an open with that and a normal open on the esearch db; also
> > compress with the same lib to get the right kind of file.
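The comparison suggested above can be sketched with the standard-library gzip module. A small sample file stands in for the esearch db here, and it is compressed with the same lib so gzip.open() gets "the right kind of file":

```python
import gzip

# Sample data with lots of repetition, like a package index would have.
data = b"app-editors/vim A highly configurable text editor\n" * 1000

# Write it both uncompressed and compressed with the same library.
with open("db.txt", "wb") as f:
    f.write(data)
with gzip.open("db.txt.gz", "wb") as f:
    f.write(data)

# Normal open versus gzip open - the calling code reads bytes either way,
# so swapping one for the other is transparent to the reader.
with open("db.txt", "rb") as f:
    plain = f.read()
with gzip.open("db.txt.gz", "rb") as f:
    unzipped = f.read()

assert plain == unzipped
```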
> >
> > Anyway - maybe this compression should be added later, and be optional.
> >
> > Tambet - technique evolves to art, art evolves to magic, magic evolves to
> > just doing.
> >
> >
> > 2008/12/2 Alec Warner <antarus@g.o>
> >>
> >> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
> >> > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
> >> >>
> >> >> True, true. Like I said, I don't really use overlays, so excuse my
> >> >> ignorance.
> >> >
> >> > Do you know the order of doing things?
> >> >
> >> > Rules of Optimization:
> >> >
> >> > Rule 1: Don't do it.
> >> > Rule 2 (for experts only): Don't do it yet.
> >> >
> >> > What this actually means: functionality comes first. Readability
> >> > comes next. Optimization comes last. Unless you are creating a fancy
> >> > 3D engine for a kung fu game.
> >> >
> >> > If you are going to exclude overlays, you are removing functionality -
> >> > and, indeed, absolutely has-to-be-there functionality, because no one
> >> > would intuitively expect the search function to search only one
> >> > subset of packages, however reasonable this subset might be. So you
> >> > can't, just can't, add this package into the portage base - you could
> >> > write just another external search package for portage.
> >> >
> >> > I looked at this code a bit:
> >> > Portage's "__init__.py" contains the comment "# search functionality".
> >> > After this comment, there is a nice and simple search class.
> >> > It also contains the method "def action_sync(...)", which contains
> >> > synchronization stuff.
> >> >
> >> > Now, the search class is initialized by setting up 3 databases -
> >> > porttree, bintree and vartree, whatever those are. Those will be in
> >> > the self._dbs array, and porttree will be in self._portdb.
> >> >
> >> > It contains some more methods:
> >> > _findname(...) will return the result of self._portdb.findname(...)
> >> > with the same parameters, or None if it does not exist.
> >> > Other methods do similar things - each maps to one method or another.
> >> > execute will do the real search...
> >> > Now - "for package in self.portdb.cp_all()" is important here ...it
> >> > currently loops over the whole portage tree. All kinds of matching
> >> > will be done inside.
> >> > self.portdb obviously points to porttree.py (unless it points to a
> >> > fake tree).
> >> > cp_all will take all porttrees and do a simple file search inside.
> >> > This method should contain an optional index search.
> >> >
> >> > self.porttrees = [self.porttree_root] + \
> >> >     [os.path.realpath(t) for t in
> >> >      self.mysettings["PORTDIR_OVERLAY"].split()]
> >> >
> >> > So, self.porttrees contains a list of trees - the first of them is
> >> > the root, the others are overlays.
> >> >
> >> > Now, what you have to do will not be any harder just because it
> >> > includes overlay search, too.
> >> >
> >> > You have to create a method def cp_index(self), which will return a
> >> > dictionary containing package names as keys. For oroot... it will be
> >> > "self.porttrees[1:]", not "self.porttrees" - this will search only
> >> > the overlays. d = {} will be replaced with d = self.cp_index(). If
> >> > the index is not there, the old version will be used (thus, you have
> >> > to make an internal porttrees variable, which contains all, or all
> >> > except the first).
> >> >
> >> > The other methods used by search are xmatch and aux_get - the first
> >> > is used several times and the last one is used to get the
> >> > description. You have to cache the results of those specific queries
> >> > and make them use your cache - as you can see, those parts of portage
> >> > are already able to use overlays. Thus, you have to put your code
> >> > again at the beginning of those functions - create index_xmatch and
> >> > index_aux_get methods, then make xmatch and aux_get use them and
> >> > return their results unless those are None (or something else, in
> >> > case None is already a legal result) - if they return None, the old
> >> > code will run and do its job.
> >> > If the index is not created, the result is None. In the index_**
> >> > methods, just check whether the query is one you can answer, and if
> >> > it is, answer it.
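The fall-through pattern described above could look roughly like this. This is a sketch only - the class name, index layout and the `_slow_xmatch` stub are illustrative, not actual portage code; only the xmatch/index_xmatch naming comes from the suggestion itself:

```python
class IndexedPortdb:
    """Sketch of the fall-through idea: answer from the index when
    possible, otherwise return None so the original slow path runs."""

    def __init__(self, index=None):
        # index maps "xmatch" -> {(level, atom): result}; None means
        # "no index has been built", so every query falls through.
        self.index = index

    def index_xmatch(self, level, atom):
        # Answer only queries the index can handle; None falls through.
        if self.index is None:
            return None
        return self.index.get("xmatch", {}).get((level, atom))

    def xmatch(self, level, atom):
        cached = self.index_xmatch(level, atom)
        if cached is not None:
            return cached
        # The original, slower lookup runs only on a cache miss.
        return self._slow_xmatch(level, atom)

    def _slow_xmatch(self, level, atom):
        # Placeholder for the real tree walk.
        return ["%s-1.0" % atom]


# Hypothetical usage: one query served from the index, one falling through.
db = IndexedPortdb({"xmatch": {("match-all", "app-editors/vim"):
                               ["app-editors/vim-7.2"]}})
print(db.xmatch("match-all", "app-editors/vim"))
print(db.xmatch("match-all", "dev-lang/python"))
```

index_aux_get would follow the same shape with descriptions as values. Note the caveat from the text: if None is a legal result for some query, a different sentinel is needed to distinguish "not indexed" from "indexed as None".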
> >> >
> >> > Obviously, the simplest way to create your index is to delete the
> >> > index, then use those same methods to query for all necessary
> >> > information - and the fastest way would be to add updating of the
> >> > index directly into sync, which you could do later.
> >> >
> >> > Please also make commands to turn the index on and off (the latter
> >> > should also delete it, to save disk space). The default should be off
> >> > until it's fast, small and reliable. Also notice that if the index is
> >> > kept on a hard drive, it might be faster if it's compressed (gz, for
> >> > example) - decompressing takes less time and more processing power
> >> > than reading it fully out.
> >>
> >> I'm pretty sure you're mistaken here, unless your index is stored on a
> >> floppy or something really slow.
> >>
> >> A disk read has 2 primary costs.
> >>
> >> Seek Time: Time for the head to seek to the sector of disk you want.
> >> Spin Time: Time for the platter to spin around such that the sector
> >> you want is under the read head.
> >>
> >> Spin Time is based on rpm, so 7200 rpm / 60 seconds = 120
> >> rotations per second, so worst case (you just passed the sector you
> >> need) you have to wait 1/120th of a second (about 8ms).
> >>
> >> Seek Time is per hard drive, but most drives provide average seek
> >> times under 10ms.
> >>
> >> So it takes on average 18ms to get to your data, then you start
> >> reading. The index will not be that large (my esearchdb is 2 megs,
> >> but let's assume 10MB for this compressed index).
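The arithmetic above can be checked directly (the 7200 rpm and 10 ms seek figures are the rough averages Alec assumes, not measurements):

```python
rpm = 7200
rotations_per_second = rpm / 60                # 120 rotations per second
worst_case_spin_ms = 1000 / rotations_per_second  # full rotation: ~8.3 ms
avg_seek_ms = 10                               # typical quoted average seek

# Worst-case rotational delay plus an average seek, per disk access.
access_ms = worst_case_spin_ms + avg_seek_ms
print("worst-case rotational delay: %.1f ms" % worst_case_spin_ms)
print("approximate access cost:     %.1f ms" % access_ms)
```

This fixed per-access cost is why compression only pays off once the saved transfer time exceeds the decompression time - the ~18 ms to reach the data is spent either way.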
> >>
> >> I took a 10MB sqlite database and compressed it with gzip (default
> >> settings) down to 5 megs.
> >> gzip -d on the database takes 300ms; catting the decompressed database
> >> takes 88ms (average of 5 runs, dropping disk caches between runs).
> >>
> >> I then tried my vdb_metadata.pickle from
> >> /var/cache/edb/vdb_metadata.pickle
> >>
> >> 1.3 megs compresses to 390k.
> >>
> >> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file
> >> from disk.
> >>
> >> Your index would have to be very large or very fragmented on disk
> >> (requiring more than one seek) to see a significant gain from file
> >> compression (gzip scales linearly).
> >>
> >> In short, don't compress the index ;p
> >>
> >> >
> >> > Good luck!
> >> >
> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> Hash: SHA1
> >> >>>
> >> >>> Emma Strubell wrote:
> >> >>> > 2) does anyone really need to search an overlay anyway?
> >> >>>
> >> >>> Of course. Take large (semi-)official overlays like sunrise. They
> >> >>> can easily be seen as a second portage tree.
> >> >>> -----BEGIN PGP SIGNATURE-----
> >> >>> Version: GnuPG v2.0.9 (GNU/Linux)
> >> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >> >>>
> >> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
> >> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
> >> >>> =+lCO
> >> >>> -----END PGP SIGNATURE-----
> >> >>>
> >> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
> >> >> wrote:
> >> >>
> >> >
> >> >
> >
> >
>