About zipping: default settings might not really be a good idea - I think
that "fastest" might be even better. Considering that the portage tree contains
the same word again and again (like "applications"), it needs only a pretty
small dictionary to make it much smaller. Decompressing will not be reading
from disc, decompressing, and writing back to disc as in your case, probably -
try decompressing to a memory drive and you might get better numbers.

I have personally used compression in one C++ application, and with optimal
settings it made things much faster. Those were files where I had, for
example, 65536 16-byte integers, which could be zeros and mostly were; I
didn't care about creating a better file format, but just compressed the
whole thing.

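That zero-heavy case can be illustrated with a small sketch using Python's zlib (the original application was C++; the data here is just an illustration, not the real files):

```python
import zlib

# 65536 16-byte records that are mostly zeros, like the files described above.
data = bytes(65536 * 16)

# Level 1 is the "fastest" setting; highly repetitive input needs only a
# tiny dictionary, so even the fastest level shrinks it drastically.
packed = zlib.compress(data, 1)

print(len(data), len(packed))
```

Even at the fastest level, a mostly-zero megabyte collapses to a few kilobytes, which is the effect described above.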
I suggest you compress the esearch db, then decompress it to a memory drive
and give us those numbers - it might be considerably faster.

http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip support.
Try opening the esearch db with that and with a normal open; also compress it
with the same lib to get the right kind of file.

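A minimal sketch of that round trip, using a hypothetical stand-in file instead of the real esearch db (the gzip module at the URL above provides gzip.open as a near drop-in for open):

```python
import gzip

# Hypothetical stand-in for the esearch db; substitute the real file path.
sample = b'packages = ["app-editors/vim", "app-editors/emacs"]\n' * 1000
with open("esearchdb_sample.py", "wb") as f:
    f.write(sample)

# Compress with the same lib to get the right kind of file.
with gzip.open("esearchdb_sample.py.gz", "wb") as f:
    f.write(sample)

# Reading back: gzip.open decompresses transparently, like a normal open.
with gzip.open("esearchdb_sample.py.gz", "rb") as f:
    restored = f.read()
```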
Anyway - maybe this compression should be added later, and made optional.

Tambet - technique evolves to art, art evolves to magic, magic evolves to just doing.

2008/12/2 Alec Warner <antarus@g.o>

> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@×××××.com> wrote:
> > 2008/12/2 Emma Strubell <emma.strubell@×××××.com>
> >>
> >> True, true. Like I said, I don't really use overlays, so excuse my
> >> ignorance.
> >
> > Do you know an order of doing things:
> >
> > Rules of Optimization:
> >
> > Rule 1: Don't do it.
> > Rule 2 (for experts only): Don't do it yet.
> >
> > What this actually means: functionality comes first. Readability comes
> > next. Optimization comes last. Unless you are creating a fancy 3D engine
> > for a kung fu game.
> >
> > If you are going to exclude overlays, you are removing functionality -
> > and, indeed, absolutely has-to-be-there functionality, because no one
> > would intuitively expect a search function to search only one subset of
> > packages, however reasonable this subset would be. So you can't, just
> > can't, add this package into the portage base - but you could write just
> > another external search package for portage.
> >
> > I looked at this code a bit:
> > Portage's "__init__.py" contains the comment "# search functionality".
> > After this comment, there is a nice and simple search class.
> > It also contains the method "def action_sync(...)", which holds the
> > synchronization stuff.
> >
> > Now, the search class is initialized by setting up 3 databases -
> > porttree, bintree and vartree, whatever those are. Those will be in the
> > self._dbs array, and porttree will be in self._portdb.
66 |
> > |
67 |
> > It contains some more methods: |
68 |
> > _findname(...) will return result of self._portdb.findname(...) with same |
69 |
> > parameters or None if it does not exist. |
70 |
> > Other methods will do similar things - map one or another method. |
71 |
> > execute will do the real search... |
72 |
> > Now - "for package in self.portdb.cp_all()" is important here ...it |
73 |
> > currently loops over whole portage tree. All kinds of matching will be |
74 |
> done |
75 |
> > inside. |
76 |
> > self.portdb obviously points to porttree.py (unless it points to fake |
77 |
> tree). |
78 |
> > cp_all will take all porttrees and do simple file search inside. This |
79 |
> method |
80 |
> > should contain optional index search. |
> >
> > self.porttrees = [self.porttree_root] + \
> >     [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()]
> >
> > So, self.porttrees contains a list of trees - the first of them is the
> > root, the others are overlays.
> >
> > Now, what you have to do will not be any harder just because of having
> > overlay search, too.
> >
> > You have to create a method def cp_index(self), which returns a
> > dictionary containing package names as keys. For oroot... it will be
> > "self.porttrees[1:]", not "self.porttrees" - this will only search the
> > overlays. d = {} will be replaced with d = self.cp_index(). If the index
> > is not there, the old version will be used (thus, you have to make an
> > internal porttrees variable, which contains all trees, or all except the
> > first).
> >
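A minimal sketch of what such a cp_index method could look like. The class and directory-walking details here are hypothetical stand-ins; the real search class in portage's __init__.py has its own self.porttrees handling:

```python
import os

class SearchIndexSketch:
    """Stand-in for portage's search class; only the index part is shown."""

    def __init__(self, porttrees):
        # porttrees[0] is the root tree, the rest are overlays.
        self.porttrees = porttrees

    def cp_index(self):
        """Return a dict keyed by "category/package" for the overlay trees.

        Walking self.porttrees[1:] indexes only the overlays, as described
        above; walking self.porttrees would index everything.
        """
        d = {}
        for oroot in self.porttrees[1:]:
            for cat in sorted(os.listdir(oroot)):
                cat_dir = os.path.join(oroot, cat)
                if not os.path.isdir(cat_dir):
                    continue
                for pkg in sorted(os.listdir(cat_dir)):
                    if os.path.isdir(os.path.join(cat_dir, pkg)):
                        d.setdefault("%s/%s" % (cat, pkg), []).append(oroot)
        return d
```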
> > Other methods used by search are xmatch and aux_get - the first is used
> > several times and the last one is used to get the description. You have
> > to cache the results of those specific queries and make them use your
> > cache - as you can see, those parts of portage are already able to use
> > overlays. Thus, you have to put your code again at the beginning of
> > those functions - create index_xmatch and index_aux_get methods, then
> > make those functions use them and return their results unless those are
> > None (or something else in case None is already a legal result) - if
> > they return None, the old code will run and do its job. If the index is
> > not created, the result is None. In the index_** methods, just check
> > whether the query is something you can answer, and if it is, answer it.
> >
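The index-or-fall-back pattern described above can be sketched like this (the class and method bodies are hypothetical; None stands for "not answerable from the index", and the slow path stands in for portage's real xmatch):

```python
class XmatchFallbackSketch:
    """Illustrates answering from an index first, then falling back."""

    def __init__(self, index=None):
        # index maps (level, atom) query tuples to cached results;
        # it is None when no index has been built.
        self._index = index

    def index_xmatch(self, level, atom):
        """Answer from the index if possible, else return None."""
        if self._index is None:
            return None
        return self._index.get((level, atom))

    def xmatch(self, level, atom):
        cached = self.index_xmatch(level, atom)
        if cached is not None:
            return cached
        # Old code path: the full, slow lookup runs only on a miss.
        return self._slow_xmatch(level, atom)

    def _slow_xmatch(self, level, atom):
        # Placeholder for the original implementation.
        return ["%s (slow path)" % atom]
```

aux_get would follow the same shape with an index_aux_get method; the only subtlety is picking a sentinel other than None if None is ever a legal cached result.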
> > Obviously, the simplest way to create your index is to delete the
> > index, then use those same methods to query for all necessary
> > information - and the fastest way would be to add updating the index
> > directly into sync, which you could do later.
> >
> > Please also add commands to turn the index on and off (the latter
> > should also delete it to save disk space). The default should be off
> > until it's fast, small and reliable. Also notice that if the index is
> > kept on a hard drive, it might be faster if it's compressed (gz, for
> > example) - decompressing takes less time, and more processing power,
> > than reading it fully out.
> >
> I'm pretty sure you're mistaken here, unless your index is stored on a
> floppy or something really slow.
>
> A disk read has 2 primary costs.
>
> Seek Time: Time for the head to seek to the sector of disk you want.
> Spin Time: Time for the platter to spin around such that the sector
> you want is under the read head.
>
> Spin Time is based on rpm, so 7200 rpm / 60 seconds = 120 rotations
> per second, so in the worst case (you just passed the sector you
> need) you wait 1/120th of a second (about 8ms).
>
> Seek Time is per hard drive, but most drives provide average seek
> times under 10ms.
>
> So it takes on average 18ms to get to your data, then you start
> reading. The index will not be that large (my esearchdb is 2 megs,
> but let's assume 10MB for this compressed index).
>
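Spelling that arithmetic out (these are the figures quoted above, not new measurements):

```python
rpm = 7200
rotations_per_sec = rpm / 60.0               # 120 rotations per second
worst_spin_ms = 1000.0 / rotations_per_sec   # ~8.3 ms for one full rotation
avg_seek_ms = 10.0                           # typical quoted average seek time

access_ms = avg_seek_ms + worst_spin_ms      # ~18 ms before any byte arrives
print(round(access_ms, 1))                   # prints 18.3
```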
> I took a 10MB sqlite database and compressed it with gzip (default
> settings) down to 5 megs.
> gzip -d on the database takes 300ms; catting the decompressed database
> takes 88ms (average of 5 runs, dropping disk caches between runs).
>
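A rough way to repeat that comparison in pure Python (the file names and payload are placeholders; without dropping the OS page cache as root, these numbers only show the trend, not cold-cache reads):

```python
import gzip
import os
import time

# Placeholder payload standing in for the ~10MB database file.
payload = os.urandom(1024) * 10240           # ~10MB, partly repetitive
with open("testdb", "wb") as f:
    f.write(payload)
with gzip.open("testdb.gz", "wb") as f:
    f.write(payload)

start = time.time()
with open("testdb", "rb") as f:
    plain = f.read()
plain_ms = (time.time() - start) * 1000.0

start = time.time()
with gzip.open("testdb.gz", "rb") as f:
    unpacked = f.read()
gzip_ms = (time.time() - start) * 1000.0

print("plain: %.1f ms, gzip: %.1f ms" % (plain_ms, gzip_ms))
```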
> I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle
>
> 1.3 megs compresses to 390k.
>
> 36ms to decompress the 390k file, but 26ms to read the 1.3 meg file from
> disk.
>
> Your index would have to be very large or very fragmented on disk
> (requiring more than one seek) to see a significant gain from file
> compression (gzip scales linearly).
>
> In short, don't compress the index ;p
>
> >
> > Have luck!
> >
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA1
> >>>
> >>> Emma Strubell schrieb:
> >>> > 2) does anyone really need to search an overlay anyway?
> >>>
> >>> Of course. Take large (semi-)official overlays like sunrise. They can
> >>> easily be seen as a second portage tree.
> >>> -----BEGIN PGP SIGNATURE-----
> >>> Version: GnuPG v2.0.9 (GNU/Linux)
> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >>>
> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
> >>> =+lCO
> >>> -----END PGP SIGNATURE-----
> >>>
> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@××××××.eu>
> >> wrote:
> >>
> >
> >
>