On February 7, 2003 12:34 pm, Yannick Koehler wrote:
> Discussion with carpaski revealed that the use of the database will
> actually not speed up emerge, because emerge loads the cache into an
> internal in-memory database, and Python lets it keep that in memory
> between runs, making it very fast and efficient as only the required
> entries of the database get loaded instead of the whole database.

Just a note about that comment I made. When I said that it wouldn't speed up
emerge, I was referring to specific functions. For example, if you run for
the first time:

emerge kde

emerge will then fetch the kde ebuild, get its dependencies and fetch all of
them, cache this information inside its internal persistent db, and then
execute the operation.

In DB mode, the database, or at least its index, needs to be loaded in some
way. It is hard to imagine that the number of I/O operations will actually be
lower than in the current process described above.

But there are cases where a db would speed up portage, and that's why the xml
file is getting interesting. It allows importing into a db that knows about
xml, or trying things out using xml/text related tools.

> Some benefits I see from the xml db are for side-tools. For example,
> searching the descriptions of ebuilds is faster when using the xml db, as
> it is a single file and the software only looks for strings that start
> with <description>. One can use grep/regexp to do such queries or build an
> xml-capable application.
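
As a rough illustration of the kind of side-tool meant here, the sketch below
searches description text with a regexp. It assumes that each <description>
element sits on a single line together with its closing tag; the real layout
of gentoo.xml is not shown in this mail, and the function name
search_descriptions is made up for the example.

```python
import re
from typing import Iterable, List

# Assumption: a description is written as <description>text</description>
# on one line. This layout is hypothetical, not taken from gentoo.xml.
DESCRIPTION_RE = re.compile(r"<description>(.*?)</description>")


def search_descriptions(lines: Iterable[str], keyword: str) -> List[str]:
    """Return every description text containing keyword (case-insensitive)."""
    needle = keyword.lower()
    hits = []
    for line in lines:
        for text in DESCRIPTION_RE.findall(line):
            if needle in text.lower():
                hits.append(text)
    return hits
```

Since an open file is an iterable of lines, something like
search_descriptions(open("gentoo.xml"), "kde") would scan the file once
without building an xml tree in memory.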

I have done the following experiment. When I posted the original mail, I
generated the gentoo.xml file. The file was actually 4762563 bytes. I
reported it as 9525071 bytes, but this was wrong: my script had generated a
doubled file. I found this by using

grep "version name=\"kdelibs-3.1" gentoo.xml

which output two instances. After correcting this, the file was smaller.
Something else that got my attention is that generating gentoo.xml today
gave me a 4852253 byte file. Now if you compare the date/size:

2003-02-07 12:34 -> 4762563 bytes
2003-02-11 09:10 -> 4852253 bytes

This is an 89690 byte difference. Running emerge rsync daily costs me more
than 1.2 megs a shot. So a quick calculation:

11 Feb - 7 Feb  = 4 days
4 * 1.2 megs    = 4.8 megs
4 * 89k         = 356k (counting the whole 89k growth once per day, to be generous)
4.8 megs - 356k = ~4.45 megs

Which means that, if gentoo.xml contained all the info required for
calculating the dependencies and fetching only the required ebuilds, which
I'm pretty sure it does, I have wasted about 4.45 megs of bandwidth over the
past 4 days.
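
For anyone who wants to redo the arithmetic, here is a small sketch using the
byte counts quoted in this mail. It assumes decimal megs (1 meg = 1,000,000
bytes) and charges the whole 4-day xml growth once per day, which overstates
the xml cost and so understates the saving; the variable names are only
illustrative.

```python
# File sizes of gentoo.xml as reported above, in bytes.
XML_SIZE_FEB_07 = 4_762_563
XML_SIZE_FEB_11 = 4_852_253

DAYS = 4
RSYNC_BYTES_PER_DAY = 1_200_000  # "more than 1.2 megs a shot", decimal megs assumed

# Growth of the xml file over the whole period.
xml_growth = XML_SIZE_FEB_11 - XML_SIZE_FEB_07

# Total cost of the daily emerge rsync runs.
rsync_total = DAYS * RSYNC_BYTES_PER_DAY

# Pessimistic xml cost: pretend the full growth had to be fetched every day.
xml_total = DAYS * xml_growth

wasted = rsync_total - xml_total

print(f"xml growth:  {xml_growth} bytes")
print(f"rsync total: {rsync_total} bytes")
print(f"xml total:   {xml_total} bytes")
print(f"wasted:      {wasted} bytes (~{wasted / 1_000_000:.2f} megs)")
```

File growth is only a proxy for what rsync would really transfer for the xml
file, so the result is an estimate, not a measurement.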

Now consider that I'm not alone, and that in my case I have a shared
repository both at home and at work; this is a huge waste of bandwidth for
only 4 days... And those are bytes, not bits...

Other information...

ykoehler@corneille ykoehler $ time grep "version name=" -c gentoo.xml gentoo2.xml
gentoo.xml:7262
gentoo2.xml:7373

real 0m0.107s
user 0m0.010s
sys 0m0.050s

It takes less than 1 second for grep to scan the files and retrieve all
occurrences of version name=". While it is true that grep doesn't build data
structures in memory or parse the inner part of the version tag, this
actually makes me rethink my original discussion with carpaski, where we came
to the conclusion that the speedup for emerge would be minimal. I now think
that a proper xml file, one that minimizes parsing even further by doing more
of the work at generation time, has the potential for huge savings in
bandwidth, hard disk space and speed on a local pc.
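
To give an idea of what a tool does once it goes one step beyond grep, the
sketch below collects the value of every version name="..." attribute into a
counter. That gives the total number of entries and also exposes doubled
entries like the ones my broken script produced. The regexp relies only on
the version name=" string used in the grep commands above; the helper names
are made up for the example. Note that grep -c counts matching lines, while
this counts occurrences.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches version name="..." and captures the quoted value.
VERSION_RE = re.compile(r'version name="([^"]*)"')


def count_versions(lines: Iterable[str]) -> Counter:
    """Count how often each version name occurs in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(VERSION_RE.findall(line))
    return counts


def duplicated(counts: Counter) -> List[str]:
    """Return the version names that occur more than once, sorted."""
    return sorted(name for name, seen in counts.items() if seen > 1)
```

With counts = count_versions(open("gentoo.xml")), sum(counts.values()) is the
total number of version entries and duplicated(counts) lists any doubled
ones.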

It also takes less time to generate the xml file than to run emerge rsync
for 1 day of changes. For example, I ran emerge rsync this morning:

rsync[15612] (receiver) heap statistics:
arena: 5362104 (bytes from sbrk)
ordblks: 551 (chunks not in use)
smblks: 2
hblks: 1 (chunks from mmap)
hblkhd: 258048 (bytes from mmap)
usmblks: 0
fsmblks: 40
uordblks: 4667912 (bytes used)
fordblks: 694192 (bytes free)
keepcost: 53464 (bytes in releasable chunk)

Number of files: 36464
Number of files transferred: 14444
Total file size: 29101431 bytes
Total transferred file size: 14445832 bytes
Literal data: 69 bytes
Matched data: 14445763 bytes
File list size: 848703
Total bytes written: 408076
Total bytes read: 1349498

wrote 408076 bytes read 1349498 bytes 1858.88 bytes/sec
total size is 29101431 speedup is 16.56

>>> Updating Portage cache...

real 16m19.682s
user 0m21.990s
sys 0m40.790s

So 16 minutes and 1.3 megs later, I got the database update for today...

corneille dep # time /home/ykoehler/test.sh >/home/ykoehler/gentoo3.xml

real 1m17.284s
user 0m21.780s
sys 0m32.350s

That generated the gentoo.xml file at a size of 4865557 bytes. Had I rsynced
it from the server, that would have translated into a change of
(4865557 - 4852253) ~13304 bytes.
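
To put the two runs side by side, the sketch below converts the "real" times
printed by time into seconds and compares both the durations and the byte
counts. All figures are the ones quoted in this mail; the to_seconds helper
is made up for the example. Keep in mind that the comparison is rough: the
rsync run includes network transfer and the cache update, the script run is
purely local, and the xml figure is a file size difference, not a measured
transfer.

```python
import re

# Format printed by the shell's time builtin, e.g. "16m19.682s".
TIME_RE = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def to_seconds(stamp: str) -> float:
    """Convert a 'XmY.YYYs' time string into seconds."""
    match = TIME_RE.match(stamp)
    if match is None:
        raise ValueError(f"unrecognised time format: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


rsync_real = to_seconds("16m19.682s")  # emerge rsync this morning
xml_real = to_seconds("1m17.284s")     # test.sh generating the xml file

rsync_bytes_read = 1_349_498           # "Total bytes read" from rsync
xml_delta = 4_865_557 - 4_852_253      # growth of the xml file since last run

print(f"time:  {rsync_real:.1f}s vs {xml_real:.1f}s "
      f"({rsync_real / xml_real:.1f}x)")
print(f"bytes: {rsync_bytes_read} vs {xml_delta} "
      f"({rsync_bytes_read / xml_delta:.0f}x)")
```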

I heard that there is already work going on to get portage to use a real db
such as berkeley or mysql etc. In any case, I think the distribution format
that makes the most sense is xml. This format can easily be manipulated using
xslt to fit the needs of many people and can be used with almost any existing
text tool. Using xsl it can also be nicely converted into HTML, and a
dependency tree is easily built from there. rsync could quickly figure out
which parts changed and update those in a fraction of the time it takes today
to diff the trees.

Hardware info:
P3 800 MHz
Slow IDE hard disk, 5400 rpm
256 megs RAM
i810 chipset

So, with those numbers now extracted, I'm about to attempt to move emerge to
this system on my PC and get you more "real" numbers using that mode.

--

Yannick Koehler

--
gentoo-dev@g.o mailing list