Gentoo Archives: gentoo-dev

From: Yannick Koehler <yannick.koehler@××××××××.com>
To: gentoo-dev@g.o
Subject: [gentoo-dev] Gentoo XML Database: More Data
Date: Tue, 11 Feb 2003 15:05:15
Message-Id: 200302110956.15022.yannick.koehler@colubris.com
In Reply to: [gentoo-dev] Gentoo XML Database by Yannick Koehler
1 On February 7, 2003 12:34 pm, Yannick Koehler wrote:
2 > A discussion with carpaski suggested that using the database will
3 > actually not speed up emerge, because emerge loads the cache into an
4 > internal in-memory database, and python allows it to keep that around
5 > between runs, making it very fast and efficient, as only the required
6 > entries of the database get loaded instead of the whole database.
7
8 Just a note about that comment I made. When I said that it wouldn't speed up
9 emerge, I was referring to specific operations. For example, if you run, for
10 the first time:
11
12 emerge kde
13
14 Emerge will then fetch the kde ebuild, read its dependencies and fetch all
15 of them, cache this information inside its internal persistent db, and
16 then execute the operation.
17
18 In DB mode, the database, or at least its index, needs to be loaded in some
19 way. It is hard to imagine that the number of I/O operations will actually be
20 lower than with the current approach described above.
21
22 But there are cases where a db would speed up portage, and that's why the xml
23 file is getting interesting. It can be imported into a db that understands
24 xml, or explored using xml/text related tools.
25
26 > One benefit I see from the xml db is for side-tools. For example, searching
27 > the descriptions of ebuilds is faster with the xml db, as it is a single file
28 > and software only has to look for strings that start with <description>. One
29 > can use grep/regexp to do such queries or build an xml-capable application.
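
As a rough illustration of that kind of side-tool, here is a minimal Python sketch that scans xml text for description lines matching a keyword. The sample layout (one <description> element per line), the package names "foo" and "bar", and the search_descriptions helper are all assumptions made for the example; the actual gentoo.xml format may differ.

```python
import re

# Hypothetical sample data; the real gentoo.xml layout may differ.
SAMPLE = """\
<package name="foo">
<description>A small text editor</description>
</package>
<package name="bar">
<description>Library for parsing XML files</description>
</package>
"""

# Matches a <description> element that sits on a single line.
DESC_RE = re.compile(r"<description>(.*?)</description>")


def search_descriptions(text, keyword):
    """Return the description strings that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    hits = []
    for line in text.splitlines():
        match = DESC_RE.search(line)
        if match and needle in match.group(1).lower():
            hits.append(match.group(1))
    return hits


if __name__ == "__main__":
    for description in search_descriptions(SAMPLE, "xml"):
        print(description)
```

A line-by-line regexp is about the simplest thing that works on a single flat file; a real tool would more likely use an xml parser so that multi-line descriptions are handled as well.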
30
31 I did the following experiment. When I posted the original mail, I
32 generated the gentoo.xml file. The file was actually 4762563 bytes. I
33 reported it as 9525071 bytes, but this was wrong. My script had generated a
34 file with its content doubled... I found this by running
35
36 grep "version name=\"kdelibs-3.1" gentoo.xml
37
38 which output two instances. After correcting this, the file was smaller.
39 Something else that got my attention is that generating gentoo.xml today
40 gave me a 4852253-byte file. Now if you compare the dates and sizes:
41
42 2003-02-07 12:34 -> 4762563 bytes
43 2003-02-11 09:10 -> 4852253 bytes
44
45 This is an 89690-byte difference. Running emerge rsync daily costs me
46 more than 1.2 megs a shot. So a quick calculation:
47
48 11 Feb - 7 Feb = 4 days.
49 4 * 1.2 megs = 4.8 megs
50 xml growth over those 4 days = ~90k
51 4.8 megs - 90k = ~4.7 megs
52
53 Which means that, if the gentoo.xml contains all the info required for
54 calculating the dependencies and fetching only the required ebuilds, which
55 I'm pretty sure it does, I have wasted about 4.7 megs of bandwidth over
56 the past 4 days.
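
The estimate can be re-checked with a few lines of Python. This is only a sketch under stated assumptions: it counts the growth of the xml file once for the whole 4-day period, it treats that growth as a stand-in for what would have to be transferred (the real delta could be larger, since entries can change without the file growing), and it reads "1.2 megs" as 1,200,000 bytes. The function names are made up for the example.

```python
# Sizes of the two generated gentoo.xml files, taken from the figures above.
SIZE_FEB_07 = 4762563  # bytes, 2003-02-07 12:34
SIZE_FEB_11 = 4852253  # bytes, 2003-02-11 09:10

DAYS = 4
# Assumption: "1.2 megs" per daily emerge rsync is read as 1,200,000 bytes.
RSYNC_BYTES_PER_DAY = 1_200_000


def xml_growth(old_size, new_size):
    """Net growth of the xml file between the two snapshots, in bytes."""
    return new_size - old_size


def estimated_savings(days, rsync_per_day, growth):
    """Bytes spent on daily rsync runs minus the growth of the xml file."""
    return days * rsync_per_day - growth


if __name__ == "__main__":
    growth = xml_growth(SIZE_FEB_07, SIZE_FEB_11)
    saved = estimated_savings(DAYS, RSYNC_BYTES_PER_DAY, growth)
    print("xml growth over", DAYS, "days:", growth, "bytes")
    print("estimated savings:", saved, "bytes")
```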
57
58 Now consider that I'm not alone, and that in my case I have a shared
59 repository both at home and at work. This is a huge waste of bandwidth for
60 only 4 days... And those are bytes, not bits...
61
62 Other information...
63
64 ykoehler@corneille ykoehler $ time grep "version name=" -c gentoo.xml
65 gentoo2.xml
66 gentoo.xml:7262
67 gentoo2.xml:7373
68
69 real 0m0.107s
70 user 0m0.010s
71 sys 0m0.050s
72
73 It takes about a tenth of a second for grep to scan both files and count all
74 occurrences of version name=". While it is true that grep doesn't build
75 data structures in memory or parse the inside of the version tag, this
76 actually makes me rethink my original discussion with carpaski, where we came
77 to the conclusion that the speedup for emerge would be minimal. I now
78 actually think that a proper xml file, one that moves even more of the
79 parsing work into the xml generation step, has the potential for
80 huge savings in bandwidth, hard disk space and speed on a local pc.
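
For comparison, the same count can be sketched in Python. Like grep -c, this counts matching lines and builds no data structures, so it gives only a lower bound on what a real parser would cost. The helper names are made up for the example, and the file paths are whatever is passed on the command line.

```python
import sys
import time


def count_matching_lines(lines, needle="version name="):
    """Count the lines that contain needle, like grep -c does."""
    return sum(1 for line in lines if needle in line)


def count_in_file(path, needle="version name="):
    """Count matching lines in the file at path without loading it whole."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_matching_lines(handle, needle)


if __name__ == "__main__":
    # Usage: python count_versions.py gentoo.xml gentoo2.xml
    for path in sys.argv[1:]:
        start = time.perf_counter()
        total = count_in_file(path)
        elapsed = time.perf_counter() - start
        print(f"{path}:{total} ({elapsed:.3f}s)")
```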
81
82 It also takes less time to generate the xml file than to run an emerge rsync
83 for 1 day of changes. For example, I ran emerge rsync this morning:
84
85 rsync[15612] (receiver) heap statistics:
86 arena: 5362104 (bytes from sbrk)
87 ordblks: 551 (chunks not in use)
88 smblks: 2
89 hblks: 1 (chunks from mmap)
90 hblkhd: 258048 (bytes from mmap)
91 usmblks: 0
92 fsmblks: 40
93 uordblks: 4667912 (bytes used)
94 fordblks: 694192 (bytes free)
95 keepcost: 53464 (bytes in releasable chunk)
96
97 Number of files: 36464
98 Number of files transferred: 14444
99 Total file size: 29101431 bytes
100 Total transferred file size: 14445832 bytes
101 Literal data: 69 bytes
102 Matched data: 14445763 bytes
103 File list size: 848703
104 Total bytes written: 408076
105 Total bytes read: 1349498
106
107 wrote 408076 bytes read 1349498 bytes 1858.88 bytes/sec
108 total size is 29101431 speedup is 16.56
109
110 >>> Updating Portage cache...
111
112
113 real 16m19.682s
114 user 0m21.990s
115 sys 0m40.790s
116
117 So 16 min. and 1.3 megs later, I got the database update for today...
118
119 corneille dep # time /home/ykoehler/test.sh >/home/ykoehler/gentoo3.xml
120
121 real 1m17.284s
122 user 0m21.780s
123 sys 0m32.350s
124
125 This generated a gentoo.xml file of 4865557 bytes. Had I rsynced it from the
126 server, that would have come to roughly (4865557 - 4852253) = 13304 bytes.
127
128 I heard that there is already work on getting portage to use a real db such
129 as berkeley or mysql etc. In any case, I think the distribution format that
130 makes the most sense is xml. This format can easily be manipulated using xslt
131 to fit the needs of many people and can be used with almost any existing text
132 tool. Using xsl it can also be nicely converted into HTML, and a dependency
133 tree is easily built from there. Rsync could quickly figure out which parts
134 changed and update those in a fraction of the time it takes today to diff the
135 trees.
136
137 Hardware Info:
138 P3 800 MHz
139 Slow IDE Hard Disk 5400 rpm
140 256 megs ram
141 i810 chipset
142
143 So, with those numbers now extracted, I'm about to attempt to move emerge to
144 this system on my PC and get you more "real" numbers using that mode.
145
146 --
147
148 Yannick Koehler
149
150 --
151 gentoo-dev@g.o mailing list