Gentoo Archives: gentoo-soc

From: Priit Laes <plaes@×××××.org>
To: gentoo-soc <gentoo-soc@l.g.o>
Cc: leio <leio@g.o>, ferringb <ferringb@g.o>
Subject: [gentoo-soc] Project Grumpy - weekly report #2
Date: Wed, 09 Jun 2010 17:45:12
Message-Id: 1276105489.16189.3.camel@chi
This is weekly progress report no. 2 for Project Grumpy.

As reported previously, I am building a system to index portage packages
and related metadata to make package maintainership a bit easier for
developers.

First, a few words about metadata storage. For this project, the plan is
to use a document-oriented, schema-free database (MongoDB) instead of a
regular relational database system (like SQLite or PostgreSQL).

This also means that we can create a single document collection in which
each document corresponds to one "category/package" and the collection as
a whole contains the entire ebuild tree.

Each document in the collection is just a JSON-formatted dictionary with
the following structure (beware, this is a work in progress, so some things
are still missing)::

    {
      # "category/package" (primary index, unique)
      '_id' : string,

      # Version of the schema, used internally (just in case)
      'schema_ver' : integer,

      # Package category
      'cat' : string,

      # Package name
      'pkg' : string,

      ## Data from metadata.xml
      # List of herds maintaining this package
      'herds' : [ string, ... ],
      # Long description of the package
      'ldesc' : string,
      # List of maintainers (by email addresses)
      'maintainers' : [ string, ... ],

      ## Data from the ebuilds themselves (but should be general)
      # Description
      'desc' : string,
      # Upstream url(s) (FIXME: Do we need list here?)
      'homepage' : string,

      # Array of all the package versions and their specific info
      'ebuilds' : [
        {
          # Package version (from category/package-version)
          'version' : string,

          # EAPI version
          'eapi' : integer,
          # List of USE flags supported by this ebuild
          'iuse' : [ string, ... ],
          # Package keywords ("x86", "~amd64", ...)
          'keywords' : [ string, ... ],
          # Licenses
          'licence' : [ string, ... ],
          # Package slot
          'slot' : string,

          # Need to figure out proper structure for these, so we can also
          # map out USE flags ;)
          'depend' : TODO!!!
          'rdepend' : TODO!!!
        },
        ...
      ]
    }

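To make the layout more concrete, here is a minimal sketch of what one such
document might look like as a Python dictionary. Every value below is invented
for illustration (it is not real portage data), and the check_doc helper is
hypothetical, not part of Grumpy. The sketch also assumes that each entry of
'ebuilds' is its own sub-dictionary.

```python
# Hypothetical example document following the structure above.
# All values are invented for illustration; they are not real portage data.
example_doc = {
    "_id": "dev-python/example",  # "category/package"
    "schema_ver": 1,
    "cat": "dev-python",
    "pkg": "example",
    "herds": ["python"],
    "ldesc": "A longer description taken from metadata.xml.",
    "maintainers": ["someone@example.org"],
    "desc": "Short description from the ebuild",
    "homepage": "http://example.org/",
    "ebuilds": [
        {"version": "1.0", "eapi": 0, "iuse": ["doc"],
         "keywords": ["x86", "amd64"], "licence": ["MIT"], "slot": "0"},
        {"version": "1.1", "eapi": 2, "iuse": ["doc", "test"],
         "keywords": ["~x86", "~amd64"], "licence": ["MIT"], "slot": "0"},
    ],
}


def check_doc(doc):
    """Return a list of problems found in a package document (empty if none)."""
    problems = []
    # The primary key is expected to be "category/package".
    if doc.get("_id") != "%s/%s" % (doc.get("cat"), doc.get("pkg")):
        problems.append("_id does not match cat/pkg")
    # These fields are arrays in the structure above.
    for key in ("herds", "maintainers", "ebuilds"):
        if not isinstance(doc.get(key), list):
            problems.append("%s is not a list" % key)
    # Each ebuild entry should carry an integer EAPI.
    ebuilds = doc.get("ebuilds")
    for ebuild in ebuilds if isinstance(ebuilds, list) else []:
        if not isinstance(ebuild.get("eapi"), int):
            problems.append("ebuild %r has a non-integer eapi"
                            % ebuild.get("version"))
    return problems
```

Calling check_doc(example_doc) returns an empty list for the sample above.
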
So how about querying the data? That's easy. (Please note that we are using
the MongoDB shell.) So, what if a developer wants to know which packages he
is supposedly maintaining::

    > db.ebuilds.find({'maintainers' : '...@g.o' })
    {... document data ...} # (Too much info :) )
    > db.ebuilds.find({'maintainers' : '...@g.o' }).count()
    7

And the results come fast. I mean really fast.

OK, how about checking how many packages under 'dev-python' are using a
specific EAPI version::

    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 0}).count()
    202
    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 1}).count()
    3
    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 2}).count()
    255
    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 3}).count()
    125
    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 4}).count()
    0
    > db.ebuilds.find({'cat' : 'dev-python' }).count()
    504
    > 202+3+255+125 - 504
    81

Ahem... looks like we have a "design issue" with our document structure: the
per-EAPI counts add up to 81 more than the total number of packages. So back
to the drawing board.

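One likely explanation for the surplus, offered as an assumption rather than a
diagnosis of the actual data: a query on 'ebuilds.eapi' matches a document
when any element of its 'ebuilds' array has that value, so a package that
ships ebuilds with several EAPI versions is counted once per version. The
sketch below imitates that matching rule in plain Python on made-up data (the
helper count_packages_with_eapi and the sample documents are hypothetical) and
contrasts it with counting individual ebuilds.

```python
from collections import Counter

# Made-up sample data: three packages, one of which mixes EAPI versions.
docs = [
    {"_id": "dev-python/a", "ebuilds": [{"version": "1.0", "eapi": 0}]},
    {"_id": "dev-python/b", "ebuilds": [{"version": "2.0", "eapi": 2}]},
    {"_id": "dev-python/c", "ebuilds": [{"version": "0.9", "eapi": 0},
                                        {"version": "1.0", "eapi": 2}]},
]


def count_packages_with_eapi(docs, eapi):
    """Imitate find({'ebuilds.eapi': eapi}).count(): a document matches
    if at least one of its ebuilds uses the given EAPI."""
    return sum(1 for d in docs
               if any(e["eapi"] == eapi for e in d["ebuilds"]))


# Per-EAPI package counts, and how far their sum exceeds the package total.
per_eapi = {eapi: count_packages_with_eapi(docs, eapi) for eapi in (0, 1, 2)}
surplus = sum(per_eapi.values()) - len(docs)

# Counting ebuilds instead of packages gives totals that do add up.
ebuilds_per_eapi = Counter(e["eapi"] for d in docs for e in d["ebuilds"])
total_ebuilds = sum(len(d["ebuilds"]) for d in docs)

print(per_eapi, surplus)
print(dict(ebuilds_per_eapi), total_ebuilds)
```

In this sample, package "c" is counted under both EAPI 0 and EAPI 2, so the
per-EAPI package counts exceed the number of packages by one, while the
per-ebuild counts sum exactly to the number of ebuilds.
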
Last week's progress report
===========================

Last week's progress has been a bit slow; I have mostly experimented with the
document structure and with pkgcore's internals. Although I now have the
portage contents inside the database, the document structure itself is far
from ideal (as you can see from the example with EAPI counts given earlier).

I have committed some of the stuff I have been working on to Grumpy's repo,
so in case you are interested, check it out from [1].

[1] http://git.overlays.gentoo.org/gitweb/?p=proj/grumpy.git;a=summary

First, a warning: the portage->mongodb syncer is slow. I mean really slow. It
takes about 3 hours (or even more) on my laptop to fully scan the contents of
portage and store the data in the database.

Plans for current week
======================

1) Speed up the portage syncer
2) Improve document structure

Päikest,
Priit :)