Gentoo Archives: gentoo-dev

From: "Marijn Schouten (hkBst)" <hkBst@g.o>
To: gentoo-dev@l.g.o
Cc: PackageKit users and developers list <packagekit@×××××××××××××××××.org>, Paul Wise <pabs@××××××.org>, Christian Faulhammer <fauli@g.o>, "Petteri Räty" <betelgeuse@g.o>, Robert Buchholz <rbu@g.o>
Subject: Re: [packagekit] [gentoo-dev] Inviting you to project "PackageMap"
Date: Thu, 18 Jun 2009 09:13:41
Message-Id: 4A3A03B4.6030904@gentoo.org
In Reply to: Re: [packagekit] [gentoo-dev] Inviting you to project "PackageMap" by Sebastian Pipping
1 -----BEGIN PGP SIGNED MESSAGE-----
2 Hash: SHA1
3
4 Sebastian Pipping wrote:
5 > Marijn Schouten (hkBst) wrote:
6 >> Sebastian Pipping wrote:
7 >>> I start to understand the real benefits of moving a larger
8 >>> part of the maintenance down to the distro level as you proposed.
9 >>> Okay, let's add support for CPEs at distro package level
10 >>> and sync up and down with the central packagemap database.
11 >>> Please contact me for collaboration on sync scripts
12 >>> and "modeling" of details.
13 >> Do we not already have enough information available to automatically determine
14 >> derived unique identifiers like CPE?
15 >>
16 >> We have the package homepage and the package name (and the package category) and
17 >> the combination should be enough information to do direct comparisons to data
18 >> gathered from other repos (assuming they also contain such data).
19 >
20 > You are asking a valid question. The homepage links can be a great
21 > helper in mapping and they have been of help already for the mapping
22 > of the first 1000 Gentoo packages in packagemap.
23 >
24 > However it might not be as easy as you make it sound, as there are
25 > a few things that complicate matters and produce extra work:
26 >
27 > - In many cases a project can be reached from several URLs.
28 > For a project on SF.net you might have
29 > - http://sf.net/projects/${name}
30 > - http://${name}.sf.net/
31 > - http://www.${name}.org/
32 > That case can be handled rather easily but there are many more
33 > special cases and a manual map may be required for stuff that's
34 > not hosted on a larger hosting site.
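The "handled rather easily" SF.net case quoted above could be sketched roughly as follows. This is only an illustration, not code from packagemap: the function name sf_project_key and the "sf:<name>" key format are made up for the example, and it deliberately does not try to resolve the third form (http://www.${name}.org/), which is the kind of case that would still need a manual map.

```python
from urllib.parse import urlparse

# Hypothetical helper, not part of packagemap: reduce the SF.net URL shapes
# from the quoted list to one comparable key.
def sf_project_key(url):
    """Return 'sf:<name>' for SF.net project URLs, or None for anything else."""
    parts = urlparse(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = [p for p in parts.path.split("/") if p]
    for domain in ("sf.net", "sourceforge.net"):
        # Form 1: http://sf.net/projects/<name>
        if host == domain and len(path) >= 2 and path[0] == "projects":
            return "sf:" + path[1].lower()
        # Form 2: http://<name>.sf.net/
        if host.endswith("." + domain):
            return "sf:" + host[: -len("." + domain)].lower()
    # Form 3 (http://www.<name>.org/) cannot be recognised from the URL alone.
    return None


print(sf_project_key("http://sf.net/projects/synce"))
print(sf_project_key("http://synce.sf.net/"))
print(sf_project_key("http://www.synce.org/"))
```

Two homepages would then be treated as pointing at the same project when their keys are equal and not None.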
35
36 But homepage is just ONE of the things that help you to identify a package. Some
37 packages that are the same will have different homepages and some packages which
38 are different will have the same homepage. If you take into account just the
39 homepage, the package name, and the fact that packages from the same repo are
40 distinct, you can probably match over 95% of all packages correctly.
41
42 > - Split packages (think Git or Qt) may all have the same homepage.
43 > In Debian the source package might help there, in Gentoo you'd
44 > have to do common prefix detection or so, that's special
45 > cases again, and continuous review that it still does what you need.
46
47 Neither of the gits gentoo has seems very split, so I'll only address qt. Gentoo
48 has qt-core and qt-svg (and many more). I would say that they would each have to
49 get a different CPE and that none of them is equivalent to a package in another
50 or the same distro that has all of qt combined. Packages that get manually split
51 are a minority AFAIK, though texlive is another big one that comes to mind.
52 Debian does splitting into ``normal'' and ``devel'' packages. Has it been
53 decided what to do with those?
54 Now that you got me thinking about split packages, I realize that the exact
55 files installed by a package are also all by themselves a way to get over 95%
56 correct matching. For distros (like Gentoo) that have packages that have flags
57 that influence the list of installed files you must decide whether to add them
58 to the database last, or whether you will try to use an imprecise file list.
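One way the file-list idea could be made concrete is a set-overlap score such as the Jaccard index. The sketch below is an assumption about how such a comparison might look, not something proposed in this thread; the function name and the example paths are invented for illustration.

```python
# Hypothetical sketch: score two packages by the overlap of their installed files.
def jaccard(files_a, files_b):
    """Return |A & B| / |A | B| for two file lists (0.0 if both are empty)."""
    a, b = set(files_a), set(files_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Invented file lists for two packages from different distros.
pkg_one = ["/usr/bin/foo", "/usr/lib/libfoo.so", "/usr/share/man/man1/foo.1"]
pkg_two = ["/usr/bin/foo", "/usr/lib/libfoo.so", "/usr/include/foo.h"]
print(jaccard(pkg_one, pkg_two))
```

A score like this degrades gracefully when flags change the installed file list: an imprecise list lowers the score instead of turning a match into a mismatch.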
59
60 >> For example you can determine automatically that gentoo:dev-scheme/gambit and
61 >> debian:gambc are the same package because although their names differ they have
62 >> the same homepage and share a category.
63 >
64 > To detect equal categories you need a map for categories for all
65 > participating distros. Yes, it's smaller than mapping all packages
66 > but it involves a manual map and keeping it in sync.
67
68 No, there need not be a manual mapping. There is no reason to do true/false
69 comparisons. All we need is a distance function, like for example Levenshtein
70 distance (http://en.wikipedia.org/wiki/Levenshtein_distance). Actually on second
71 thought Levenshtein distance is probably not what we want, since we would be
72 more interested in how much strings have in common than in how much they differ.
73 I think the idea is clear though.
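A measure of how much two strings have in common, rather than how much they differ, is available in the Python standard library as difflib.SequenceMatcher. The sketch below uses it purely as an example of such a similarity function; choosing this particular measure, and the function name name_similarity, are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical sketch: similarity in [0, 1] based on matching subsequences,
# where 1.0 means the two strings are identical.
def name_similarity(a, b):
    """Return a similarity score for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# The package names from the gambit example in this mail.
print(name_similarity("gambit", "gambc"))
print(name_similarity("gambit", "qt-core"))
```

The same function could be applied to category names, so that no manual category map is needed: close names simply contribute a higher score to the overall match.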
74
75 > Another word on homepage collisions: A few days ago I wrote
76 > a script that builds a map from homepages to packagenames for the
77 > whole Gentoo tree (code/gentoo/gentoo-world-to-homepage-map.sh).
78 > The generated table from my run was 12330 lines long, each line for
79 > a different package.
80 >
81 > If you run an analysis over that table you see that many
82 > homepages appear far more often than just once.
83 > Here's the top ten:
84 >
85 > 68 http://www.gnome.org/
86 > 67 http://www.gentoo.org/
87 > 58 http://www.gentoo.org/proj/en/perl/
88 > 42 http://lingucomponent.openoffice.org/
89 > 26 http://www.kde.org/
90 > 25 http://www.gentoo.org
91 > 20 http://sourceforge.net/projects/synce/
92 > 19 http://www.trolltech.com/
93 > 19 http://search.cpan.org/~rjbs/
94 > 18 http://opensuse.foehr-it.de/
95
96 texlive (http://www.tug.org/texlive/) seems to be missing from this list.
97
98 $ eix -H http://www.tug.org/texlive/ | tail -n 1
99 Found 79 matches.
100
101 I suspect you used grep (or whatever) to construct your data, instead of using
102 the package manager or a tool that knows how to extract the data available in
103 packages (and eclasses).
104
105 > The command I used is
106 >
107 > $ sed 's| *.*$||' homepage-to-package.txt \
108 > | sort | uniq -c | sort -n -r | head -n 10
109 >
110 > I think these three cases alone show that it would be
111
112 I'm not sure which 3 cases you mean.
113
114 > - also a lot of work
115 > - be many special cases
116 > - still require manual mappings here and there
117 >
118 > Another disadvantage is the current static XML approach of
119 > packagemap is language independent. We can easily build
120 > tools for packagemap in any language that has an XML parser.
121
122 I agree that XML is a disadvantage, but not that it is language independent. ;P
123
124 > If the data actually is the code we suddenly have to keep
125 > code from different languages in precise special case sync.
126
127 I did not argue for a data format nor for a specific language nor coding style
128 nor anything that seems to match what you are saying here; I only spoke about
129 how to populate the CPE database.
130
131 > I'm not sure if the approach you describe is less work in total.
132 > I guess to find out we'd have to do both in parallel :-)
133 >
134 > It could be interesting how much the list of homepages
135 > in say Debian packages and Gentoo packages overlap.
136
137 It would certainly be interesting.
138
139 Marijn
140
141 - --
142 If you cannot read my mind, then listen to what I say.
143
144 Marijn Schouten (hkBst), Gentoo Lisp project, Gentoo ML
145 <http://www.gentoo.org/proj/en/lisp/>, #gentoo-{lisp,ml} on FreeNode
146 -----BEGIN PGP SIGNATURE-----
147 Version: GnuPG v2.0.11 (GNU/Linux)
148 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
149
150 iEYEARECAAYFAko6A7QACgkQp/VmCx0OL2wl/wCgpSNzob7skilge+56ynbmawHY
151 /1EAoJnOOG2Bix0IpWqySP063AJIWDta
152 =L9t+
153 -----END PGP SIGNATURE-----

Replies

Subject Author
Re: [packagekit] [gentoo-dev] Inviting you to project "PackageMap" Sebastian Pipping <webmaster@××××××××.org>