-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sebastian Pipping wrote:
> Marijn Schouten (hkBst) wrote:
>> Sebastian Pipping wrote:
>>> I start to understand the real benefits of moving a larger
>>> part of the maintenance down to the distro level as you proposed.
>>> Okay, let's add support for CPEs at distro package level
>>> and sync up and down with the central packagemap database.
>>> Please contact me for collaboration on sync scripts
>>> and "modeling" of details.
>> Do we not already have enough information available to automatically determine
>> derived unique identifiers like CPE?
>>
>> We have the package homepage and the package name (and the package category) and
>> the combination should be enough information to do direct comparisons to data
>> gathered from other repos (assuming they also contain such data).
>
> You are asking a valid question. The homepage links can be a great
> helper in mapping and they have been of help already for the mapping
> of the first 1000 Gentoo packages in packagemap.
>
> However it might not be as easy as you make it sound, as there are
> a few things that complicate matters and produce extra work:
>
> - In many cases a project can be reached from several URLs.
>   For a project on SF.net you might have
>   - http://sf.net/projects/${name}
>   - http://${name}.sf.net/
>   - http://www.${name}.org/
>   That case can be handled rather easily but there are many more
>   special cases and a manual map may be required for stuff that's
>   not hosted on a larger hosting site.

But the homepage is just ONE of the things that help you identify a package.
Some packages that are the same will have different homepages, and some packages
that are different will have the same homepage. If you take into account just
the homepage, the package name, and the fact that packages from the same repo
are different, you can probably match over 95% of all packages correctly.
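
A minimal sketch of how the SF.net URL variants quoted above could be folded
into one comparison key. The function name homepage_key and the "sf:" key
prefix are made up for illustration, and the rules cover only the patterns
mentioned in this thread. The third variant (http://www.${name}.org/) cannot be
recognised from the URL alone, which is the case that still needs a manual map.

```python
import re
from urllib.parse import urlparse


def homepage_key(url: str) -> str:
    """Reduce a homepage URL to a comparison key (illustrative rules only)."""
    parsed = urlparse(url.strip().lower())
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path.rstrip("/")

    # http://sf.net/projects/NAME (and the long sourceforge.net spelling)
    if host in ("sf.net", "sourceforge.net"):
        match = re.match(r"^/projects/([^/]+)", path)
        if match:
            return "sf:" + match.group(1)

    # http://NAME.sf.net/
    for suffix in (".sf.net", ".sourceforge.net"):
        if host.endswith(suffix):
            return "sf:" + host[: -len(suffix)]

    # Anything else: host plus path, without "www." and trailing slash.
    return host + path


if __name__ == "__main__":
    for url in ("http://sf.net/projects/synce/",
                "http://synce.sf.net/",
                "http://www.gentoo.org/",
                "http://www.gentoo.org"):
        print(url, "->", homepage_key(url))
```

Lower-casing the whole URL is a simplification; paths can be case-sensitive, so
a real tool might lower-case only the host.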

> - Split packages (think Git or Qt) may all have the same homepage.
>   In Debian the source package might help there, in Gentoo you'd
>   have to do common prefix detection or so, that's special
>   cases again, and continuous review that it still does what you need.

Neither of the git packages Gentoo has seems very split, so I'll only address
Qt. Gentoo has qt-core and qt-svg (and many more). I would say that they would
each have to get a different CPE and that none of them is equivalent to a
package in another or the same distro that has all of Qt combined. Packages that
get manually split are a minority AFAIK, though texlive is another big one that
comes to mind. Debian does splitting into ``normal'' and ``devel'' packages. Has
it been decided what to do with those?

Now that you got me thinking about split packages, I realize that the exact
files installed by a package are also all by themselves a way to get over 95%
correct matching. For distros (like Gentoo) whose packages have flags that
influence the list of installed files, you must decide whether to add them to
the database last or to try using an imprecise file list.
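
One way to compare installed-file lists is a set-overlap score. The sketch
below uses the Jaccard index, which is a choice made here for illustration and
not something proposed in this thread. The file lists in the example are
invented.

```python
def jaccard(files_a, files_b):
    """Share of files common to both lists, from 0.0 (disjoint) to 1.0 (equal)."""
    a, b = set(files_a), set(files_b)
    if not a and not b:
        return 0.0  # two empty lists carry no evidence either way
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical file lists, not taken from any real package.
    pkg_one = ["/usr/bin/foo", "/usr/bin/foo-config", "/usr/share/man/man1/foo.1"]
    pkg_two = ["/usr/bin/foo", "/usr/bin/foo-config", "/usr/share/doc/foo/README"]
    print(jaccard(pkg_one, pkg_two))
```

An imprecise file list (for example one built with different flags) lowers the
score without zeroing it, so a threshold below 1.0 would be needed in practice.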

>> For example you can determine automatically that gentoo:dev-scheme/gambit and
>> debian:gambc are the same package because although their names differ they have
>> the same homepage and share a category.
>
> To detect equal categories you need a map for categories for all
> participating distros. Yes, it's smaller than mapping all packages
> but it involves a manual map and keeping it in sync.

No, there need not be a manual mapping. There is no reason to do true/false
comparisons. All we need is a distance function, such as Levenshtein distance
(http://en.wikipedia.org/wiki/Levenshtein_distance). Actually, on second
thought, Levenshtein distance is probably not what we want, since we would be
more interested in how much strings have in common than in how much they differ.
I think the idea is clear though.
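
As a rough illustration of a measure that rewards what two strings have in
common, Python's standard difflib module offers SequenceMatcher.ratio(), which
returns a value between 0.0 and 1.0. Using it here is an assumption for the
sake of example, not a settled choice, and the helper name similarity is made
up.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0.0..1.0 score based on the matching parts of a and b."""
    # ratio() is 2*M/T: M = matched characters, T = total length of both strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    # The package names from the gambit example quoted above.
    print(similarity("gambit", "gambc"))
    # Two Gentoo split-package names mentioned earlier in this mail.
    print(similarity("qt-core", "qt-svg"))
```

The same function could be applied to category names, so that near-identical
categories score high without a hand-maintained category map.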

> Another word on homepage collisions: A few days ago I wrote
> a script that builds a map from homepages to packagenames for the
> whole Gentoo tree (code/gentoo/gentoo-world-to-homepage-map.sh).
> The generated table from my run was 12330 lines long, each line for
> a different package.
>
> If you run an analysis over that table you see that many
> homepages appear many more times than just once.
> Here's the top ten:
>
>   68 http://www.gnome.org/
>   67 http://www.gentoo.org/
>   58 http://www.gentoo.org/proj/en/perl/
>   42 http://lingucomponent.openoffice.org/
>   26 http://www.kde.org/
>   25 http://www.gentoo.org
>   20 http://sourceforge.net/projects/synce/
>   19 http://www.trolltech.com/
>   19 http://search.cpan.org/~rjbs/
>   18 http://opensuse.foehr-it.de/

texlive (http://www.tug.org/texlive/) seems to be missing from this list.

$ eix -H http://www.tug.org/texlive/ | tail -n 1
Found 79 matches.

I suspect you used grep (or whatever) to construct your data, instead of using
the package manager or a tool that knows how to extract the data available in
packages (and eclasses).
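
The tally from the sed pipeline quoted just below could also be sketched in
Python. This assumes each line of homepage-to-package.txt holds the homepage
first, then whitespace, then the package name, which is only inferred from the
sed expression. The function name top_homepages and the normalize switch are
made up for illustration. The switch folds trailing-slash variants, since the
top ten above counts http://www.gentoo.org/ and http://www.gentoo.org
separately.

```python
from collections import Counter


def top_homepages(lines, n=10, normalize=False):
    """Count how often each homepage occurs and return the n most common."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        homepage = line.split(None, 1)[0]  # first whitespace-separated field
        if normalize:
            homepage = homepage.rstrip("/")
        counts[homepage] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Assumed input file name, taken from the quoted command.
    with open("homepage-to-package.txt", encoding="utf-8") as table:
        for homepage, count in top_homepages(table, normalize=True):
            print(f"{count:7d} {homepage}")
```

This only post-processes an existing table; it does nothing about homepages
that the table misses in the first place.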

> The command I used is
>
>   $ sed 's| *.*$||' homepage-to-package.txt \
>       | sort | uniq -c | sort -n -r | head -n 10
>
> I think these three cases alone show that it would be

I'm not sure which 3 cases you mean.

> - also a lot of work
> - be many special cases
> - still require manual mappings here and there
>
> Another disadvantage is the current static XML approach of
> packagemap is language independent. We can easily build
> tools for packagemap in any language that has an XML parser.

I agree that XML is a disadvantage, but not that it is language independent. ;P

> If the data actually is the code we suddenly have to keep
> code from different languages in precise special case sync.

I did not argue for a data format nor for a specific language nor coding style
nor anything that seems to match what you are saying here; I only spoke about
how to populate the CPE database.
130 |
|
131 |
> I'm not sure if the approach you describe is less work in total. |
132 |
> I guess to find out we'd have to do both in parallel :-) |
133 |
> |
134 |
> It could be interesting how much the list of homepages |
135 |
> in say Debian packages and Gentoo packages overlap. |
136 |
|
137 |
It would certainly be interesting. |
138 |
|
139 |
Marijn |
140 |
|
141 |
- -- |
142 |
If you cannot read my mind, then listen to what I say. |
143 |
|
144 |
Marijn Schouten (hkBst), Gentoo Lisp project, Gentoo ML |
145 |
<http://www.gentoo.org/proj/en/lisp/>, #gentoo-{lisp,ml} on FreeNode |
146 |
-----BEGIN PGP SIGNATURE----- |
147 |
Version: GnuPG v2.0.11 (GNU/Linux) |
148 |
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org |
149 |
|
150 |
iEYEARECAAYFAko6A7QACgkQp/VmCx0OL2wl/wCgpSNzob7skilge+56ynbmawHY |
151 |
/1EAoJnOOG2Bix0IpWqySP063AJIWDta |
152 |
=L9t+ |
153 |
-----END PGP SIGNATURE----- |