Gentoo Archives: gentoo-amd64

From: John Myers <electronerd@××××××××××.com>
To: gentoo-amd64@l.g.o
Subject: Re: [gentoo-amd64] dig package
Date: Wed, 26 Oct 2005 06:46:19
Message-Id: 200510252211.31741.electronerd@monolith3d.com
In Reply to: Re: [gentoo-amd64] dig package by Richard Freeman
On Tuesday 25 October 2005 09:23, Richard Freeman wrote:
> John Myers wrote:
> > I designed a system where it took feedback from consenting users, sending
> > the file lists back to my server, where I was going to do some data
> > crunching. The data from just _my_ system was over 60 MB.
>
> It sounds like you really only need to index each package a few times at
> most. Sure, the raw data from a user could be 60MB each, but there are
> some ways to reduce that significantly:
Hm. I forgot to mention that the largest pieces (the file names and the
md5sums) are only stored once, and then referenced with a relatively small
integer (compared to the size of, say, a file name).

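As a rough illustration of the store-once-and-reference scheme just described, here is a minimal Python sketch. The names (InternTable, record_file) and the sample path and checksum are hypothetical, and in-memory structures stand in for the MySQL tables; this is not the actual schema.

```python
class InternTable:
    """Store each distinct string once and hand out a small integer id."""

    def __init__(self):
        self._ids = {}      # value -> id
        self._values = []   # id -> value

    def intern(self, value):
        # Reuse the existing id if the value has been seen before.
        if value not in self._ids:
            self._ids[value] = len(self._values)
            self._values.append(value)
        return self._ids[value]

    def lookup(self, ident):
        return self._values[ident]

    def __len__(self):
        return len(self._values)


filenames = InternTable()
md5sums = InternTable()
file_install_map = []  # rows of (install_id, filename_id, md5_id)


def record_file(install_id, path, md5):
    """Add one installed file, storing only integer references."""
    row = (install_id, filenames.intern(path), md5sums.intern(md5))
    file_install_map.append(row)
    return row


# Two installations of the same (made-up) file share one name and one checksum.
record_file(1, "/usr/bin/example", "0123456789abcdef0123456789abcdef")
record_file(2, "/usr/bin/example", "0123456789abcdef0123456789abcdef")
```

The mapping grows by one small row per installed file, while the long strings are stored only once.
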
Here's how it breaks down:
table                 | rows    | size
----------------------+---------+--------
ebuilds               |     994 | 118.3K
filenames             | 381,200 |  27.1M
file info             | 383,168 |  19.9M
installations list    |   1,007 |  26.7K
extra install data    |   1,007 |  88.2K
file->install mapping | 464,193 |  13.1M

There are some reinstallations and upgrades in the above data

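A quick way to check the breakdown against the "over 60 MB" figure is to add up the table. The sketch below assumes K and M mean multiples of 1024 bytes, which the message does not state.

```python
# Row counts and sizes copied from the table above.
# Assumption: K = 1024 bytes and M = 1024 K.
K = 1024
M = 1024 * K

tables = {
    "ebuilds": (994, 118.3 * K),
    "filenames": (381200, 27.1 * M),
    "file info": (383168, 19.9 * M),
    "installations list": (1007, 26.7 * K),
    "extra install data": (1007, 88.2 * K),
    "file->install mapping": (464193, 13.1 * M),
}

total_bytes = sum(size for _rows, size in tables.values())
total_m = total_bytes / M


def bytes_per_row(name):
    """Average storage per row for one table."""
    rows, size = tables[name]
    return size / rows


print(f"total: {total_m:.1f}M")
for name in tables:
    print(f"{name}: {bytes_per_row(name):.0f} bytes/row")
```

Under that assumption the total comes to about 60.3M, and a mapping row averages about 30 bytes against about 75 for a file name row, which is the saving the integer references buy.
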
> 1. Don't send in data for anything in the base system install.
>
> 2. As you populate your database, publish a list of indexed packages
> via a URL. Users would exclude any packages you've already indexed. If
> this were a GLEP you could probably put the file in the portage
> directory and everybody would get it via rsync.
>
> 3. Start by only indexing each package ONCE. Don't worry about every
> combo of arches, CFLAGS, USE, etc. That means that most users wouldn't
> upload anything at all, and the rest would only send their unique
> contributions.
Interesting thoughts

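The three suggestions above amount to a set difference computed on the client before anything is uploaded. A minimal sketch, assuming the published index is simply a set of package identifiers; the function name and the package names are placeholders, not part of any existing tool.

```python
def packages_to_upload(installed, already_indexed, base_system=frozenset()):
    """Return the installed packages the server has not indexed yet.

    installed       -- iterable of package identifiers on this machine
    already_indexed -- identifiers published by the server (suggestion 2)
    base_system     -- identifiers to skip entirely (suggestion 1)
    """
    return sorted(set(installed) - set(already_indexed) - set(base_system))


# Made-up example data.
installed = ["sys-example/base-1.0", "net-example/foo-1.2", "app-example/bar-2.0"]
already_indexed = {"app-example/bar-2.0"}
base_system = {"sys-example/base-1.0"}

to_send = packages_to_upload(installed, already_indexed, base_system)
print(to_send)
```

With this data only net-example/foo-1.2 is left to send; a user whose packages are all indexed uploads nothing.
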
> If you get everything working without indexing by USE, you could start
> adding that capability in. Publish in #2 the list of USE flags indexed
> for each package, and individuals would only upload packages compiled
> with something that wasn't on that list.
>
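The USE-aware variant could be sketched as below. It assumes the published list records whole USE flag combinations per package (one reading of the suggestion; it could equally track individual flags), and the names are hypothetical.

```python
def needs_upload(package, use_flags, indexed_use):
    """True if this USE combination is not yet indexed for the package.

    indexed_use maps a package identifier to a set of frozensets,
    each frozenset being one USE combination already on the server.
    """
    combo = frozenset(use_flags)
    return combo not in indexed_use.get(package, set())


# Made-up published index: one package, indexed once with USE="ipv6".
indexed_use = {"net-example/foo-1.2": {frozenset({"ipv6"})}}

print(needs_upload("net-example/foo-1.2", ["ipv6"], indexed_use))
print(needs_upload("net-example/foo-1.2", ["ipv6", "ssl"], indexed_use))
```

The first call reports nothing to upload, the second reports a new combination.
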
> Sure, the final database could easily be 100MB or so, but if you just
> put it on a website you won't be sending the whole thing. Just put it
> in mysql/postgres and build a php front end (sorry, not a web dev
> personally, but it isn't that hard to do from the little I've messed
> with it).
That's what the intention was, maybe with an XML-RPC service for a
command-line client to use. The data is stored in a MySQL database.
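For illustration only, an XML-RPC lookup service of the kind mentioned above might look like this using Python's standard xmlrpc.server module. The thread does not spell out the query, so the sketch assumes it answers "which packages install this file?"; find_owners, make_server, the port and the sample data are all hypothetical, and a dict stands in for the MySQL database.

```python
from xmlrpc.server import SimpleXMLRPCServer

# Stand-in for the database: file path -> list of package names.
FILE_OWNERS = {
    "/usr/bin/example": ["app-example/bar"],
}


def find_owners(path):
    """Return the packages known to install the given file path."""
    return FILE_OWNERS.get(path, [])


def make_server(host="localhost", port=8000):
    """Build (but do not start) an XML-RPC server exposing find_owners."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(find_owners, "find_owners")
    return server
```

Calling make_server().serve_forever() would start it, and a command-line client could then query it with xmlrpc.client.ServerProxy("http://localhost:8000/").find_owners("/usr/bin/example").
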
>
> Sorry - I don't intend to make it sound like the whole thing can be done
> in 5 minutes, and I'm sure you've already poured hours into your effort.
> However, I don't see any theoretical issues with it as long as the
> design is right. The important thing is that users are only uploading
> diffs against your master repository - and not doing a complete dump of
> their entire system. Otherwise you will get buried in data!
The biggest problem is that there are a lot of potential variations, and they
all really need to be there for this to be useful.
>
> I must admit that it is easy to just talk about ideas like this - I
> really do want to commend you on the work you've undoubtedly already
> accomplished! OSS projects require lots of hard work by many volunteers
> and it is all too easy for people like me to just sit back and nitpick
> what could be done better...
Well, I think I might hack around on this a little more.

Replies

Subject Author
Re: [gentoo-amd64] dig package Billy Holmes <billy@××××××.net>