On Tuesday 25 October 2005 09:23, Richard Freeman wrote:
> John Myers wrote:
> > I designed a system where it took feedback from consenting users, sending
> > the file lists back to my server, where I was going to do some data
> > crunching. The data from just _my_ system was over 60 MB.
>
> It sounds like you really only need to index each package a few times at
> most. Sure, the raw data from a user could be 60MB each, but there are
> some ways to reduce that significantly:
Hm. I forgot to mention that the largest pieces (the file names and the
md5sums) are only stored once, and then referenced by a relatively small
integer (small compared to the size of, say, a file name).

Here's how it breaks down:
table                 | rows    | size
----------------------+---------+--------
ebuilds               |     994 | 118.3K
filenames             | 381,200 |  27.1M
file info             | 383,168 |  19.9M
installations list    |   1,007 |  26.7K
extra install data    |   1,007 |  88.2K
file->install mapping | 464,193 |  13.1M

There are some reinstallations and upgrades in the above data.
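The normalization above can be sketched like this (table and column names are simplified guesses at my schema, and I'm using in-memory SQLite rather than MySQL so the example is self-contained):

```python
import sqlite3

# Sketch: each distinct file name is stored once in `filenames` and
# referenced everywhere else by its small integer primary key.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE filenames (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE file_info (id INTEGER PRIMARY KEY,
                        filename_id INTEGER REFERENCES filenames(id),
                        md5 TEXT);
CREATE TABLE installations (id INTEGER PRIMARY KEY, ebuild TEXT);
CREATE TABLE file_install (file_id INTEGER REFERENCES file_info(id),
                           install_id INTEGER REFERENCES installations(id));
""")

def intern_path(path):
    # Insert the path only if it hasn't been seen; always return its id.
    db.execute("INSERT OR IGNORE INTO filenames (path) VALUES (?)", (path,))
    return db.execute("SELECT id FROM filenames WHERE path = ?",
                      (path,)).fetchone()[0]

a = intern_path("/usr/bin/vim")
b = intern_path("/usr/bin/vim")  # a second reference reuses the same id
assert a == b
```

That's why the filenames table dominates the size while the mapping table stays comparatively small per row.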

> 1. Don't send in data for anything in the base system install.
>
> 2. As you populate your database, publish a list of indexed packages
> via a URL. Users would exclude any packages you've already indexed. If
> this were a GLEP you could probably put the file in the portage
> directory and everybody would get it via rsync.
>
> 3. Start by only indexing each package ONCE. Don't worry about every
> combo of arches, CFLAGS, USE, etc. That means that most users wouldn't
> upload anything at all, and the rest would only send their unique
> contributions.
Interesting thoughts.
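On the client side, suggestions #2 and #3 would boil down to a simple set difference against the published list (package names here are just placeholders):

```python
# The server publishes the set of package builds it has already indexed;
# a client uploads only the entries on its system not in that set.
published_index = {"app-editors/vim-6.3", "sys-apps/portage-2.0.51"}

local_packages = {"app-editors/vim-6.3", "app-misc/screen-4.0.2"}

# Only the unique contribution gets sent.
to_upload = sorted(local_packages - published_index)
print(to_upload)  # ['app-misc/screen-4.0.2']
```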
> If you get everything working without indexing by USE, you could start
> adding that capability in. Publish in #2 the list of USE flags indexed
> for each package, and individuals would only upload packages compiled
> with something that wasn't on that list.
>
> Sure, the final database could easily be 100MB or so, but if you just
> put it on a website you won't be sending the whole thing. Just put it
> in mysql/postgres and build a php front end (sorry, not a web dev
> personally, but it isn't that hard to do from the little I've messed
> with it).
That's what the intention was, maybe with an XML-RPC service for a
command-line client to use. The data is already stored in a MySQL database.
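A minimal sketch of what that service could look like with Python's stdlib xmlrpc modules (the lookup_file method and its data are hypothetical, standing in for queries against the MySQL tables):

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical lookup: which package installed a given file path?
FILE_INDEX = {"/usr/bin/vim": "app-editors/vim-6.3"}

def lookup_file(path):
    return FILE_INDEX.get(path, "")

# Serve on an ephemeral localhost port in a background thread.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lookup_file)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# A command-line client would just call the method over HTTP.
client = ServerProxy(f"http://127.0.0.1:{port}")
print(client.lookup_file("/usr/bin/vim"))  # prints "app-editors/vim-6.3"
server.shutdown()
```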
>
> Sorry - I don't intend to make it sound like the whole thing can be done
> in 5 minutes, and I'm sure you've already poured hours into your effort.
> However, I don't see any theoretical issues with it as long as the
> design is right. The important thing is that users are only uploading
> diffs against your master repository - and not doing a complete dump of
> their entire system. Otherwise you will get buried in data!
The biggest problem is that there are a lot of potential variations, and they
all really need to be there for this to be useful.
>
> I must admit that it is easy to just talk about ideas like this - I
> really do want to commend you on the work you've undoubtedly already
> accomplished! OSS projects require lots of hard work by many volunteers
> and it is all too easy for people like me to just sit back and nitpick
> what could be done better...
Well, I think I might hack around on this a little more.