On Tuesday 25 October 2005 09:23, Richard Freeman wrote:
> John Myers wrote:
> > I designed a system where it took feedback from consenting users, sending
> > the file lists back to my server, where I was going to do some data
> > crunching. The data from just _my_ system was over 60 MB.
>
> It sounds like you really only need to index each package a few times at
> most. Sure, the raw data from a user could be 60MB each, but there are
> some ways to reduce that significantly:
Hm. I forgot to mention that the largest pieces (the file names and the
md5sums) are only stored once, and then referenced by a relatively small
integer (small compared to the size of, say, a file name).

Here's how it breaks down:
table                 | rows    | size
----------------------+---------+--------
ebuilds               |     994 | 118.3K
filenames             | 381,200 |  27.1M
file info             | 383,168 |  19.9M
installations list    |   1,007 |  26.7K
extra install data    |   1,007 |  88.2K
file->install mapping | 464,193 |  13.1M

There are some reinstallations and upgrades in the above data.
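The normalization above can be sketched like this (table and column names are simplified guesses at my schema, and I'm using in-memory SQLite rather than MySQL so the example is self-contained):

```python
import sqlite3

# Sketch: each distinct file name is stored once in `filenames` and
# referenced everywhere else by its small integer primary key.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE filenames (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE file_info (id INTEGER PRIMARY KEY,
                        filename_id INTEGER REFERENCES filenames(id),
                        md5 TEXT);
CREATE TABLE installations (id INTEGER PRIMARY KEY, ebuild TEXT);
CREATE TABLE file_install (file_id INTEGER REFERENCES file_info(id),
                           install_id INTEGER REFERENCES installations(id));
""")

def intern_path(path):
    # Insert the path only if it hasn't been seen; always return its id.
    db.execute("INSERT OR IGNORE INTO filenames (path) VALUES (?)", (path,))
    return db.execute("SELECT id FROM filenames WHERE path = ?",
                      (path,)).fetchone()[0]

a = intern_path("/usr/bin/vim")
b = intern_path("/usr/bin/vim")  # a second reference reuses the same id
assert a == b
```

That's why the filenames table dominates the size while the mapping table stays comparatively small per row.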

> 1. Don't send in data for anything in the base system install.
>
> 2. As you populate your database, publish a list of indexed packages
> via a URL. Users would exclude any packages you've already indexed. If
> this were a GLEP you could probably put the file in the portage
> directory and everybody would get it via rsync.
>
> 3. Start by only indexing each package ONCE. Don't worry about every
> combo of arches, CFLAGS, USE, etc. That means that most users wouldn't
> upload anything at all, and the rest would only send their unique
> contributions.
Interesting thoughts.
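On the client side, suggestions #2 and #3 would boil down to a simple set difference against the published list (package names here are just placeholders):

```python
# The server publishes the set of package builds it has already indexed;
# a client uploads only the entries on its system not in that set.
published_index = {"app-editors/vim-6.3", "sys-apps/portage-2.0.51"}

local_packages = {"app-editors/vim-6.3", "app-misc/screen-4.0.2"}

# Only the unique contribution gets sent.
to_upload = sorted(local_packages - published_index)
print(to_upload)  # ['app-misc/screen-4.0.2']
```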
> If you get everything working without indexing by USE, you could start
> adding that capability in. Publish in #2 the list of USE flags indexed
> for each package, and individuals would only upload packages compiled
> with something that wasn't on that list.
>
> Sure, the final database could easily be 100MB or so, but if you just
> put it on a website you won't be sending the whole thing. Just put it
> in mysql/postgres and build a php front end (sorry, not a web dev
> personally, but it isn't that hard to do from the little I've messed
> with it).
That's what the intention was, maybe with an XML-RPC service for a
command-line client to use. The data is already stored in a MySQL database.
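A minimal sketch of what that service could look like with Python's stdlib xmlrpc modules (the lookup_file method and its data are hypothetical, standing in for queries against the MySQL tables):

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical lookup: which package installed a given file path?
FILE_INDEX = {"/usr/bin/vim": "app-editors/vim-6.3"}

def lookup_file(path):
    return FILE_INDEX.get(path, "")

# Serve on an ephemeral localhost port in a background thread.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lookup_file)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# A command-line client would just call the method over HTTP.
client = ServerProxy(f"http://127.0.0.1:{port}")
print(client.lookup_file("/usr/bin/vim"))  # prints "app-editors/vim-6.3"
server.shutdown()
```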
>
> Sorry - I don't intend to make it sound like the whole thing can be done
> in 5 minutes, and I'm sure you've already poured hours into your effort.
> However, I don't see any theoretical issues with it as long as the
> design is right. The important thing is that users are only uploading
> diffs against your master repository - and not doing a complete dump of
> their entire system. Otherwise you will get buried in data!
The biggest problem is that there are a lot of potential variations, and they
all really need to be there for this to be useful.
>
> I must admit that it is easy to just talk about ideas like this - I
> really do want to commend you on the work you've undoubtedly already
> accomplished! OSS projects require lots of hard work by many volunteers
> and it is all too easy for people like me to just sit back and nitpick
> what could be done better...
Well, I think I might hack around on this a little more.