Gentoo Archives: gentoo-soc

From: Alec Warner <antarus@g.o>
To: gentoo-soc@l.g.o
Subject: Re: [gentoo-soc] Re: Gentoo stats server/client,
Date: Tue, 24 Mar 2009 05:00:11
Message-Id: b41005390903232200x3169609cy4331d06589cfaa7f@mail.gmail.com
In Reply to: Re: [gentoo-soc] Re: Gentoo stats server/client, by Joachim Bartosik
1 On Mon, Mar 23, 2009 at 4:40 PM, Joachim Bartosik <jbartosik@×××××.com> wrote:
2 > This idea looks interesting so if you don't mind I'll join the thread.
3 > I tried to cut everything short but it looks too long anyway :/ And trying
4 > to keep it short probably made some parts hard to understand, so please ask.
5 > If you see * scroll down to end of email for explanations.
6 >
7 >
8 >> >> There have been many stats projects in the past that have failed due to
9 >> >> various reasons. A simple question is: How are you planning on making
10 >> >> your idea/proposal not fail? ;)
11 >
12 > By being lazy and putting as much work on others as possible.
13 >
14 > Authentication/ security overview
15 >
16 > The idea from 2006 ( to create account one has to ask for id and submit
17 > some data) makes usage very simple for users ( they don't even need to know
18 > anything about authorisation), but unfortunately it's very easy to write a
19 > "client" that would submit a lot of bogus data and spoil the dataset ( I guess
20 > that's a major issue with authentication and security).
21
22 I'm not saying the prospective student has to necessarily tackle these
23 concerns. I just want to raise them as concerns because people
24 occasionally get bored and try to screw with online apps. I'd hope
25 that the gentoo-stats app is under most people's radar, but who knows.
26 If nothing else I'm hoping that some thought is given to
27 authenticating data; the whole point of this project is data
28 collection for the community and having bad data means a bad
29 experience for users and crappy data for us.
30
31 >
32 > To solve this problem I'd use a solution that is less comfortable for users: a user would
33 > have to create an account using an email ( of course it wouldn't be stored,
34 > I'd store some one-way injective function of it*) and click an emailed link.
35 > There would be no need for password - to confirm his[her] actions [s]he
36 > would just click an emailed link.
37
38 If the function is one-way, I'm a bit concerned about how you will know
39 where to send these email links. Would the user have to input their
40 email address when making changes, and you gather it from the POST/GET
41 data?
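
A minimal sketch of that flow, assuming the user re-enters their address on every request and the server only keeps a keyed digest of it. Note that this swaps in HMAC-SHA256, which is a hash and therefore not strictly injective as proposed in the quoted text (collisions are merely computationally unlikely). `SERVER_SECRET`, `accounts`, `register` and `find_account` are hypothetical names, and lowercasing the address is an assumption about how addresses should be normalized.

```python
import hashlib
import hmac

# Hypothetical secret kept in the server configuration (assumption).
SERVER_SECRET = b"hypothetical-server-secret"

# In-memory stand-in for the accounts table: digest -> account record.
accounts = {}


def email_digest(email):
    """Return a keyed one-way digest of a normalized email address."""
    normalized = email.strip().lower()  # normalization rule is an assumption
    return hmac.new(SERVER_SECRET, normalized.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def register(email):
    """Create the account if needed; only the digest is stored."""
    digest = email_digest(email)
    accounts.setdefault(digest, {"hosts": []})
    return digest


def find_account(email):
    """Look up the account from an address supplied with the request."""
    return accounts.get(email_digest(email))
```

The confirmation link would then be mailed to the address taken from the request itself, since the server cannot recover it from the stored digest.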
42
43 >
44 > Each user ( email) would have a hosts** limit ( probably set in server
45 > configuration) 2 or 3 by default ( enough for average user, not enough to
46 > easily spoil data). After some time of inactivity host/ account would be
47 > removed.
48
49 Just make the host limit configurable and we can debate defaults
50 later. Certainly there should be a limit and there should be some way
51 to request more machines; we don't want users to create lots of
52 different accounts to contain their legitimate data. Retiring
53 inactive accounts is a good idea, +1
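
A small sketch of a configurable limit with a per-account override for people who requested more machines. The default of 3 comes from the "2 or 3" suggestion in the quoted text; the function and parameter names are hypothetical.

```python
# Server-wide default; meant to come from the server configuration.
DEFAULT_HOST_LIMIT = 3


def can_register_host(current_hosts, account_limit=None,
                      default_limit=DEFAULT_HOST_LIMIT):
    """Return True if the account may register one more host.

    account_limit is an optional per-account override (for example,
    granted manually by an admin); None means "use the default".
    """
    limit = default_limit if account_limit is None else account_limit
    return current_hosts < limit
```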
54
55 >
56 > The problem starts if one would need to get more hosts per account, right
57 > now I have some ideas ( none very good):
58 > - the easiest method to implement is "please email our admin and explain why
59 > you need them", but it's user-unfriendly and admin-unfriendly.
60 > - give a really big limit on hosts per email - it would be easy to inject
61 > a lot of false data, but it's easier to remove than with the 2006 auth ( identify
62 > offending emails and delete their hosts).
63 > - require users to give some non-free ( free as in beer) email to reduce the
64 > possibility of using fake emails, and give a big hosts limit.
65
66 This is where you start to, as my work would call it, split the normal
67 users from the premier users. As long as the percentage of users who
68 need more than the default # of hosts is low, keeping it manual is OK.
69 What I might recommend is actually setting the limit fairly high at
70 first until you get your first batch of users, then look at like the
71 90th percentile of hosts per account and lower the max to that.
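
A rough sketch of that calculation, using the nearest-rank definition of a percentile (one of several common definitions). The sample counts are made up for illustration.

```python
def nearest_rank_percentile(values, percent):
    """Return the nearest-rank percentile of values (percent is an int 1..100)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, avoiding float rounding.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up hosts-per-account counts from a first batch of users.
hosts_per_account = [1, 1, 1, 2, 2, 2, 3, 3, 4, 25]
new_limit = nearest_rank_percentile(hosts_per_account, 90)
```

With this sample the one account holding 25 hosts does not pull the limit up, which is the point of using a percentile rather than the maximum.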
72
73 >
74 > I'd try to keep the need to click emailed links to a minimum - registration and
75 > administrative tasks ( like removing hosts from account).
76 >
77 >
78 >
79 > Components:
80 >
81 > Client
82 >
83 > Probably in python to take advantage of all the work portage developers have
84 > done and save me work. It'd be a simple run-me-from-the-command-line ( cron)
85 > program sending arch, all installed cpv and their USE ( for sure and before
86 > end of summer) and maybe some more if time allows ( "A daemon with 2 working
87 > modules is better than a daemon with 10 half finished ones."). Maybe [if
88 > time allows] GUI wrapper to run it in tray.
89
90 FYI, I have a working python client for portage/pkgcore/paludis that
91 collects this data and outputs some XML (that I was later going to
92 POST to a RESTful interface..that I never wrote ;p)
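
For illustration only, a sketch of what such a report could look like. This is not the format of the client mentioned above; the element and attribute names (`host`, `package`, `use`, `arch`, `cpv`) and the sample package list are assumptions, and gathering the data from the package manager is left out.

```python
import xml.etree.ElementTree as ET


def build_report(arch, packages):
    """Serialize arch plus a {cpv: [USE flags]} mapping to an XML string."""
    root = ET.Element("host", {"arch": arch})
    for cpv, use_flags in sorted(packages.items()):
        pkg = ET.SubElement(root, "package", {"cpv": cpv})
        for flag in sorted(use_flags):
            ET.SubElement(pkg, "use").text = flag
    return ET.tostring(root, encoding="unicode")


# Made-up sample data standing in for what the client would collect.
report = build_report("amd64", {
    "app-editors/vim-7.2": ["nls", "acl"],
    "sys-apps/portage-2.1.6": [],
})
```

The resulting string is what would be sent in the body of the POST request.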
93
94 >
95 > Server:
96 > Would be split into several independent programs ( to save me work). All
97 > except first one would be written in python.
98 >
99 > User communication:
100 > Thanks for the REST idea - I thought about using HTML/ HTTPS, but making it
101 > stateless saves a lot of work. To save me some work I'd start with Apache +
102 > php + MySQL, one path per action ( register host, register user, send data,
103 > ...). It'd put received data in MySQL ( without verifying that it's correct, simply
104 > get the data and put it in a table recording who sent it and when). It's
105 > not a very elegant solution ( and may turn out to be slow) so - if time
106 > allows and there is a need to - I'll rewrite it in python.
107
108 REST runs over HTTP, and you could do it with HTML, but it would be ugly ;p
109
110 This stuff sounds pretty standard, and I think if any prospective
111 students have developed webapps before, it's probably not very
112 time-consuming. I would prefer python over PHP for maintainability within
113 gentoo, but if you feel most comfortable in PHP I'm not going to make
114 you use a specific language provided the code is readable.
115
116 >
117 > Data gathering:
118 > It'd take data provided by user communication module, decompress it, apply
119 > deltas etc. to create all-the-information-available about current state of
120 > hosts.
121
122 Here I'm thinking a reporting API that generates datasets. Someone
123 else (or you) can later render the report with javascript or
124 something.
125
126 >
127 > Cleaner:
128 > Run from time to time ( by cron, frequency adjusted to needs). Remove hosts
129 > and users that do not send data ( to conserve space) etc.
130
131 Probably a trivial add-on feature provided we store the last modification date.
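
A sketch of why this is cheap once a last-modified timestamp is stored per host. The in-memory dict stands in for the database table, and `purge_inactive` and `max_idle` are hypothetical names.

```python
from datetime import datetime, timedelta


def purge_inactive(last_seen, now, max_idle):
    """Return only the hosts whose last submission is within max_idle.

    last_seen maps host id -> datetime of the last submission.
    """
    cutoff = now - max_idle
    return {host: seen for host, seen in last_seen.items() if seen >= cutoff}


# Made-up example: one host is stale, one is recent.
now = datetime(2009, 3, 24)
hosts = {
    "host-a": datetime(2009, 3, 20),
    "host-b": datetime(2008, 6, 1),
}
kept = purge_inactive(hosts, now, timedelta(days=90))
```

Run from cron, the same comparison would translate to a single delete on the hosts table.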
132
133 >
134 > Archiver:
135 > Run from time to time ( cron, as needed). Data gathering provides only
136 > information about hosts *right now*. Archiver would generate statistics (
137 > like package popularity ( % hosts that installed it)) and store them to make
138 > historical data available ( storing all host states history would be
139 > extremely excessive).
140
141 I'll call this a stretch goal (eg we think about it but delay
142 implementation until the dataset gets fairly large).
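
For reference, the popularity statistic from the quoted text (% of hosts that installed a package) is small enough to sketch here. The input shape, host id mapped to a list of package names, is an assumption about how the current state would be exposed.

```python
from collections import Counter


def package_popularity(hosts):
    """Return {package: percentage of hosts that have it installed}."""
    if not hosts:
        return {}
    # set() ensures a package listed twice on one host is counted once.
    counts = Counter(pkg for pkgs in hosts.values() for pkg in set(pkgs))
    total = len(hosts)
    return {pkg: 100.0 * n / total for pkg, n in counts.items()}


# Made-up snapshot of four hosts.
snapshot = {
    "a": ["sys-devel/gcc", "app-editors/vim"],
    "b": ["sys-devel/gcc"],
    "c": ["sys-devel/gcc", "app-editors/vim"],
    "d": ["sys-devel/gcc"],
}
popularity = package_popularity(snapshot)
```

Storing one such dict per run would give the historical series without keeping every host's full history.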
143
144 >
145 > * The one-way part means that there is no easy way to get users' emails even
146 > if someone gets access to all the data stored on the server. The injective part
147 > means that no two emails will generate the same output, so no two users will
148 > get the same account. Hashes won't work because they are not injective
149 > functions but I'm almost sure someone already wrote functions like that. I
150 > don't recall any right now, but I'll have plenty of time to look for them or
151 > in the worst case write one myself ( easy: create an asymmetric pair of keys,
152 > throw the private one to /dev/null so no one can decrypt it and encrypt emails with
153 > the public one).
154 >
155 > ** 1 host == 1 data set ( installed packages, arch etc.)
156 >
157 >
158 >
159 > I realized I forgot to say who I am:
160 >
161 > I'm Joachim, I live in Poland ( UTC + 1). Study mathematics ( 3rd year). Use
162 > Gentoo since 2005 ( as main desktop OS) or 2004 ( first contact). Code since
163 > 2003 ( training for/ participating in http://www.oi.edu.pl English version
164 > available - look in the right top corner) or 2001 ( started to play with
165 > Vbasic ). Cannot drink black tea. Right now extremely tired after sleepless
166 > weekend ( due to several breakdowns at home).
167 > Good night.
168 >
