On Mon, Mar 23, 2009 at 4:40 PM, Joachim Bartosik <jbartosik@×××××.com> wrote:
> This idea looks interesting, so if you don't mind I'll join the thread.
> I tried to cut everything short but it looks too long anyway :/ And trying
> to keep it short probably made some parts hard to understand, so please ask.
> If you see * scroll down to the end of the email for explanations.
>
>
>> >> There have been many stats projects in the past that have failed due to
>> >> various reasons. A simple question is: How are you planning on making
>> >> your idea/proposal not fail? ;)
>
> By being lazy and putting as much work on others as possible.
>
> Authentication/security overview
>
> The idea from 2006 (to create an account one has to ask for an id and submit
> some data) makes usage very simple for users (they don't even need to know
> anything about authorisation), but unfortunately it's very easy to write a
> "client" that would submit a lot of data that would spoil the dataset (I guess
> that's a major issue with authentication and security).

I'm not saying the prospective student has to necessarily tackle these
concerns. I just want to raise them as concerns because people
occasionally get bored and try to screw with online apps. I'd hope
that the gentoo-stats app is under most people's radar, but who knows.
If nothing else I'm hoping that some thought is given to
authenticating data; the whole point of this project is data
collection for the community and having bad data means a bad
experience for users and crappy data for us.

>
> To solve this problem I'd use a solution that is less comfortable for users:
> a user would have to create an account using an email (of course it wouldn't
> be stored, I'd store some one-way injective function of it*) and click an
> emailed link. There would be no need for a password: to confirm his or her
> actions the user would just click an emailed link.

I'm a bit concerned: if the function is one-way, how will you know
where to send these email links? Would the user have to input their
email address when making changes, and you gather it from the POST/GET
data?

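One way this could work, as a minimal sketch under assumptions that are not in the proposal: the server keeps only a keyed digest of the normalized address, the user re-enters the address with each administrative request, and the server mails a signed, expiring link to the address just supplied. HMAC-SHA256 is used here; it is one-way and collision-resistant but not strictly injective. `SERVER_KEY`, `email_digest`, `make_token` and `check_token` are hypothetical names.

```python
import hashlib
import hmac

# Assumption: a secret kept only in the server configuration.
SERVER_KEY = b"replace-with-a-random-secret"


def email_digest(email: str) -> str:
    """Return the value stored in place of the address itself."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(SERVER_KEY, normalized, hashlib.sha256).hexdigest()


def make_token(digest: str, action: str, now: float, ttl: int = 3600) -> str:
    """Build the token embedded in the emailed confirmation link."""
    expires = str(int(now) + ttl)
    payload = f"{digest}:{action}:{expires}"
    sig = hmac.new(SERVER_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"


def check_token(token: str, now: float) -> bool:
    """Accept the click only if the signature matches and the link has not expired."""
    try:
        payload, sig = token.rsplit(":", 1)
        expires = int(payload.rsplit(":", 1)[1])
    except (ValueError, IndexError):
        return False
    expected = hmac.new(SERVER_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    same = hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8"))
    return same and now < expires
```

On a request the server would compute `email_digest()` of the submitted address, look the account up by that digest, and send `make_token(...)` to the address from the request, so the plain address never needs to be stored.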
>
> Each user (email) would have a hosts** limit (probably set in server
> configuration), 2 or 3 by default (enough for an average user, not enough to
> easily spoil the data). After some time of inactivity a host/account would be
> removed.

Just make the host limit configurable and we can debate defaults
later. Certainly there should be a limit and there should be some way
to request more machines; we don't want users to create lots of
different accounts to contain their legitimate data. Retiring
inactive accounts is a good idea, +1

>
> The problem starts if one needs more hosts per account. Right
> now I have some ideas (none very good):
> - the easiest method to implement is "please email our admin and explain why
> you need them", but it's user unfriendly and admin unfriendly.
> - give a really big limit on the hosts per email - it would be easy to inject
> a lot of false data, but it's easier to remove than with the 2006 auth
> (identify wrongdoing emails and delete their hosts).
> - require users to give some non-free (free as in beer) email to reduce the
> possibility of using fake emails, and give a big hosts limit.

This is where you start to, as my work would call it, split the normal
users from the premier users. As long as the percentage of users who
need more than the default # of hosts is low, keeping it manual is OK.
What I might recommend is actually setting the limit fairly high at
first until you get your first batch of users, then look at something
like the 90th percentile of hosts per account and lower the max to that.

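A minimal sketch of that percentile check, assuming the nearest-rank definition of a percentile and an integer `percent`; the function name and the sample numbers are made up for illustration.

```python
def nearest_rank_percentile(values, percent):
    """Smallest value that has at least `percent` % of the data at or below it."""
    if not values:
        raise ValueError("no accounts to measure")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, so no floating point rounding.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical sample: number of hosts registered by each of ten accounts.
hosts_per_account = [1, 1, 1, 2, 2, 2, 3, 3, 5, 40]
suggested_limit = nearest_rank_percentile(hosts_per_account, 90)
```

With this sample the one account holding 40 hosts does not drag the limit up; it would be the kind of case handled manually.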
>
> I'd try to keep the need to click emailed links to a minimum: registration and
> administrative tasks (like removing hosts from an account).
>
>
>
> Components:
>
> Client
>
> Probably in python to take advantage of all the work portage developers have
> done and save me work. It'd be a simple run-me-from-the-command-line (cron)
> program sending arch, all installed cpv and their USE (for sure and before the
> end of summer) and maybe some more if time allows ("A daemon with 2 working
> modules is better than a daemon with 10 half finished ones."). Maybe [if
> time allows] a GUI wrapper to run it in the tray.

FYI, I have a working python client for portage/pkgcore/paludis that
collects this data and outputs some XML (that I was later going to
POST to a RESTful interface..that I never wrote ;p)

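To make the shape of such a report concrete, here is a rough sketch of serializing arch, cpv and USE data to XML with the standard library. The element and attribute names are assumptions, not the format of the existing client, and the package data is a hard-coded sample rather than something read from a package manager.

```python
import xml.etree.ElementTree as ET


def build_report(arch, packages):
    """Serialize one host's data; `packages` maps a cpv string to its enabled USE flags."""
    root = ET.Element("host", {"arch": arch})
    for cpv in sorted(packages):
        pkg = ET.SubElement(root, "package", {"cpv": cpv})
        for flag in sorted(packages[cpv]):
            ET.SubElement(pkg, "use").text = flag
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data; a real client would ask the package manager for it.
report = build_report("amd64", {
    "app-editors/vim-7.2": ["python", "acl"],
    "sys-apps/portage-2.1.6": [],
})
```

The resulting string is what would go into the body of the POST request.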
>
> Server:
> Would be split into several independent programs (to save me work). All
> except the first one would be written in python.
>
> User communication:
> Thanks for the REST idea - I thought about using HTML/HTTPS but making it
> stateless saves a lot of work. To save me some work I'd start with Apache +
> php + MySQL, one path per action (register host, register user, send data,
> ...). It'd put received data in MySQL (not verify if it's correct, simply
> get the data and put it in a table with information on who sent it and when).
> It's not a very elegant solution (and may turn out to be slow) so, if time
> allows and there is a need to, I'll rewrite it in python.

REST is HTTP, and you could do it with HTML, but it would be ugly ;p

This stuff sounds pretty standard, and I think if any prospective
students have developed webapps before it's probably not very time
consuming. I would prefer python over PHP for maintainability within
gentoo, but if you feel most comfortable in PHP I'm not going to make
you use a specific language provided the code is readable.

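For illustration, "one path per action" with stateless requests could look roughly like the dispatch table below. It is only a sketch: the paths, handler names, JSON bodies and the in-memory `SUBMISSIONS` list (standing in for the database table) are all assumptions, and a real service would sit behind a web server or framework.

```python
import json

# Hypothetical in-memory stand-in for the table of raw submissions.
SUBMISSIONS = []


def register_user(body):
    return 201, {"registered": body["email_digest"]}


def send_data(body):
    SUBMISSIONS.append(body)  # stored as received, to be verified later
    return 202, {"queued": len(SUBMISSIONS)}


# One path per action; every request carries everything needed to handle it.
ROUTES = {
    ("POST", "/users"): register_user,
    ("POST", "/data"): send_data,
}


def dispatch(method, path, raw_body):
    """Map a request to its handler and return (status code, response dict)."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, {"error": "unknown action"}
    try:
        return handler(json.loads(raw_body))
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "malformed request"}
```

Because no handler relies on a session, adding an action means adding one function and one entry in `ROUTES`.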
>
> Data gathering:
> It'd take data provided by the user communication module, decompress it, apply
> deltas etc. to create all-the-information-available about the current state of
> hosts.

Here I'm thinking of a reporting API that generates datasets. Someone
else (or you) can later render the report with javascript or
something.

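As a rough picture of the "apply deltas" step that such a reporting API would sit on top of, assuming a delta is a dict with `added` (cpv to USE flags) and `removed` (list of cpvs); that layout is invented here, not taken from the proposal.

```python
def apply_delta(state, delta):
    """Return a new host state with the delta applied; the input is left untouched."""
    updated = dict(state)
    for cpv in delta.get("removed", []):
        updated.pop(cpv, None)  # ignore removals of packages we never saw
    updated.update(delta.get("added", {}))
    return updated


def current_state(deltas):
    """Rebuild the current state of one host by replaying its deltas in order."""
    state = {}
    for delta in deltas:
        state = apply_delta(state, delta)
    return state
```

Replaying from an empty state means a full submission is just a delta whose `added` part lists everything.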
>
> Cleaner:
> Run from time to time (by cron, frequency adjusted to needs). Remove hosts
> and users that do not send data (to conserve space) etc.

Probably a trivial add-on feature provided we store last modification data.

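A sketch of how small that could be, assuming each host record carries a last-submission timestamp; the 90-day default and the names are placeholders.

```python
from datetime import datetime, timedelta


def stale_hosts(last_seen, now, max_age=timedelta(days=90)):
    """Return the ids of hosts whose last submission is older than `max_age`."""
    return sorted(host for host, seen in last_seen.items() if now - seen > max_age)
```

A cron job would call this with the current time and delete whatever it returns.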
>
> Archiver:
> Run from time to time (cron, as needed). Data gathering provides only
> information about hosts *right now*. The archiver would generate statistics
> (like package popularity (% of hosts that installed it)) and store them to
> make historical data available (storing the full history of all host states
> would be extremely excessive).

I'll call this a stretch goal (e.g. we think about it but delay
implementation until the dataset gets fairly large).

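For when it does get picked up, the popularity statistic itself is simple. A minimal sketch, assuming the current state is a mapping from host id to a list of installed package names; the snapshot layout is made up.

```python
def popularity_snapshot(hosts, day):
    """Percentage of hosts that have each package installed, tagged with a date."""
    total = len(hosts)
    counts = {}
    for packages in hosts.values():
        for name in set(packages):  # count each package once per host
            counts[name] = counts.get(name, 0) + 1
    popularity = {name: 100.0 * n / total for name, n in counts.items()}
    return {"day": day, "hosts": total, "popularity": popularity}
```

Storing one such snapshot per run keeps the history small, since it grows with the number of packages rather than with hosts times runs.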
>
> * The one-way part means that there is no easy way to get users' emails even
> if someone gets access to all the data stored on the server. The injective
> part means that no two emails will generate the same output, so no two users
> will get the same account. Hashes won't work because they are not injective
> functions, but I'm almost sure someone has already written functions like
> that. I don't recall any right now, but I'll have plenty of time to look for
> them or, in the worst case, write one myself (easy: create an asymmetric pair
> of keys, throw the private one to /dev/null so no one can decrypt, and encrypt
> emails with the public one).
>
> ** 1 host == 1 data set (installed packages, arch etc.)
>
>
>
> I realized I forgot to say who I am:
>
> I'm Joachim, I live in Poland (UTC + 1). I study mathematics (3rd year). I've
> used Gentoo since 2005 (as main desktop OS) or 2004 (first contact). I've coded
> since 2003 (training for/participating in http://www.oi.edu.pl English version
> available - look in the top right corner) or 2001 (started to play with
> Vbasic). Cannot drink black tea. Right now extremely tired after a sleepless
> weekend (due to several breakdowns at home).
> Good night.
>