On Mon, Mar 23, 2009 at 4:40 PM, Joachim Bartosik <jbartosik@...> wrote:
> This idea looks interesting so if you don't mind I'll join the thread.
> I tried to cut everything short but it still looks too long :/ And trying
> to keep it short probably made some parts hard to understand, so please ask.
> If you see * scroll down to end of email for explanations.
>
>
>> >> There have been many stats projects in the past that have failed due to
>> >> various reasons. A simple question is: How are you planning on making
>> >> your idea/proposal not fail? ;)
>
> By being lazy and putting as much work on others as possible.
>
> Authentication/ security overview
>
> The idea from 2006 ( to create an account one has to ask for an id and submit
> some data) makes usage very simple for users ( they don't even need to know
> anything about authorisation), but unfortunately it's very easy to write a
> "client" that would submit a lot of bogus data and spoil the data set ( I guess
> that's a major issue with authentication and security).
I'm not saying the prospective student has to necessarily tackle these
concerns. I just want to raise them as concerns because people
occasionally get bored and try to screw with online apps. I'd hope
that the gentoo-stats app is under most people's radar, but who knows.
If nothing else I'm hoping that some thought is given to
authenticating data; the whole point of this project is data
collection for the community and having bad data means a bad
experience for users and crappy data for us.
>
> To solve this problem I'd use a solution that is less comfortable for users:
> the user would have to create an account using an email ( of course it
> wouldn't be stored, I'd store some one-way injective function of it*) and
> click an emailed link. There would be no need for a password - to confirm
> his[her] actions [s]he would just click an emailed link.
I'm a bit concerned: if the function is one-way, how will you know
where to send these email links? Would the user have to input their
email address when making changes, and you gather it from the POST/GET
data?
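
To make the question concrete, here is a minimal sketch of one way it could
work, assuming the user re-enters the address on every change and the server
stores only a keyed SHA-256 digest of it. Note that this swaps a keyed hash in
for the strictly injective function from the proposal. SERVER_KEY, the function
names and the lower-casing of the address are all assumptions for illustration,
not part of any existing code.

```python
import hashlib
import hmac

# Hypothetical server-side secret; in practice it would come from configuration.
SERVER_KEY = b"change-me"


def normalize_email(email: str) -> str:
    """Strip and lower-case the address (an assumption; local parts can be case-sensitive)."""
    return email.strip().lower()


def account_id(email: str) -> str:
    """One-way identifier stored in place of the address (keyed SHA-256)."""
    data = normalize_email(email).encode("utf-8")
    return hmac.new(SERVER_KEY, data, hashlib.sha256).hexdigest()


def make_token(email: str, action: str) -> str:
    """Token embedded in the emailed link; it binds the account to one action."""
    msg = f"{account_id(email)}:{action}".encode("utf-8")
    return hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()


def verify_token(email: str, action: str, token: str) -> bool:
    """Recompute the token from the submitted address and compare in constant time."""
    return hmac.compare_digest(make_token(email, action), token)
```

With this layout the address only exists on the server for the duration of the
request that sends the link; everything stored is derived from account_id().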
>
> Each user ( email) would have a hosts** limit ( probably set in server
> configuration) 2 or 3 by default ( enough for average user, not enough to
> easily spoil data). After some time of inactivity host/ account would be
> removed.
Just make the host limit configurable and we can debate defaults
later. Certainly there should be a limit and there should be some way
to request more machines; we don't want users to create lots of
different accounts to contain their legitimate data. Retiring
inactive accounts is a good idea, +1
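
A small sketch of what a configurable limit with a per-account override could
look like. The Account class, DEFAULT_HOST_LIMIT and register_host() are
hypothetical names, and the default of 3 is only a placeholder until we settle
on real numbers.

```python
from dataclasses import dataclass, field

# Placeholder default; meant to be read from the server configuration.
DEFAULT_HOST_LIMIT = 3


@dataclass
class Account:
    account_id: str
    host_limit: int = DEFAULT_HOST_LIMIT  # can be raised per account on request
    hosts: set = field(default_factory=set)


def register_host(account: Account, host_id: str) -> bool:
    """Add a host unless the account is already at its limit."""
    if host_id in account.hosts:
        return True  # re-registering a known host is a no-op
    if len(account.hosts) >= account.host_limit:
        return False  # caller should tell the user how to request more
    account.hosts.add(host_id)
    return True
```

Granting more machines is then just a matter of raising host_limit on that one
account, without touching the global default.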
>
> The problem starts if one would need to get more hosts per account, right
> now I have some ideas ( none very good):
> - the easiest to implement method is "please email our admin and explain why
> you need them" but it's user unfriendly and admin unfriendly.
> - give a really big limit on hosts per email - it would be easy to inject
> a lot of false data, but it's easier to remove than with the 2006 auth
> ( identify wrongdoing emails and delete their hosts).
> - require users to give some non-free ( free as in beer) email to reduce
> the possibility of using fake emails, and give a big hosts limit.
This is where you start to, as my work would call it, split the normal
users from the premier users. As long as the percentage of users who
need more than the default # of hosts is low, keeping it manual is OK.
What I might recommend is actually setting the limit fairly high at
first until you get your first batch of users, then look at like the
90th percentile of hosts per account and lower the max to that.
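
For the percentile step, a quick sketch using the nearest-rank definition (one
of several common percentile definitions). The function name is made up and the
sample numbers are invented purely to show the calculation.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct percent of the data is at or below it."""
    if not values:
        raise ValueError("no data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented hosts-per-account counts from a hypothetical first batch of users.
hosts_per_account = [1, 1, 1, 2, 2, 2, 3, 3, 4, 12]
new_limit = percentile_nearest_rank(hosts_per_account, 90)
```

In this made-up sample the one account with 12 hosts would end up above the new
limit and fall into the "ask for more" path.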
>
> I'd try to keep the need to click emailed links to a minimum - registration
> and administrative tasks ( like removing hosts from an account).
>
>
>
> Components:
>
> Client
>
> Probably in python to take advantage of all the work portage developers have
> done and save me work. It'd be a simple run-me-from command line ( cron)
> program sending arch, all installed cpv and their USE ( for sure and before
> end of summer) and maybe some more if time allows ( "A daemon with 2 working
> modules is better than a daemon with 10 half finished ones."). Maybe [if
> time allows] GUI wrapper to run it in tray.
FYI, I have a working python client for portage/pkgcore/paludis that
collects this data and outputs some XML (that I was later going to
POST to a RESTful interface..that I never wrote ;p)
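
A rough sketch of that collect-then-POST flow, using only the standard library.
The element names, the URL and the payload layout are assumptions for
illustration only, and the package data is passed in by hand rather than read
from a package manager.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_report(arch, packages):
    """Serialize arch plus {cpv: [USE flags]} into a small XML document (bytes)."""
    root = ET.Element("host", {"arch": arch})
    for cpv in sorted(packages):
        pkg = ET.SubElement(root, "package", {"cpv": cpv})
        for flag in sorted(packages[cpv]):
            ET.SubElement(pkg, "use").text = flag
    return ET.tostring(root, encoding="utf-8")


def build_request(url, payload):
    """Prepare the POST without sending it; urllib.request.urlopen(req) would send it."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


payload = build_report("amd64", {"app-editors/vim-7.2": ["nls", "acl"]})
# Hypothetical endpoint; no such server exists yet.
req = build_request("https://stats.example.org/hosts/HOST_ID/packages", payload)
```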
>
> Server:
> Would be split into several independent programs ( to save me work). All
> except first one would be written in python.
>
> User communication:
> Thanks for the REST idea - I thought about using HTML/ HTTPS but making it
> stateless saves a lot of work. To save me some work I'd start with Apache +
> php + MySQL, one path per action ( register host, register user, send data,
> ...). It'd put received data in MySQL ( not verify if it's correct, simply
> get the data and put it in a table with information on who sent it and when).
> It's not a very elegant solution ( and may turn out to be slow) so - if time
> allows and there is a need to - I'll rewrite it in python.
REST is HTTP, and you could do it with HTML, but it would be ugly ;p
This stuff sounds pretty standard, and I think if any prospective
students have developed webapps before it's probably not very time
consuming. I would prefer python over PHP for maintainability within
gentoo, but if you feel most comfortable in PHP I'm not going to make
you use a specific language provided the code is readable.
>
> Data gathering:
> It'd take data provided by user communication module, decompress it, apply
> deltas etc. to create all-the-information-available about current state of
> hosts.
Here I'm thinking a reporting API that generates datasets. Someone
else (or you) can later render the report with javascript or
something.
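
As for the decompress-and-apply-deltas step quoted above, a minimal sketch of
what it might look like. The delta format (zlib-compressed JSON with "added" and
"removed" keys) is purely an assumption; nothing in this thread fixes a wire
format yet.

```python
import json
import zlib


def apply_delta(state, compressed_delta):
    """Return a new host state ({cpv: [USE flags]}) with one delta applied."""
    delta = json.loads(zlib.decompress(compressed_delta).decode("utf-8"))
    new_state = dict(state)  # leave the previous state untouched
    for cpv in delta.get("removed", []):
        new_state.pop(cpv, None)
    new_state.update(delta.get("added", {}))
    return new_state
```

A reporting API could then generate its datasets from the resulting per-host
states without caring how the data arrived.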
>
> Cleaner:
> Run from time to time ( by cron, frequency adjusted to needs). Remove hosts
> and users that do not send data ( to conserve space) etc.
Probably a trivial add-on feature provided we store the last modification date.
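
Something along these lines, as a sketch only: sqlite3 stands in for MySQL so
the example is self-contained, and the hosts table with an integer
last_modified column (Unix time) is a hypothetical schema.

```python
import sqlite3
import time


def remove_stale_hosts(conn, max_age_seconds, now=None):
    """Delete hosts not modified within max_age_seconds; return how many were removed."""
    now = time.time() if now is None else now
    cutoff = int(now) - max_age_seconds
    cur = conn.execute("DELETE FROM hosts WHERE last_modified < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Run from cron, the only tunable is max_age_seconds, which belongs in the same
configuration as the host limit.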
>
> Achiever:
> Run from time to time ( cron, as needed). Data gathering provides only
> information about hosts *right now*. Achiever would generate statistics (
> like package popularity ( % hosts that installed it)) and store them to make
> historical data available ( storing all host states history would be
> extremely excessive).
I'll call this a stretch goal (eg we think about it but delay
implementation until the dataset gets fairly large).
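
For reference, the popularity number itself is cheap to compute once the
current host states exist. A sketch, assuming the state is a plain mapping of
host id to installed package names (a made-up in-memory shape, not a proposed
schema):

```python
from collections import Counter


def package_popularity(hosts):
    """Map package name -> percentage of hosts that have it installed."""
    if not hosts:
        return {}
    counts = Counter()
    for installed in hosts.values():
        counts.update(set(installed))  # count each package once per host
    return {pkg: 100.0 * n / len(hosts) for pkg, n in counts.items()}
```

Storing one such snapshot per run is what keeps history available without
keeping every host's full state over time.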
>
> * The one-way part means that there is no easy way to get users' emails even
> if someone gets access to all the data stored on the server. The injective
> part means that no two emails will generate the same output, so no two users
> will get the same account. Hashes won't work because they are not injective
> functions but I'm almost sure someone already wrote functions like that. I
> don't recall any right now, but I'll have plenty of time to look for them or
> in the worst case write one myself ( easy: create an asymmetric pair of keys,
> throw the private one to /dev/null so no one can decrypt it, and encrypt
> emails with the public one).
>
> ** 1 host == 1 data set ( installed packages, arch etc.)
>
>
>
> I realized I forgot to say who I am:
>
> I'm Joachim, I live in Poland ( UTC + 1). Study mathematics ( 3rd year). Use
> Gentoo since 2005 ( as main desktop OS) or 2004 ( first contact). Code since
> 2003 ( training for/ participating in http://www.oi.edu.pl English version
> available - look in the right top corner) or 2001 ( started to play with
> Vbasic ). Cannot drink black tea. Right now extremely tired after sleepless
> weekend ( due to several breakdowns at home).
> Good night.
>