On Sun, Oct 14, 2012 at 8:08 PM, Brian Harring <ferringb@×××××.com> wrote:
> Probably not the answer you want... but I suggest you valgrind it. If
> the behaviour is differing for Repository()[some-hash], that makes me
> think there is a ref counting bug afoot.

Ugh.

>
> Also, wtf is dumbo?

It is a simplified framework for doing parallel jobs that can run on
Hadoop. That said, after finding some documentation on just running
Python on Hadoop directly, I'm considering just going that route
(should be easy to change).

> And can you clarify how this is doing it's
> validation, rough runtime, etc?

See my earlier posts on this list. Basically the goal is to break
both git/cvs into sorted lists of files and their individual histories
and then compare them. They would be one commit per line. Due to how
Hadoop works I'm just using CSV with a few fields in base64.
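
For illustration, a record could be serialized roughly like this (an
untested sketch - the field names here, path/commit/timestamp/message,
are placeholders rather than the script's actual layout):

    import base64, csv, io

    def encode_record(path, commit_id, timestamp, message):
        # One per-file history entry per line; the free-form field is
        # base64-encoded so embedded commas/newlines can't break the format.
        buf = io.StringIO()
        csv.writer(buf).writerow([
            path,
            commit_id,
            str(timestamp),
            base64.b64encode(message.encode("utf-8")).decode("ascii"),
        ])
        return buf.getvalue().rstrip("\r\n")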

If you're looking at the Python file I linked to, keep in mind that
what it does is take in a list of trees/blobs in such a format,
descend one level on each, remove duplicates, and output a new list
of trees/blobs. That would be iterated. Since each commit typically
only changes a few files at the bottom of the tree, the list would not
be expected to grow much on each iteration (all but one entry at the
top level of the tree would be a duplicate from one commit to the
next). However, if several things did change, they'd end up in the
list.
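
As a rough illustration of that map step (not the actual script - this
shells out to git ls-tree and assumes a stripped-down "commit,type,sha"
record rather than the real base64 fields):

    import subprocess, sys

    def children(tree_sha):
        # Immediate entries of a tree: "<mode> <type> <sha>\t<name>" per line.
        out = subprocess.check_output(["git", "ls-tree", tree_sha])
        for entry in out.decode().splitlines():
            meta, _name = entry.split("\t", 1)
            _mode, otype, sha = meta.split()
            yield otype, sha

    seen = set()
    for line in sys.stdin:
        commit, otype, sha = line.strip().split(",")
        if otype == "blob":
            records = [(otype, sha)]       # blobs pass through unchanged
        else:
            records = list(children(sha))  # descend one level into the tree
        for rec in records:
            key = (commit,) + rec
            if key not in seen:            # drop duplicates
                seen.add(key)
                print("%s,%s,%s" % key)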

This treatment allows each record to be treated entirely in parallel.
You start out with 1.2M of them, which is how many commits there are
in the git repository (at least the one I'm testing on); the list
grows 5-40x after the map step and shrinks to just over 1x on the
reduce step. When you run a step and either end up with all blobs or
unchanged output, you know you're done (not sure what the fastest way
to detect that condition is - the list isn't THAT big, so grep is
probably good enough on the interim files).
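
Checking for that could be as simple as this sketch (again using the
stripped-down record layout from above, with both files assumed to be
sorted):

    def finished(prev_path, curr_path):
        # Done when every record is a blob, or when a step's output is
        # identical to its input.
        with open(curr_path) as f:
            curr = f.readlines()
        if all(line.split(",")[1] == "blob" for line in curr):
            return True
        with open(prev_path) as f:
            return f.readlines() == curr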

> Minimally, it's going to need to go
> parallel- I rebuilt the conversion bits (parallelized along category
> lines, then reintegrating after the fact), and it's around 50m for
> run- validation being equivalent would definitely be preferable.

That was my concern as well. The solution is parallel and I'm looking
to run it on Hadoop. In theory the cvs side should be easier to do -
I'm running into more issues with the tools than with the software
itself. It works just fine on a few thousand commits with my last
design. Actually, once I ditch the dumbo framework I might see how it
does without a cluster - I was running into RAM issues before, and
the new design would operate as a pipe. As long as you chop up the
input file at line boundaries, feed the pieces to the map step, and
concatenate/sort the map output before feeding it to reduce, anything
that runs the script in parallel would work.
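
In other words, a single-host driver could look something like this
sketch (map_step and reduce_step are hypothetical stand-ins for
whatever the dumbo-free script ends up exposing; each takes and
returns a list of lines):

    import multiprocessing

    def run_step(in_path, out_path, nproc=4):
        with open(in_path) as f:
            lines = f.readlines()            # chop input at line boundaries
        size = max(1, (len(lines) + nproc - 1) // nproc)
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
        pool = multiprocessing.Pool(nproc)
        mapped = pool.map(map_step, chunks)  # each chunk handled independently
        pool.close()
        pool.join()
        merged = sorted(l for part in mapped for l in part)  # concatenate/sort
        with open(out_path, "w") as f:
            f.writelines(reduce_step(merged))  # reduce sees sorted input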

In fact, one of my concerns is just the time it takes to transfer that
2GB git repository over to a cluster and get everything running -
especially if you've got the conversion time down to 50m (I heard we
were at 8h previously). I was hoping to have the testing done in
under an hour.

It shouldn't take more than a few days for me to get around to
modifying my script to not require dumbo and just testing it out on
datasets in serial. If I can get the speed up, we might just be able
to run it on whatever host you're using for the conversion, and that
would eliminate a lot of hassle.

>
> @mhagger; details of that I'll share in a bit- roughly, it's
> exploiting the fact gentoo-x86 /never/ has cross category commits (no
> one does repo wide commits, although detection/merging of that I'm
> checking for).

Well, my validation only checks that the per-file history is correct,
which I think is all cvs really lets you do anyway. So, it won't
distinguish between a->a' then b->b' vs (a,b)->(a',b'). However, if
your code is somehow losing changes, it will certainly find that.
Also, if the timestamps on two commits are the same, it will not
distinguish between the order of those two commits, but I don't think
that info is really in cvs either. I haven't gotten far enough to
discover if there are any issues with time resolution/conversion/etc.
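
The final comparison could then be little more than this sketch
(assuming each side has been boiled down to one per-file change per
line in a comparable format; sorting also makes the order of
same-timestamp commits irrelevant):

    def histories_match(git_dump, cvs_dump):
        # One per-file change per line on each side; any difference in the
        # sorted lists means a commit was lost or mangled.
        with open(git_dump) as f:
            git_lines = sorted(f)
        with open(cvs_dump) as f:
            cvs_lines = sorted(f)
        return git_lines == cvs_lines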

I'm certainly interested in seeing what you came up with. I take it
that you're writing something from scratch here?

Rich