On Sun, Oct 14, 2012 at 8:08 PM, Brian Harring <ferringb@×××××.com> wrote:
> Probably not the answer you want... but I suggest you valgrind it. If
> the behaviour is differing for Repository()[some-hash], that makes me
> think there is a ref counting bug afoot.

Ugh.

>
> Also, wtf is dumbo?

It is a simplified framework for doing parallel jobs that can run on
Hadoop. That said, after finding some documentation on just running
Python on Hadoop directly, I'm considering just going that route
(should be easy to change).

> And can you clarify how this is doing it's
> validation, rough runtime, etc?

See my earlier posts on this list. Basically the goal is to break
both git/cvs into sorted lists of files and their individual histories
and then compare them. They would be one commit per line. Due to how
Hadoop works I'm just using CSV with a few fields in base64.
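
For illustration, a record could be serialized roughly like this (an
untested sketch - the field names here, path/commit/timestamp/message,
are placeholders rather than the script's actual layout):

    import base64, csv, io

    def encode_record(path, commit_id, timestamp, message):
        # One per-file history entry per line; the free-form field is
        # base64-encoded so embedded commas/newlines can't break the format.
        buf = io.StringIO()
        csv.writer(buf).writerow([
            path,
            commit_id,
            str(timestamp),
            base64.b64encode(message.encode("utf-8")).decode("ascii"),
        ])
        return buf.getvalue().rstrip("\r\n")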

If you're looking at the Python file I linked to, keep in mind that
what it does is take in a list of trees/blobs in such a format,
descend one level on each, remove duplicates, and output a new list
of trees/blobs. That would be iterated. Since each commit typically
only changes a few files at the bottom of the tree, the list would not
be expected to grow much on each iteration (all but one entry at the
top level of the tree would be a duplicate from one commit to the
next). However, if several things did change, they'd end up in the
list.
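
As a rough illustration of that map step (not the actual script - this
shells out to git ls-tree and assumes a stripped-down "commit,type,sha"
record rather than the real base64 fields):

    import subprocess, sys

    def children(tree_sha):
        # Immediate entries of a tree: "<mode> <type> <sha>\t<name>" per line.
        out = subprocess.check_output(["git", "ls-tree", tree_sha])
        for entry in out.decode().splitlines():
            meta, _name = entry.split("\t", 1)
            _mode, otype, sha = meta.split()
            yield otype, sha

    seen = set()
    for line in sys.stdin:
        commit, otype, sha = line.strip().split(",")
        if otype == "blob":
            records = [(otype, sha)]       # blobs pass through unchanged
        else:
            records = list(children(sha))  # descend one level into the tree
        for rec in records:
            key = (commit,) + rec
            if key not in seen:            # drop duplicates
                seen.add(key)
                print("%s,%s,%s" % key)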

This treatment allows each record to be treated entirely in parallel.
You start out with 1.2M of them, which is how many commits there are
in the git repository (at least the one I'm testing on); the list
grows 5-40x after the map step and shrinks to just over 1x on the
reduce step. When you run a step and either end up with all blobs or
unchanged output, you know you're done (not sure what the fastest way
to detect that condition is - the list isn't THAT big, so grep is
probably good enough on the interim files).
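
Checking for that could be as simple as this sketch (again using the
stripped-down record layout from above, with both files assumed to be
sorted):

    def finished(prev_path, curr_path):
        # Done when every record is a blob, or when a step's output is
        # identical to its input.
        with open(curr_path) as f:
            curr = f.readlines()
        if all(line.split(",")[1] == "blob" for line in curr):
            return True
        with open(prev_path) as f:
            return f.readlines() == curr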

> Minimally, it's going to need to go
> parallel- I rebuilt the conversion bits (parallelized along category
> lines, then reintegrating after the fact), and it's around 50m for
> run- validation being equivalent would definitely be preferable.

That was my concern as well. The solution is parallel and I'm looking
to run it on Hadoop. In theory the cvs side should be easier to do -
I'm running into more issues with the tools than with the software
itself. It works just fine on a few thousand commits with my last
design. Actually, once I ditch the dumbo framework I might see how it
does without a cluster - I was running into RAM issues before, and
the new design would operate as a pipe. As long as you chop up the
input file at line boundaries, feed the pieces to the map step, and
concatenate/sort the map output before feeding it to reduce, anything
that runs the script in parallel would work.
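
In other words, a single-host driver could look something like this
sketch (map_step and reduce_step are hypothetical stand-ins for
whatever the dumbo-free script ends up exposing; each takes and
returns a list of lines):

    import multiprocessing

    def run_step(in_path, out_path, nproc=4):
        with open(in_path) as f:
            lines = f.readlines()            # chop input at line boundaries
        size = max(1, (len(lines) + nproc - 1) // nproc)
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
        pool = multiprocessing.Pool(nproc)
        mapped = pool.map(map_step, chunks)  # each chunk handled independently
        pool.close()
        pool.join()
        merged = sorted(l for part in mapped for l in part)  # concatenate/sort
        with open(out_path, "w") as f:
            f.writelines(reduce_step(merged))  # reduce sees sorted input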

In fact, one of my concerns is just the time it takes to transfer that
2GB git repository over to a cluster and get everything running -
especially if you've got the conversion time down to 50m (I heard we
were at 8h previously). I was hoping to have the testing done in
under an hour.

It shouldn't take more than a few days for me to get around to
modifying my script to not require dumbo and just testing it out on
datasets in serial. If I can get the speed up, we might just be able
to run it on whatever host you're using for the conversion, and that
would eliminate a lot of hassle.

>
> @mhagger; details of that I'll share in a bit- roughly, it's
> exploiting the fact gentoo-x86 /never/ has cross category commits (no
> one does repo wide commits, although detection/merging of that I'm
> checking for).

Well, my validation only checks that the per-file history is correct,
which I think is all cvs really lets you do anyway. So, it won't
distinguish between a->a' then b->b' vs (a,b)->(a',b'). However, if
your code is somehow losing changes, it will certainly find that.
Also, if the timestamps on two commits are the same, it will not
distinguish between the order of those two commits, but I don't think
that info is really in cvs either. I haven't gotten far enough to
discover if there are any issues with time resolution/conversion/etc.
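
The final comparison could then be little more than this sketch
(assuming each side has been boiled down to one per-file change per
line in a comparable format; sorting also makes the order of
same-timestamp commits irrelevant):

    def histories_match(git_dump, cvs_dump):
        # One per-file change per line on each side; any difference in the
        # sorted lists means a commit was lost or mangled.
        with open(git_dump) as f:
            git_lines = sorted(f)
        with open(cvs_dump) as f:
            cvs_lines = sorted(f)
        return git_lines == cvs_lines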

I'm certainly interested in seeing what you came up with. I take it
that you're writing something from scratch here?

Rich