I got my git walker working under Hadoop / StarCluster on EC2. Sources
are at:

git://github.com/rich0/gitvalidate.git

It can run fine serially from the shell - edit parsetrees.py and
maptree.py to fix the hard-coded path to the git repo. parsetrees.py
takes as an option how far down the tree to go - just enter a number
greater than 1.3M or so to get the whole thing, or less if you just
want to test it. The output of that is the input to maptree.py. The
output of maptree.py needs sorting, and then becomes the input to
reducetree.py.
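Roughly, one round by hand looks like this - the depth argument, the
stdin/stdout plumbing, and the file names are just my shorthand here,
so check the scripts for the real interface:

  # One round done serially; plain sort is enough because it keeps all
  # lines with the same (first tab-delimited) key adjacent.
  ./parsetrees.py 2000000 > parsed.txt    # anything > ~1.3M covers the whole repo
  ./maptree.py < parsed.txt > mapped.txt
  sort mapped.txt > sorted.txt
  ./reducetree.py < sorted.txt > round1.txt
  # round1.txt then goes back into maptree.py for the next round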
This is an iterative algorithm, and I found about 12 rounds were
needed. The first three are quite slow; after that you're only really
traversing the profiles, which goes pretty fast. You can do it all at
once with one big pipe (parse | map | sort | reduce | map | sort |
reduce | ...), but as the number of commits included grows you'll need
lots of temp space for the sorting.
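For example, two rounds chained into one pipe would look something
like this (same shorthand as above):

  # Each sort in the chain needs its own temp space, which is what
  # blows up as the number of commits carried forward grows.
  ./parsetrees.py 2000000 \
    | ./maptree.py | sort | ./reducetree.py \
    | ./maptree.py | sort | ./reducetree.py > round2.txt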
You can run that under Hadoop with the streaming module. I ran it on
EC2 with 6 m1.large instances and got through it all in about 2.5
hours. The first two rounds took over an hour each; after that it went
pretty fast. The algorithm should parallelize up to maybe 10-100k
nodes, so it can run about as fast as you're willing to pay for. The
downside to Hadoop is the hassle of getting it working, and that you
need to move the 1+GB git repository over to your cluster (which isn't
terribly fast in my case - even with FIOS and EC2). That is a fixed
overhead we really can't do much about.
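One round under streaming is shaped roughly like this - the jar
location and the input/output paths depend on your install, so treat
it as a sketch rather than a recipe:

  # Streaming runs the same scripts as mapper and reducer and handles
  # the sort/shuffle between them; -file ships the scripts to the nodes.
  hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
    -input gitvalidate/round1 \
    -output gitvalidate/round2 \
    -mapper maptree.py \
    -reducer reducetree.py \
    -file maptree.py \
    -file reducetree.py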
I'm going to set aside the git part of the problem for now, as that is
just optimization. The next step would be to find some easy way to run
this in parallel without Hadoop. For anybody interested in taking a
whack at that: you can chop up the input at line boundaries and run
the map steps in parallel as widely as you want. You then need to
cat/sort the output of all the maps. The output of that can be split
going into the reduce step, but you have to keep lines with the same
key together (the first tab-delimited column). I could probably do
that on the cheap with some bash scripting and GNU parallel - just set
up temp dirs for input/output and dump every job into a separate file.
I suspect that Hadoop will be hard to beat if you want to cluster it,
but for a single node you could make it a lot simpler.
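A single-node sketch of that idea (block size and file names are
arbitrary):

  # parallel --pipe chops stdin into blocks at line boundaries and runs
  # one maptree.py per block; the single sort + reduce at the end keeps
  # all lines with the same key together without any partitioning logic.
  ./parsetrees.py 2000000 \
    | parallel --pipe --block 32M ./maptree.py \
    | sort \
    | ./reducetree.py > round1.txt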
The next step is to get cvs into a similar format. My initial
thoughts:

1. Just run cvs log in the root, chop the output up at file
boundaries, base64-encode each per-file blob, and dump that into a
text file with one file per line (a rough sketch follows the list).
2. Distribute processing of each file, turning it into one line per
commit with all the info my git program dumps, except for the file
hash.
3. Distribute reading those lines, checking out that one version of
one file, calculating the hash, and outputting the full info.
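Step 1 might be as dumb as this - the "RCS file:" marker, GNU
csplit/base64, and the file names are all assumptions at this point:

  # Run from the root of a CVS checkout. Each per-file chunk of the log
  # becomes one base64 line, so later steps can treat it like the
  # line-oriented git dumps.
  cvs -q log > full.log
  csplit -z full.log '/^RCS file:/' '{*}'   # one xx* piece per file
  for f in xx*; do base64 -w0 "$f"; echo; done > cvslog-lines.txt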
Steps 1 and 2 might be cheaper to just combine, since you have to scan
the whole thing to chop it up anyway and the parsing can't be THAT
expensive.
If there are libs to make any of this easier I'm all ears, but it
seems like there isn't much out there for CVS - nothing like pygit2.
Once I have both I can start working on validation rules and perhaps
get feedback to the conversion team. We'll need to work out what does
and doesn't count as OK. We're transforming data during the migration,
so I need to take that into account: either the logic goes into the
compare function, or it goes into the dump side so that the compares
work out the same. Timestamps might force us to do logic during the
compare anyway.
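If the dump side does all the normalization, the compare itself can
stay trivial - something like this (file names made up):

  # This only works if both sides emit identical, normalized lines for
  # matching content; any timestamp fudging either happens before this
  # point or forces a smarter compare script.
  sort git-dump.txt > git-sorted.txt
  sort cvs-dump.txt > cvs-sorted.txt
  diff -u git-sorted.txt cvs-sorted.txt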
So, that's all for now. If anybody wants my notes for running Hadoop,
let me know - maybe I'll digest them into a blog entry. However, it
would be worthwhile to see if we can ditch Hadoop. If we are stuck
with it, then we can probably get the run time down to 30 minutes with
fewer than 100 nodes.

Rich