Gentoo Archives: gentoo-scm

From: Rich Freeman <rich0@g.o>
To: gentoo-scm@l.g.o
Subject: [gentoo-scm] Git Validation - Update
Date: Tue, 16 Oct 2012 17:59:31
Message-Id: CAGfcS_ncH2dRMH6bBWHpeCcMDLt+0_AKktf7rfOhPyqKnLH=9g@mail.gmail.com
I got my git walker working under hadoop / starcluster on EC2. Sources are at:
git://github.com/rich0/gitvalidate.git

It runs fine serially from the shell - edit parsetrees.py and
maptree.py to fix the hard-coded path to the git repo. parsetrees.py
takes as an option how far down the tree to go - just enter a number >
1.3M or so to get the whole thing, or less if you just want to test
it. The output of that is the input to maptree.py. The output of
maptree.py needs sorting, and then becomes the input to reducetree.py.

This is an iterative algorithm, and I found about 12 rounds were
needed. The first three are quite slow; after that you're only
really traversing the profiles, which goes pretty fast. You can do it
all at once with one big pipe (parse | map | sort | reduce | map | sort |
reduce | ...), but as the number of commits included grows you'll need
lots of temp space for the sorting.
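The map | sort | reduce contract above only depends on tab-delimited key/value lines, so the round structure can be sketched in a few lines of Python. The mapper/reducer below are hypothetical toy stand-ins just to show the contract, not the actual maptree.py / reducetree.py logic:

```python
from itertools import groupby

def run_round(lines, mapper, reducer):
    """One map | sort | reduce round over an iterable of text lines.

    mapper(line) yields "key<TAB>value" lines; reducer(key, group) sees
    every line sharing a key (the first tab-delimited column) together,
    exactly as the sort step of the shell pipeline guarantees.
    """
    mapped = [out for line in lines for out in mapper(line)]
    mapped.sort()  # stands in for the sort(1) stage of the pipe
    reduced = []
    for key, group in groupby(mapped, key=lambda l: l.split("\t", 1)[0]):
        reduced.extend(reducer(key, list(group)))
    return reduced

# Hypothetical toy mapper/reducer (a word count), just to show the shape:
def count_mapper(line):
    return ["%s\t1" % word for word in line.split()]

def count_reducer(key, group):
    return ["%s\t%d" % (key, len(group))]
```

The ~12 rounds described above are just this kind of round applied repeatedly to its own output.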

You can run that under hadoop with the streaming module. I ran it on
EC2 with 6 m1.large instances and got through it all in about 2.5
hours. The first 2 rounds took over an hour each; after that it
went pretty fast. The algorithm should parallelize up to maybe
10-100k nodes, so it can run arbitrarily fast. The downside to hadoop
is the hassle of getting it working, and that you need to move the
1+GB git repository over to your cluster (which isn't terribly fast in
my case, even with FIOS and EC2). That is a fixed overhead we really
can't do much about.
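Hadoop streaming imposes almost nothing on the scripts themselves: mapper and reducer just read stdin and write "key<TAB>value" lines to stdout, and streaming handles the sort and key grouping between them. A minimal streaming-compatible mapper skeleton (the record layout here is a made-up assumption, not what maptree.py actually emits):

```python
#!/usr/bin/env python
# Minimal Hadoop-streaming-compatible mapper: read records on stdin,
# emit "key<TAB>value" lines on stdout.  Streaming sorts these by key
# and feeds contiguous key groups to the reducer.
import sys

def map_line(line):
    # Hypothetical record layout: first whitespace-separated field is
    # the key, the rest is the value.
    fields = line.rstrip("\n").split()
    if fields:
        yield "%s\t%s" % (fields[0], " ".join(fields[1:]))

if __name__ == "__main__":
    for line in sys.stdin:
        for out in map_line(line):
            sys.stdout.write(out + "\n")
```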

I'm going to set aside the git part of the problem for now, as that is
just optimization. Next steps would be to find some easy way to run
this in parallel without hadoop. For anybody interested in taking a
whack at that: you can chop up the input at line boundaries and run
the map steps in as many parallel chunks as you want. You then need to
cat/sort the output of all the maps. The output of that can be split
going into the reduce step, but you have to keep lines with the same
key together (the first tab-delimited column). I think I could
probably do that on the cheap with some bash scripting and GNU
parallel - just set up temp dirs for input/output and dump every job
into a separate file. I suspect that hadoop will be hard to beat if
you want to cluster it, but for a single node you could make it a lot
simpler.
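The one subtle point above is splitting the sorted map output for the reducers without separating lines that share a key. A sketch of that boundary rule (the function name and chunk-size heuristic are mine, not from the scripts):

```python
def split_at_key_boundaries(sorted_lines, target_chunk_size):
    """Yield lists of lines of roughly target_chunk_size each, but never
    split lines sharing the same key (first tab-delimited column) across
    two chunks - a chunk only ends when the key changes."""
    chunk = []
    for line in sorted_lines:
        key = line.split("\t", 1)[0]
        if (len(chunk) >= target_chunk_size
                and chunk[-1].split("\t", 1)[0] != key):
            yield chunk
            chunk = []
        chunk.append(line)
    if chunk:
        yield chunk
```

Each yielded chunk can then be handed to an independent reduce process.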

Next step is to get cvs into a similar format. My initial thoughts:

1. Just run cvs log in the root, chop it up at file boundaries,
base64 encode each blob, and dump that into a text file, one file per
line.
2. Distribute processing of each file, turning it into one line per
commit with all the info my git program dumps, except the file hash.
3. Distribute reading those lines, checking out that one version of
one file, calculating the hash, and outputting the full info.
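Step 1 can be sketched directly. The only real assumption is the separator cvs log prints between per-file reports (a long line of '=' characters - worth verifying against real output before relying on it):

```python
import base64

# Assumed separator: cvs log prints a line of '=' characters between
# per-file reports.  Verify the exact width against real cvs output.
FILE_SEPARATOR = "=" * 77

def chop_log(log_text):
    """Yield one base64 blob per per-file section of a cvs log dump,
    suitable for writing out as one file per line of a text file."""
    for section in log_text.split(FILE_SEPARATOR + "\n"):
        section = section.strip("\n")
        if section:
            yield base64.b64encode(section.encode()).decode()
```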

Steps 1 and 2 might be cheaper to just combine, since you have to scan
the whole thing to chop it up anyway, and the parsing can't be THAT
expensive.

If there are libs to make any of this easier I'm all ears, but it
seems like there isn't much out there for cvs - nothing like pygit2.

Once I have both I can start working on validation rules and perhaps
get feedback to the conversion team. We'll need to work out what does
and doesn't count as OK. We're transforming data during the
migration, so I need to take that into account: either the logic goes
into the compare function, or it goes into the dump side so
that the compares work out the same. Timestamps might force us to do
logic during compare anyway.
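A compare function that tolerates known migration transforms might look like this. The record shape and the idea of simply dropping the timestamp field are my assumptions for illustration, not a decided validation rule:

```python
def normalize(record):
    """Strip fields the migration is expected to rewrite (here: the
    timestamp), so those rewrites don't count as validation failures."""
    r = dict(record)
    r.pop("timestamp", None)
    return r

def records_match(cvs_record, git_record):
    """Compare two per-commit records after normalizing both sides."""
    return normalize(cvs_record) == normalize(git_record)
```

Putting the logic here (rather than in the dump side) keeps both dumps faithful to their sources, at the cost of a smarter compare.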

So, that's all for now. If anybody wants my notes for running hadoop,
let me know - maybe I'll digest those and turn them into a blog entry.
However, it would be worthwhile to see if we can ditch hadoop. If we
are stuck with it, then we can probably get the time down to 30min with
< 100 nodes.

Rich

Replies

Subject Author
Re: [gentoo-scm] Git Validation - Update Peter Stuge <peter@×××××.se>
Re: [gentoo-scm] Git Validation - Update Michael Haggerty <mhagger@××××××××.edu>
[gentoo-scm] Re: Git Validation - Update Rich Freeman <rich0@g.o>