I got my git walker working under Hadoop / StarCluster on EC2. Sources
are at:

git://github.com/rich0/gitvalidate.git

It can run fine serially from the shell - edit parsetrees.py and
maptree.py to fix the hard-coded path to the git repo. parsetrees.py
takes as an option how far down the tree to go - just enter a number
greater than 1.3M or so to get the whole thing, or less if you just
want to test it. The output of that is the input to maptree.py. The
output of maptree.py needs sorting, and then becomes the input to
reducetree.py.
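Roughly, one round by hand looks like this - the depth argument, the
stdin/stdout plumbing, and the file names are just my shorthand here,
so check the scripts for the real interface:

  # One round done serially; plain sort is enough because it keeps all
  # lines with the same (first tab-delimited) key adjacent.
  ./parsetrees.py 2000000 > parsed.txt    # anything > ~1.3M covers the whole repo
  ./maptree.py < parsed.txt > mapped.txt
  sort mapped.txt > sorted.txt
  ./reducetree.py < sorted.txt > round1.txt
  # round1.txt then goes back into maptree.py for the next round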
This is an iterative algorithm, and I found about 12 rounds were
needed. The first three are quite slow; after that you're only really
traversing the profiles, which goes pretty fast. You can do it all at
once with one big pipe (parse | map | sort | reduce | map | sort |
reduce | ...), but as the number of commits included grows you'll need
lots of temp space for the sorting.
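For example, two rounds chained into one pipe would look something
like this (same shorthand as above):

  # Each sort in the chain needs its own temp space, which is what
  # blows up as the number of commits carried forward grows.
  ./parsetrees.py 2000000 \
    | ./maptree.py | sort | ./reducetree.py \
    | ./maptree.py | sort | ./reducetree.py > round2.txt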
You can run that under Hadoop with the streaming module. I ran it on
EC2 with 6 m1.large instances and got through it all in about 2.5
hours. The first two rounds took over an hour each; after that it went
pretty fast. The algorithm should parallelize up to maybe 10-100k
nodes, so it can run about as fast as you're willing to pay for. The
downside to Hadoop is the hassle of getting it working, and that you
need to move the 1+GB git repository over to your cluster (which isn't
terribly fast in my case - even with FIOS and EC2). That is a fixed
overhead we really can't do much about.
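One round under streaming is shaped roughly like this - the jar
location and the input/output paths depend on your install, so treat
it as a sketch rather than a recipe:

  # Streaming runs the same scripts as mapper and reducer and handles
  # the sort/shuffle between them; -file ships the scripts to the nodes.
  hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
    -input gitvalidate/round1 \
    -output gitvalidate/round2 \
    -mapper maptree.py \
    -reducer reducetree.py \
    -file maptree.py \
    -file reducetree.py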
I'm going to set aside the git part of the problem for now, as that is
just optimization. The next step would be to find some easy way to run
this in parallel without Hadoop. For anybody interested in taking a
whack at that: you can chop up the input at line boundaries and run
the map steps in parallel as widely as you want. You then need to
cat/sort the output of all the maps. The output of that can be split
going into the reduce step, but you have to keep lines with the same
key together (the first tab-delimited column). I could probably do
that on the cheap with some bash scripting and GNU parallel - just set
up temp dirs for input/output and dump every job into a separate file.
I suspect that Hadoop will be hard to beat if you want to cluster it,
but for a single node you could make it a lot simpler.
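A single-node sketch of that idea (block size and file names are
arbitrary):

  # parallel --pipe chops stdin into blocks at line boundaries and runs
  # one maptree.py per block; the single sort + reduce at the end keeps
  # all lines with the same key together without any partitioning logic.
  ./parsetrees.py 2000000 \
    | parallel --pipe --block 32M ./maptree.py \
    | sort \
    | ./reducetree.py > round1.txt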
The next step is to get cvs into a similar format. My initial
thoughts:

1. Just run cvs log in the root, chop the output up at file
boundaries, base64-encode each per-file blob, and dump that into a
text file with one file per line (a rough sketch follows the list).
2. Distribute processing of each file, turning it into one line per
commit with all the info my git program dumps, except for the file
hash.
3. Distribute reading those lines, checking out that one version of
one file, calculating the hash, and outputting the full info.
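Step 1 might be as dumb as this - the "RCS file:" marker, GNU
csplit/base64, and the file names are all assumptions at this point:

  # Run from the root of a CVS checkout. Each per-file chunk of the log
  # becomes one base64 line, so later steps can treat it like the
  # line-oriented git dumps.
  cvs -q log > full.log
  csplit -z full.log '/^RCS file:/' '{*}'   # one xx* piece per file
  for f in xx*; do base64 -w0 "$f"; echo; done > cvslog-lines.txt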
Steps 1 and 2 might be cheaper to just combine, since you have to scan
the whole thing to chop it up anyway and the parsing can't be THAT
expensive.
If there are libs to make any of this easier I'm all ears, but it
seems like there isn't much out there for CVS - nothing like pygit2.
Once I have both I can start working on validation rules and perhaps
get feedback to the conversion team. We'll need to work out what does
and doesn't count as OK. We're transforming data during the migration,
so I need to take that into account: either the logic goes into the
compare function, or it goes into the dump side so that the compares
work out the same. Timestamps might force us to do logic during the
compare anyway.
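If the dump side does all the normalization, the compare itself can
stay trivial - something like this (file names made up):

  # This only works if both sides emit identical, normalized lines for
  # matching content; any timestamp fudging either happens before this
  # point or forces a smarter compare script.
  sort git-dump.txt > git-sorted.txt
  sort cvs-dump.txt > cvs-sorted.txt
  diff -u git-sorted.txt cvs-sorted.txt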
So, that's all for now. If anybody wants my notes for running Hadoop,
let me know - maybe I'll digest them into a blog entry. However, it
would be worthwhile to see if we can ditch Hadoop. If we are stuck
with it, then we can probably get the run time down to 30 minutes with
fewer than 100 nodes.

Rich