Gentoo Archives: gentoo-scm

From: Rich Freeman <rich0@g.o>
To: gentoo-scm@l.g.o
Subject: Re: [gentoo-scm] Bizarre Python Issue with Validation
Date: Mon, 15 Oct 2012 11:14:00
Message-Id: CAGfcS_=ZnpDXV8VXEMxkFXLkR13tNmGOiAQxdYNsbFqTA-J9sA@mail.gmail.com
In Reply to: Re: [gentoo-scm] Bizarre Python Issue with Validation by Brian Harring
On Sun, Oct 14, 2012 at 8:08 PM, Brian Harring <ferringb@×××××.com> wrote:
> Probably not the answer you want... but I suggest you valgrind it. If
> the behaviour is differing for Repository()[some-hash], that makes me
> think there is a ref counting bug afoot.

Ugh.

>
> Also, wtf is dumbo?

It is a simplified framework for doing parallel jobs that can run on
Hadoop. That said, after finding some documentation on running Python
on Hadoop directly, I'm considering just going that route (it should
be easy to change).

> And can you clarify how this is doing its
> validation, rough runtime, etc?

See my earlier posts on this list. Basically the goal is to break
both git/cvs into sorted lists of files and their individual histories
and then compare them, one commit per line. Due to how Hadoop works
I'm just using CSV with a few fields in base64.

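For illustration, the round-trip for one such record could look like
this (the field names are my invention, not what the script actually
uses - the point is just that the binary-unsafe field goes through
base64 so it can't break the CSV line):

```python
import base64
import csv
import io

# Hypothetical record layout: (path, timestamp, payload), where payload
# may contain newlines or commas and is therefore base64-encoded.
def encode_record(path, timestamp, payload):
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(
        [path, timestamp, base64.b64encode(payload).decode("ascii")]
    )
    return buf.getvalue()

def decode_record(line):
    path, timestamp, b64 = next(csv.reader([line]))
    return path, timestamp, base64.b64decode(b64)
```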
If you're looking at the Python file I linked to, keep in mind that
what it does is take in a list of trees/blobs in such a format,
descend one level on each, remove duplicates, and output a new list
of trees/blobs. That would be iterated. Since each commit typically
only changes a few files at the bottom of the tree, the list would
not be expected to grow much on each iteration (all but one entry at
the top level of the tree would be a duplicate from one commit to the
next). However, if several things did change, they'd end up in the
list.

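Roughly, that step amounts to the following (a sketch, not the actual
script - `ls_tree` stands in for a real git tree lookup and the
(kind, sha) pair shape is my assumption):

```python
def descend_one_level(entries, ls_tree):
    """entries: iterable of (kind, sha) pairs. Each tree is expanded one
    level via ls_tree(sha) -> list of (kind, sha); blobs pass through
    unchanged. Duplicates are dropped, so a subtree shared between two
    commits is only carried forward once."""
    seen = set()
    out = []
    for kind, sha in entries:
        children = ls_tree(sha) if kind == "tree" else [(kind, sha)]
        for child in children:
            if child not in seen:
                seen.add(child)
                out.append(child)
    return out
```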
This treatment allows each record to be treated entirely in parallel.
You start out with 1.2M of them, which is how many commits there are
in the git repository (at least the one I'm testing on); that grows
5-40x after the map step and shrinks to just over 1x in the reduce
step. When you run a step and either end up with all blobs or
unchanged output, you know you're done (not sure what the fastest way
to detect that condition is - the list isn't THAT big, so grep is
probably good enough on the interim files).

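The stopping condition itself is trivial either way; assuming each
record leads with its object type (my assumption about the format),
something like this would do the same job as grep on the interim
files:

```python
def is_done(prev_lines, next_lines):
    # Two ways to be finished: the list stopped changing between
    # iterations, or there is nothing left to expand (all blobs,
    # no "tree" records remaining).
    if prev_lines == next_lines:
        return True
    return not any(line.startswith("tree,") for line in next_lines)
```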
> Minimally, it's going to need to go
> parallel- I rebuilt the conversion bits (parallelized along category
> lines, then reintegrating after the fact), and it's around 50m for
> run- validation being equivalent would definitely be preferable.

That was my concern as well. The solution is parallel, and I'm looking
to run it on Hadoop. In theory the cvs side should be easier to do -
I'm running into more issues with the tools than the software. It
works just fine on a few thousand commits with my last design.
Actually, once I ditch the dumbo framework I might see how it does
without a cluster - I was running into RAM issues before, and the new
design would operate as a pipe. As long as you chop up the input file
at line boundaries to feed it to the map step, and concatenate/sort
the output of the map step to feed it to reduce, anything that runs
the script in parallel would work.

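Run in serial, that pipe is just the usual map/sort/reduce shape. A
minimal sketch (the function names are placeholders, not the actual
script):

```python
import itertools

def run_serial(lines, map_fn, reduce_fn):
    # map_fn(line) yields (key, value) pairs. Sorting brings equal keys
    # together - the serial stand-in for Hadoop's shuffle - so reduce_fn
    # sees each key exactly once with all of its values.
    mapped = sorted(kv for line in lines for kv in map_fn(line))
    return [
        reduce_fn(key, [v for _, v in group])
        for key, group in itertools.groupby(mapped, key=lambda kv: kv[0])
    ]
```

Anything that preserves those two properties - split the input at line
boundaries for map, sort before reduce - can parallelize the same
script.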
In fact, one of my concerns is just the time it takes to transfer that
2GB git repository over to a cluster and get everything running -
especially if you've got the conversion time down to 50m (I heard we
were at 8h previously). I was hoping to have the testing done in
under an hour.

It shouldn't take more than a few days for me to get around to
modifying my script to not require dumbo and just testing it out on
datasets in serial. If I can get the speed up, we might just be able
to run it on whatever host you're using for the conversion, and that
would eliminate a lot of hassle.

>
> @mhagger; details of that I'll share in a bit- roughly, it's
> exploiting the fact gentoo-x86 /never/ has cross category commits (no
> one does repo wide commits, although detection/merging of that I'm
> checking for).

Well, my validation only checks that the per-file history is correct,
which I think is all cvs really lets you do anyway. So, it won't
distinguish between a->a' then b->b' vs (a,b)->(a',b'). However, if
your code is somehow losing changes, it will certainly find that.
Also, if the timestamps on two commits are the same, it will not
distinguish between the order of those two commits, but I don't think
that info is really in cvs either. I haven't gotten far enough to
discover if there are any issues with time resolution/conversion/etc.

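The comparison then reduces to something like this (the
{path: [(timestamp, content-hash)]} shape is my assumption about the
intermediate data; sorting each list is what makes the check
order-insensitive for commits sharing a timestamp, per the caveat
above):

```python
def compare_histories(git_hist, cvs_hist):
    # Both sides map path -> list of (timestamp, content-hash) pairs.
    # A path is flagged if its histories differ in any way other than
    # the ordering of same-timestamp commits.
    mismatched = []
    for path in sorted(set(git_hist) | set(cvs_hist)):
        if sorted(git_hist.get(path, [])) != sorted(cvs_hist.get(path, [])):
            mismatched.append(path)
    return mismatched
```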
I'm certainly interested in seeing what you came up with. I take it
that you're writing something from scratch here?

Rich