Gentoo Archives: gentoo-scm

From: Rich Freeman <rich0@g.o>
To: gentoo-scm@l.g.o
Subject: Re: [gentoo-scm] Re: [gentoo-dev] CVS -> git, list of where non-infra folk can contribute
Date: Tue, 02 Oct 2012 23:38:47
Message-Id: CAGfcS_m2126tn9vrTdvMRJhZV0O1npqLWDk4styyzbGCR=8xKA@mail.gmail.com
In Reply to: Re: [gentoo-scm] Re: [gentoo-dev] CVS -> git, list of where non-infra folk can contribute by Peter Stuge
On Tue, Oct 2, 2012 at 5:39 PM, Peter Stuge <peter@×××××.se> wrote:
> Rich Freeman wrote:
>> That means that instead of descending the entire tree for every
>> commit you only actually descend the branches that have changes
>> on each commit, which in most cases will just be a single branch
>> anyway.
>
> What tree is that? If you're talking about the git DAG that doesn't
> make sense to me at all. :\

See my earlier email in the archives. My proposal for validating the
conversion is to go through both repositories and generate a list of
all files in the repo (active and deleted) which contains:
path/file, hash, date/time, author, commit comment

Then comparing them should be easy.

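Once both dumps exist, the comparison itself could be as simple as a
set difference (a sketch over tuples of the fields listed above; the
function and variable names are mine, purely for illustration):

```python
def diff_histories(cvs_dump, git_dump):
    """Report records present in one conversion but not the other.
    Each record is a (path, hash, date, author, comment) tuple."""
    cvs, git = set(cvs_dump), set(git_dump)
    return sorted(cvs - git), sorted(git - cvs)

# Toy dumps: the second conversion has one extra record.
a = [("foo", "h1", "d1", "alice", "add foo")]
b = [("foo", "h1", "d1", "alice", "add foo"),
     ("bar", "h2", "d2", "bob", "add bar")]
only_cvs, only_git = diff_histories(a, b)
# only_git contains the bar record; only_cvs is empty.
```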
For git the conceptual way to do this is to take each commit, walk
its tree, and generate a dump of all the files in the tree with the
info above. Do this for each commit in one big list, sort the list by
file and date/time, and then delete all but the first line in cases
where consecutive lines have the same hash.

However, this involves walking the entire tree for every commit, which
is a lot of unnecessary work, since in any given commit little has
actually changed.

So, instead you pass through the list in stages. The first stage
involves walking the linear list of commits (fortunately no branching
here), and dumping a list like the one above, except instead of
path/file you have trees.

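A minimal sketch of that first stage, assuming the commit metadata has
already been extracted (in practice it would come from `git log` or
`git cat-file`; the field names and toy data here are my own):

```python
# Toy stand-in for the linear commit list. Each commit carries its
# root tree hash plus the metadata we want to track per file.
commits = [
    {"tree": "t1", "date": "2012-10-01T10:00", "author": "alice", "comment": "add foo"},
    {"tree": "t2", "date": "2012-10-02T11:00", "author": "bob",   "comment": "edit foo"},
]

def stage_one(commits):
    """Dump one record per commit: the root tree plus its metadata.
    Record layout: (kind, hash, path-so-far, date, author, comment)."""
    return [("tree", c["tree"], "", c["date"], c["author"], c["comment"])
            for c in commits]

records = stage_one(commits)
```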
At each subsequent step the algorithm is:
1. Read a line.
2. If it is a blob, output the same line.
3. If it is a tree, read the tree and output one line for each of its
contents, with all the info but the path/file and hash remaining the
same as what was on the line that was read. Be sure to insert the
current directory into the path for each line.
4. Go to step 1 until the entire file is read.
5. Sort the output by file and then date/time. Delete all but the
first line for any cases where consecutive lines have the same hash.

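One level of that expansion (steps 1-4) might look like this in
Python, over the same record tuples as above; the tree contents are a
toy in-memory mapping here, where a real run would read them from the
repository with `git cat-file`:

```python
# Toy object store: tree hash -> list of (name, kind, hash) entries.
trees = {
    "t1": [("foo.txt", "blob", "b1"), ("sub", "tree", "t3")],
    "t3": [("bar.txt", "blob", "b2")],
}

def expand_one_level(records, trees):
    """Pass over the list once, replacing each tree record with one
    record per entry, prefixing the current directory onto the path
    and copying the other fields unchanged."""
    out = []
    for kind, sha, path, date, author, comment in records:
        if kind == "blob":
            out.append((kind, sha, path, date, author, comment))
        else:  # tree: one output line per entry, one level deeper
            for name, child_kind, child_sha in trees[sha]:
                child_path = f"{path}/{name}" if path else name
                out.append((child_kind, child_sha, child_path,
                            date, author, comment))
    return out

recs = [("tree", "t1", "", "2012-10-01", "alice", "add foo")]
recs = expand_one_level(recs, trees)  # foo.txt blob + sub tree
recs = expand_one_level(recs, trees)  # sub/bar.txt blob
```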
Each time you do this you descend the entire tree by one level,
discarding cases where, from commit to commit, a section of the tree
contained no changes. It is important to be careful with the sorting,
because if a tree changed from A to B and back to A you don't want to
lose one of the As, since each would have a unique
time/reason/author/etc.
You then repeat the entire process until you go through the entire
file and there are no trees left. Then, after the sort/trim step, you
have a complete per-file history for the entire repository.

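Step 5, the sort/trim, could be sketched as follows (toy records
again; note the sort key is (path, date), which is exactly why an
A -> B -> A sequence keeps both A lines - they are not consecutive
after the sort):

```python
def sort_and_trim(records):
    """Sort by path then date/time, then drop a line only when the
    immediately preceding line for the same path has the same hash."""
    records = sorted(records, key=lambda r: (r[2], r[3]))
    out = []
    for rec in records:
        if out and out[-1][2] == rec[2] and out[-1][1] == rec[1]:
            continue  # same path, same hash as previous line: no change
        out.append(rec)
    return out

# Toy per-file records: (kind, hash, path, date, author, comment)
recs = [
    ("blob", "A", "foo", "d1", "alice", "add"),
    ("blob", "A", "foo", "d2", "bob",   "unrelated commit"),
    ("blob", "B", "foo", "d3", "carol", "edit"),
    ("blob", "A", "foo", "d4", "dave",  "revert"),
]
history = sort_and_trim(recs)
# Keeps d1 (A), d3 (B), d4 (A): the revert back to A survives.
```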
I need to look at the CVS case next. However, I think CVS basically
stores its history per-file anyway, so that is probably just a matter
of walking the entire repository once and dumping the data in the
same format.

Where I use the word "line" above, that could be XML, a serialized
array of some sort, or whatever. It would probably be best not to do
this in a plain text file, to avoid issues with delimiters, unicode,
etc. However, if there were a delimiter we could be certain appears
nowhere in the repository, a text file would work fine.

That algorithm should work fine in parallel - just chop up the list
and have each worker process its share. The sort/trim step does have
to be global, but as long as you sort the whole list and chop it up
on file boundaries you could actually distribute the trim part as
well. It really does look like a Map/Reduce from the perspective of
the list, but the workers need access to the git repository to
"parse" the list, and I'm not sure that Hadoop/etc lets you do that
easily. The map function is essentially a simple function that
happens to have a 2GB "lookup table" inside.

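The chop-on-file-boundaries idea could be sketched like this (pure
illustration with made-up names; a real setup would hand each chunk
to a separate worker process):

```python
from itertools import groupby

def chunks_on_file_boundaries(sorted_records, n_workers):
    """Split a globally sorted record list into per-file groups and
    deal the groups round-robin to workers, so no file's history is
    split across two workers and the trim stays correct per chunk."""
    chunks = [[] for _ in range(n_workers)]
    for i, (path, group) in enumerate(
            groupby(sorted_records, key=lambda r: r[2])):
        chunks[i % n_workers].extend(group)
    return chunks

# Records already sorted by path: (kind, hash, path, date, author, comment)
recs = [
    ("blob", "A", "a.txt", "d1", "x", "c"),
    ("blob", "B", "a.txt", "d2", "x", "c"),
    ("blob", "C", "b.txt", "d1", "x", "c"),
]
parts = chunks_on_file_boundaries(recs, 2)
# a.txt's two records land in one chunk, b.txt's in the other.
```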
Rich