On Tue, Oct 2, 2012 at 5:39 PM, Peter Stuge <peter@×××××.se> wrote:
> Rich Freeman wrote:
>> That means that instead of descending the entire tree for every
>> commit you only actually descend the branches that have changes
>> on each commit, which in most cases will just be a single branch
>> anyway.
>
> What tree is that? If you're talking about the git DAG that doesn't
> make sense to me at all. :\

See my earlier email in the archives. My proposal for validating the
conversion is to go through both repositories and, for all files in
the repo (active and deleted), generate a list containing:
path/file, hash, date/time, author, commit comment

Then comparing them should be easy.
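
A minimal sketch of that comparison, assuming each dump is a list of
records shaped (path, hash, date/time, author, comment). All the names
and sample data here are illustrative, not from an actual conversion run:

```python
def diff_histories(git_records, cvs_records):
    """Return the records unique to each dump; two empty sets means they match."""
    git_set, cvs_set = set(git_records), set(cvs_records)
    return git_set - cvs_set, cvs_set - git_set

# Hypothetical one-record dumps from each repository.
git_dump = [("foo/bar.c", "ab12", "2012-10-02 17:39", "rich", "fix build")]
cvs_dump = [("foo/bar.c", "ab12", "2012-10-02 17:39", "rich", "fix build")]

only_git, only_cvs = diff_histories(git_dump, cvs_dump)
assert not only_git and not only_cvs  # identical dumps -> conversion validated
```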

For git, the conceptual way to do this is to take each commit, walk
its tree, and generate a dump of all the files in the tree with the
info above. Do this for every commit, appending to one big list; sort
the list by file and date/time, and then delete all but the first line
wherever consecutive lines have the same hash.
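
The sort/trim step could look something like this (record layout as
above, with date/time at index 2; the field names are assumptions for
illustration):

```python
def sort_and_trim(records):
    """Sort by (path, date/time), then keep only the first of any run of
    consecutive lines with the same hash."""
    records = sorted(records, key=lambda r: (r[0], r[2]))
    out = []
    for rec in records:
        # rec[0] is path, rec[1] is hash: drop the line only if the
        # previous kept line is the same file with the same content.
        if not out or out[-1][0] != rec[0] or out[-1][1] != rec[1]:
            out.append(rec)
    return out

# Toy history: file "f" goes A -> A (no real change) -> B -> back to A.
recs = [
    ("f", "A", 1, "rich", "add file"),
    ("f", "A", 2, "rich", "unrelated commit"),  # same hash -> trimmed
    ("f", "B", 3, "rich", "change it"),
    ("f", "A", 4, "rich", "revert"),            # back to A -> kept
]
assert [r[2] for r in sort_and_trim(recs)] == [1, 3, 4]
```

Note that the revert back to A survives, because only *consecutive*
duplicates are dropped - which matters for the A-to-B-to-A case
discussed below.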

However, this involves walking the entire tree for every commit, which
is a lot of unnecessary work, since in any given commit little has
actually changed.

So, instead, you pass through the list in stages. The first stage
involves walking the linear list of commits (fortunately no branching
here) and dumping a list like the one above, except that instead of
path/file you have tree hashes.

At each subsequent step the algorithm is:
1. Read a line.
2. If it is a blob, output the same line.
3. If it is a tree, read the tree and output one line for each of its
contents, with all the info except the path/file and hash remaining
the same as on the line that was read. Be sure to insert the current
directory into the path for each line.
4. Go to step 1 until the entire file is read.
5. Sort the output by file and then date/time. Delete all but the
first line wherever consecutive lines have the same hash.
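
One expansion pass over the list can be sketched as follows. A real
run would read trees out of the repository (via `git cat-file` or a
library such as pygit2); the dict here is a toy in-memory stand-in for
the object store, and all hashes and paths are made up:

```python
# Fake object store: tree hash -> list of (kind, hash, name) entries.
TREES = {
    "t_root": [("tree", "t_src", "src"), ("blob", "b_readme", "README")],
    "t_src":  [("blob", "b_main", "main.c")],
}

def expand_once(lines):
    """One pass: copy blob lines through; replace each tree line with one
    line per entry, prefixing the tree's directory onto each path and
    carrying the commit metadata along unchanged."""
    out = []
    for kind, sha, path, meta in lines:
        if kind == "blob":
            out.append((kind, sha, path, meta))
        else:
            for ekind, esha, ename in TREES[sha]:
                out.append((ekind, esha, f"{path}/{ename}".lstrip("/"), meta))
    return out

# Stage one produced one line per commit, pointing at the root tree.
lines = [("tree", "t_root", "", ("2012-10-02", "rich", "initial import"))]

# Repeat until no trees remain, i.e. the whole tree has been descended.
while any(kind == "tree" for kind, *_ in lines):
    lines = expand_once(lines)

assert [path for _, _, path, _ in lines] == ["src/main.c", "README"]
```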

Each time you do this you descend one level deeper into the tree,
discarding the cases where a section of the tree contained no changes
from commit to commit. It is important to be careful with the sorting,
because if a tree changed from A to B and back to A you don't want to
lose one of the As, since it would have a unique date/time, author,
comment, etc.

You then repeat the entire process until you can go through the entire
file and find no trees left. Then, after the sort/trim step, you have
a complete per-file history for the entire repository.

I need to look at the CVS case next. However, I think CVS basically
stores its history per-file anyway, so that is probably just a matter
of walking the entire repository once and dumping the data in the same
format.

Where I use the word "line" above, that could be XML, a serialized
array of some sort, or whatever. It would probably be best not to do
this in a plain text file, to avoid issues with delimiters, Unicode,
etc. However, if there were a delimiter we could be certain appears
nowhere in the repository, a text file would work fine.
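
One way to sidestep the delimiter problem entirely is one JSON object
per line, since JSON escapes any characters that would otherwise
collide with a field separator (the field names below are just one
possible choice):

```python
import json

record = {
    "path": "dir/file.c",
    "hash": "ab12",
    "date": "2012-10-02 17:39",
    "author": "rich",
    "comment": "a comment with | pipes\tand\ttabs in it",
}

# Serialize to a single line; awkward characters are escaped, so newline
# remains a safe record delimiter.
line = json.dumps(record)
assert "\t" not in line and json.loads(line) == record
```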

That algorithm should work fine in parallel - just chop up the list
and have each worker process its chunk. The sort/trim step does have
to be global, but as long as you sort the whole list and chop it up on
file boundaries you could actually distribute the trim part as well.
It really does seem like a Map/Reduce from the perspective of the
list, but the workers need access to the git repository to "parse" the
list, and I'm not sure Hadoop/etc lets you do that easily. The map
function is essentially a simple function that happens to have a 2GB
"lookup table" inside.

Rich