On Mon, Oct 1, 2012 at 9:46 PM, Michael Mol <mikemol@×××××.com> wrote:
> This was on -dev a while ago. Was there nothing additional?

Not yet - there was a bunch of good conversation on IRC over the last
few hours.

Note that this list is archived on gmane, and it is probably worth
reviewing for past discussions.

>
> Diego had a fair point about having to checkout a crapton of git
> history. I think that could be significantly improved with a seed
> tarball on a torrent, particularly if the seed tarball is updated
> annually (or biannually).

Initial distribution will be via a tarball of some kind (not sure if
it is just a tarball of the repository, or in an importable git
format) to reduce server load. It would be available with or without
history. I'm not convinced that you need the full history in order to
commit. I would think that as long as git could fast-forward from
whatever Gentoo considers the head to whatever you consider the head,
you'd be fine.

From IRC, the big issues at the moment are that the current migration
is known to drop some old content from the history, and that the
converted content is hard to validate.

Validation is tricky because CVS stores history per file, while git
stores a history of the head, where each commit covers every file in
the tree. Traversing each in its own natural order makes them really
hard to compare. However, this really seems like a good problem for
something like MapReduce.

If you traverse git for each commit you should be able to output:
    path/file hash date/time committer message

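A rough sketch of what that dump could look like in Python, driving
"git log --raw". The record layout, the "commit" header marker and the
function names are my own choices for illustration, not anything the
migration scripts actually use, and paths with unusual characters
(which git quotes) are not handled:

```python
import subprocess
from typing import Iterator, NamedTuple


class Record(NamedTuple):
    path: str
    blob: str       # blob hash of the file after the commit
    date: str       # committer date, strict ISO 8601
    committer: str
    message: str    # subject line only


# One header line per commit, followed by git's raw diff lines.
LOG_CMD = ["git", "log", "--raw", "--no-abbrev", "--no-renames",
           "--format=commit%x09%cI%x09%cn%x09%s"]
HEADER = "commit\t"


def parse_raw_log(text: str) -> Iterator[Record]:
    """Turn the output of LOG_CMD into one Record per file touched."""
    date = committer = message = ""
    for line in text.split("\n"):
        if line.startswith(HEADER):
            date, committer, message = line[len(HEADER):].split("\t", 2)
        elif line.startswith(":"):
            # ":<old mode> <new mode> <old blob> <new blob> <status>\t<path>"
            meta, path = line.split("\t", 1)
            _omode, _nmode, _oblob, new_blob, _status = meta[1:].split(" ")
            yield Record(path, new_blob, date, committer, message)


def dump_git(repo: str) -> list:
    """Run git in `repo` and return every (file, commit) record."""
    out = subprocess.run(LOG_CMD, cwd=repo, check=True,
                         capture_output=True, text=True).stdout
    return list(parse_raw_log(out))
```

Deleted files show up with an all-zero blob hash, and merge commits
produce no raw lines by default, so both would need a decision before
comparing against CVS.
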
Then you'd sort that by the first three fields, and drop every line
whose first two fields (path and hash) match the line before it. That
then gives you a sorted commit history per file. Then you could
traverse CVS and get the same (the output of cvs log per-file should
do that). Then you compare the two lists.

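A minimal sketch of the sort, de-duplicate and compare steps, assuming
both sides have already been dumped as (path, hash, date, committer,
message) tuples with dates as ISO 8601 strings in one timezone, and
that the CVS hashes were computed the same way as git's. The function
names are made up. The final re-sort by date is my addition: sorting
by hash alone leaves each file's revisions in hash order rather than
time order.

```python
from itertools import groupby


def per_file_history(records):
    """Return {path: [record, ...]} with one record per distinct content."""
    # Sort by the first three fields: path, hash, date.
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    # Keep only the first (earliest) line for each (path, hash) pair.
    firsts = [next(group)
              for _, group in groupby(ordered, key=lambda r: (r[0], r[1]))]
    # Assumption: put each file's surviving lines back in date order.
    history = {}
    for rec in sorted(firsts, key=lambda r: (r[0], r[2])):
        history.setdefault(rec[0], []).append(rec)
    return history


def compare(git_history, cvs_history):
    """Return the sorted paths whose sequence of hashes differs."""
    mismatched = []
    for path in sorted(set(git_history) | set(cvs_history)):
        git_hashes = [r[1] for r in git_history.get(path, [])]
        cvs_hashes = [r[1] for r in cvs_history.get(path, [])]
        if git_hashes != cvs_hashes:
            mismatched.append(path)
    return mismatched
```

One thing this makes visible: a file that is changed and later
reverted to identical content appears only once, since the second
occurrence has the same path and hash.
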
You could do the dumps for all the git commits in parallel, and
sorting is parallel as well (basically automatic with MapReduce, I
think). For CVS you should be able to do each file in parallel, and
again the sorting is parallel.

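To show the shape of the job, here is a toy map/shuffle/reduce in
plain Python, with a thread pool standing in for a real cluster. This
is not how an actual MapReduce framework is invoked, just the same
three phases under the assumption that records are (path, hash, date,
committer, message) tuples; all names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(record):
    """Emit (key, value): key is the path, value leads with the date."""
    path, blob, date, committer, message = record
    return path, (date, blob, committer, message)


def reduce_phase(item):
    """For one path, keep the earliest line for each distinct hash."""
    path, values = item
    history, seen = [], set()
    for value in sorted(values):        # sorts by date first
        if value[1] not in seen:
            seen.add(value[1])
            history.append(value)
    return path, history


def run_job(records, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Shuffle: group mapped values by key (a framework does this).
        shuffled = defaultdict(list)
        for path, value in pool.map(map_phase, records):
            shuffled[path].append(value)
        # Each path is reduced independently, so this parallelizes too.
        return dict(pool.map(reduce_phase, shuffled.items()))
```

The same job would run once over the git dump and once over the CVS
dump, leaving two per-path results to compare.
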
Note I have zero experience with MapReduce. However, it seems like a
really fun project. I need to learn more about the "logical" model
behind CVS. I understand it fairly well for git and that is a thing
of beauty. You don't even need to calculate the hashes for git since
they're already stored in the tree.

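The CVS side has no stored hashes, so they would have to be computed
from each checked-out revision. A small sketch of producing the same
value git stores for a blob (SHA-1 over a "blob <size>" header, a NUL
byte, and the content); the function name is mine, and it assumes the
CVS checkout is byte-identical to what went into git, which keyword
expansion could break:

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Hash file content the way git does for a blob object."""
    header = f"blob {len(data)}\0".encode("ascii")
    return hashlib.sha1(header + data).hexdigest()


# The empty file is a handy sanity check against `git hash-object`.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```
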
Rich