Gentoo Archives: gentoo-scm

From: Rich Freeman <rich0@g.o>
To: gentoo-scm@l.g.o
Subject: Re: [gentoo-scm] cvs/irker/git thread from -dev
Date: Tue, 02 Oct 2012 02:09:44
Message-Id: CAGfcS_mtnu+Mb9-CxUq=sr7p9p=-3GqCkzzZLOmOThPyckEo_w@mail.gmail.com
In Reply to: Re: [gentoo-scm] cvs/irker/git thread from -dev by Michael Mol
1 On Mon, Oct 1, 2012 at 9:46 PM, Michael Mol <mikemol@×××××.com> wrote:
2 > This was on -dev a while ago. Was there nothing additional?
3
4 Not yet - there was a good deal of useful conversation on IRC over the
5 last few hours.
6
7 Note that this list is archived on gmane, and it is probably worth
8 reviewing for past discussions.
9
10 >
11 > Diego had a fair point about having to checkout a crapton of git
12 > history. I think that could be significantly improved with a seed
13 > tarball on a torrent, particularly if the seed tarball is updated
14 > annually (or biannually).
15
16 Initial distribution will be via a tarball of some kind (I'm not sure
17 whether it will be a plain tarball of the repository or something in an
18 importable git format) to reduce server load. It would be available
19 with or without history. I'm not actually certain that you need the
20 full history in order to commit. I would think that as long as git can
21 fast-forward from whatever Gentoo considers the head to whatever you
22 consider the head, you'd be fine.
23
24 From IRC, the big issues at the moment are that the current migration
25 is known to drop some old content from the history, and that the
26 migrated content is hard to validate.
27
28 Validation is tricky because CVS stores history per file, while git
29 stores a history of the head, where each commit incorporates all the
30 files in the tree. Traversing each in its natural order therefore makes
31 them really hard to compare. However, this seems like a good problem
32 for something like MapReduce.
33
34 If you traverse git commit by commit, you should be able to output, for each file:
35 path/file hash date/time committer message
36
37 Then you'd sort that by the first three fields, and drop every line
38 whose first two fields match the line before it. That gives you a sorted
39 commit history per file. Then you could traverse CVS and get the same
40 (the output of cvs log per file should do that). Then you compare the
41 two lists.
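
The sort, dedupe, and compare steps could be sketched roughly as follows. This is only an illustration: the function names and the tuple layout (path, hash, timestamp, committer, message) are made up here, and it assumes the CVS side has been given content hashes computed the same way as on the git side, which cvs log does not provide by itself.

```python
from itertools import groupby


def per_file_history(records):
    """Sort records and keep the earliest one for each (path, hash) pair.

    Each record is a hypothetical tuple:
    (path, content_hash, timestamp, committer, message)
    """
    # Sort by the first three fields: path, hash, timestamp.
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    # Drop every line whose first two fields match the previous line,
    # i.e. keep only the first record of each (path, hash) group.
    return [next(group) for _, group in groupby(ordered, key=lambda r: (r[0], r[1]))]


def compare_histories(git_records, cvs_records):
    """Return the (path, hash) pairs that appear on only one side."""
    git_keys = {(r[0], r[1]) for r in per_file_history(git_records)}
    cvs_keys = {(r[0], r[1]) for r in per_file_history(cvs_records)}
    only_in_git = sorted(git_keys - cvs_keys)
    only_in_cvs = sorted(cvs_keys - git_keys)
    return only_in_git, only_in_cvs
```

One caveat with this approach: if a file is changed and later reverted to identical content, the two states share a hash and collapse into a single line, so the result is a list of distinct file states rather than of every change.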
42
43 You could do the dumps for all the git commits in parallel, and
44 sorting is parallel as well (basically automatic with MapReduce, I
45 think). For CVS you should be able to do each file in parallel, and
46 again the sorting is parallel.
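
As a rough illustration of the map/reduce shape for the git side, here is a small single-machine sketch using a thread pool. It is not a real MapReduce job. The commit dictionaries (with "timestamp", "committer", "message", and a "files" mapping of path to hash) are a hypothetical input format; in practice they would have to be produced by walking the repository.

```python
from concurrent.futures import ThreadPoolExecutor


def map_commit(commit):
    """Map step: emit ((path, hash), metadata) for every file in one commit."""
    meta = (commit["timestamp"], commit["committer"], commit["message"])
    return [((path, blob), meta) for path, blob in commit["files"].items()]


def reduce_earliest(pairs):
    """Reduce step: for each (path, hash) key keep the earliest metadata."""
    earliest = {}
    for key, meta in pairs:
        if key not in earliest or meta[0] < earliest[key][0]:
            earliest[key] = meta
    return earliest


def file_history(commits, workers=4):
    """Run the map step over all commits in parallel, then reduce and sort."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(map_commit, commits)
        pairs = [pair for chunk in mapped for pair in chunk]
    earliest = reduce_earliest(pairs)
    # Flatten to (path, hash, timestamp, committer, message) and sort.
    return sorted((path, blob) + meta for (path, blob), meta in earliest.items())
```

The commits can be processed in any order because the reduce step only keeps the minimum timestamp per key, which is what makes the dump step easy to run in parallel.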
47
48 Note that I have zero experience with MapReduce. However, it seems like
49 a really fun project. I need to learn more about the "logical" model
50 behind CVS. I understand git's model fairly well, and it is a thing
51 of beauty. You don't even need to calculate the hashes for git since
52 they're already stored in the tree.
53
54 Rich