On Tue, Oct 2, 2012 at 5:39 PM, Peter Stuge <peter@×××××.se> wrote:
> Rich Freeman wrote:
>> That means that instead of descending the entire tree for every
>> commit you only actually descend the branches that have changes
>> on each commit, which in most cases will just be a single branch
>> anyway.
>
> What tree is that? If you're talking about the git DAG that doesn't
> make sense to me at all. :\

See my earlier email in the archives. My proposal for validating the
conversion is to go through both repositories and, for all files in
the repo (active and deleted), generate a list containing:
path/file, hash, date/time, author, commit comment

Then comparing them should be easy.
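
A minimal sketch of that comparison, assuming each dump is a list of
records shaped (path, hash, date/time, author, comment). All the names
and sample data here are illustrative, not from an actual conversion run:

```python
def diff_histories(git_records, cvs_records):
    """Return the records unique to each dump; two empty sets means they match."""
    git_set, cvs_set = set(git_records), set(cvs_records)
    return git_set - cvs_set, cvs_set - git_set

# Hypothetical one-record dumps from each repository.
git_dump = [("foo/bar.c", "ab12", "2012-10-02 17:39", "rich", "fix build")]
cvs_dump = [("foo/bar.c", "ab12", "2012-10-02 17:39", "rich", "fix build")]

only_git, only_cvs = diff_histories(git_dump, cvs_dump)
assert not only_git and not only_cvs  # identical dumps -> conversion validated
```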

For git, the conceptual way to do this is to take each commit, walk
its tree, and generate a dump of all the files in the tree with the
info above. Do this for every commit, appending to one big list; sort
the list by file and date/time, and then delete all but the first line
wherever consecutive lines have the same hash.
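
The sort/trim step could look something like this (record layout as
above, with date/time at index 2; the field names are assumptions for
illustration):

```python
def sort_and_trim(records):
    """Sort by (path, date/time), then keep only the first of any run of
    consecutive lines with the same hash."""
    records = sorted(records, key=lambda r: (r[0], r[2]))
    out = []
    for rec in records:
        # rec[0] is path, rec[1] is hash: drop the line only if the
        # previous kept line is the same file with the same content.
        if not out or out[-1][0] != rec[0] or out[-1][1] != rec[1]:
            out.append(rec)
    return out

# Toy history: file "f" goes A -> A (no real change) -> B -> back to A.
recs = [
    ("f", "A", 1, "rich", "add file"),
    ("f", "A", 2, "rich", "unrelated commit"),  # same hash -> trimmed
    ("f", "B", 3, "rich", "change it"),
    ("f", "A", 4, "rich", "revert"),            # back to A -> kept
]
assert [r[2] for r in sort_and_trim(recs)] == [1, 3, 4]
```

Note that the revert back to A survives, because only *consecutive*
duplicates are dropped - which matters for the A-to-B-to-A case
discussed below.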

However, this involves walking the entire tree for every commit, which
is a lot of unnecessary work, since in any given commit little has
actually changed.

So, instead, you pass through the list in stages. The first stage
involves walking the linear list of commits (fortunately no branching
here) and dumping a list like the one above, except that instead of
path/file you have tree hashes.

At each subsequent step the algorithm is:
1. Read a line.
2. If it is a blob, output the same line.
3. If it is a tree, read the tree and output one line for each of its
contents, with all the info except the path/file and hash remaining
the same as on the line that was read. Be sure to insert the current
directory into the path for each line.
4. Go to step 1 until the entire file is read.
5. Sort the output by file and then date/time. Delete all but the
first line wherever consecutive lines have the same hash.
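
One expansion pass over the list can be sketched as follows. A real
run would read trees out of the repository (via `git cat-file` or a
library such as pygit2); the dict here is a toy in-memory stand-in for
the object store, and all hashes and paths are made up:

```python
# Fake object store: tree hash -> list of (kind, hash, name) entries.
TREES = {
    "t_root": [("tree", "t_src", "src"), ("blob", "b_readme", "README")],
    "t_src":  [("blob", "b_main", "main.c")],
}

def expand_once(lines):
    """One pass: copy blob lines through; replace each tree line with one
    line per entry, prefixing the tree's directory onto each path and
    carrying the commit metadata along unchanged."""
    out = []
    for kind, sha, path, meta in lines:
        if kind == "blob":
            out.append((kind, sha, path, meta))
        else:
            for ekind, esha, ename in TREES[sha]:
                out.append((ekind, esha, f"{path}/{ename}".lstrip("/"), meta))
    return out

# Stage one produced one line per commit, pointing at the root tree.
lines = [("tree", "t_root", "", ("2012-10-02", "rich", "initial import"))]

# Repeat until no trees remain, i.e. the whole tree has been descended.
while any(kind == "tree" for kind, *_ in lines):
    lines = expand_once(lines)

assert [path for _, _, path, _ in lines] == ["src/main.c", "README"]
```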

Each time you do this you descend one level deeper into the tree,
discarding the cases where a section of the tree contained no changes
from commit to commit. It is important to be careful with the sorting,
because if a tree changed from A to B and back to A you don't want to
lose one of the As, since it would have a unique date/time, author,
comment, etc.

You then repeat the entire process until you can go through the entire
file and find no trees left. Then, after the sort/trim step, you have
a complete per-file history for the entire repository.

I need to look at the CVS case next. However, I think CVS basically
stores its history per-file anyway, so that is probably just a matter
of walking the entire repository once and dumping the data in the same
format.

Where I use the word "line" above, that could be XML, a serialized
array of some sort, or whatever. It would probably be best not to do
this in a plain text file, to avoid issues with delimiters, Unicode,
etc. However, if there were a delimiter we could be certain appears
nowhere in the repository, a text file would work fine.
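
One way to sidestep the delimiter problem entirely is one JSON object
per line, since JSON escapes any characters that would otherwise
collide with a field separator (the field names below are just one
possible choice):

```python
import json

record = {
    "path": "dir/file.c",
    "hash": "ab12",
    "date": "2012-10-02 17:39",
    "author": "rich",
    "comment": "a comment with | pipes\tand\ttabs in it",
}

# Serialize to a single line; awkward characters are escaped, so newline
# remains a safe record delimiter.
line = json.dumps(record)
assert "\t" not in line and json.loads(line) == record
```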

That algorithm should work fine in parallel - just chop up the list
and have each worker process its chunk. The sort/trim step does have
to be global, but as long as you sort the whole list and chop it up on
file boundaries you could actually distribute the trim part as well.
It really does seem like a Map/Reduce from the perspective of the
list, but the workers need access to the git repository to "parse" the
list, and I'm not sure Hadoop/etc lets you do that easily. The map
function is essentially a simple function that happens to have a 2GB
"lookup table" inside.

Rich