Two items in this email - one is a quick status update, but I think
the more important issue is stepping back to figure out why we're even
doing this...

Quick status update - I can parse git repos and cvs repos and come up
with csv files that describe each fairly well.
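
For the curious, the git side boils down to something like this (the
field layout and function name here are illustrative, not exactly what
my script does):

import csv
import subprocess

def git_commits_to_csv(repo, out_path):
    # %H = commit hash, %an = author name, %at = author time (epoch).
    # --name-only lists the paths touched by each commit.
    log = subprocess.run(
        ["git", "-C", repo, "log", "--name-only",
         "--pretty=format:C\t%H\t%an\t%at"],
        capture_output=True, text=True, check=True).stdout
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["commit", "author", "timestamp", "path"])
        commit = author = ts = None
        for line in log.splitlines():
            if line.startswith("C\t"):
                _, commit, author, ts = line.split("\t")
            elif line:
                writer.writerow([commit, author, ts, line])
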
Issues remaining:
1. Subdirs deleted in cvs don't show up in cvs log, and therefore are
not in my list. That needs a move from cvs log to rlog to fix. (small
effort)
2. Files deleted in git don't show up. That needs a move from looking
at commits in isolation to doing pairwise comparisons. (significant
effort)
3. File hashes don't match, because the migration changes the headers.
(medium effort - see the sketch after this list)
4. Authors don't match, because these are also transformed. (medium effort)
5. Timestamps need a fuzz factor due to Manifest commit squashing.
(small effort on the comparison side)
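
On #3, the hash mismatch could probably be worked around by collapsing
CVS keywords before hashing, along these lines (the keyword list is my
guess at what the migration rewrites):

import hashlib
import re

_KEYWORD = re.compile(rb"\$(Header|Id|Revision|Date|Author|Source)(?::[^$\n]*)?\$")

def normalized_sha1(path):
    with open(path, "rb") as f:
        data = f.read()
    # Reduce "$Header: /cvsroot/file,v 1.2 ... $" to "$Header$" so the
    # pre- and post-migration copies hash identically.
    return hashlib.sha1(_KEYWORD.sub(rb"$\1$", data)).hexdigest()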

Overall the general sense I'm getting is that the migration is working
fine. Identifying subtle issues will require addressing most of the
items above - otherwise making sense of all the differences is
difficult without manual inspection. What I have manually inspected
has turned out fine. What I can glean from overall results also looks
good (number of files per commit, etc).

This leads me to my question regarding approach. Just what is the
goal of validation, and why are we doing it?

With the number of transformations involved in the git migration, it
is becoming apparent that the only way to really check it is
essentially to implement it twice independently and confirm that both
lead to the same output. I can cut corners like just applying a fuzz
factor to the timestamps, but really this is turning into implementing
the migration twice.
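
By fuzz factor I mean something like the following; the 15-minute
window is a guess on my part, not a measured value:

FUZZ_SECONDS = 15 * 60

def timestamps_match(cvs_ts, git_ts, fuzz=FUZZ_SECONDS):
    # Manifest commit squashing shifts times, so exact equality is
    # hopeless; a bounded difference is the best cheap check.
    return abs(cvs_ts - git_ts) <= fuzz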

As far as speed goes - all of this is coded in python so it isn't
optimal. Just about everything I'm doing can be run in parallel
(especially after switching to rlog), but it is going to consume an
hour or two most likely. For general testing of the migration process
I think that is adequate, but as a post-migration step it will
probably take longer than the migration itself (the cvs side can run
in parallel with migration at least).
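
Parallelizing it is straightforward in principle - something like
this, assuming the work splits by top-level directory (the module list
and the per-module function are placeholders):

from multiprocessing import Pool

def parse_module(module):
    # Placeholder for the real work: run rlog (or git log) for one
    # category and write out its csv.
    return module

if __name__ == "__main__":
    modules = ["app-editors", "dev-lang", "sys-apps"]  # hypothetical split
    with Pool() as pool:
        # Modules are independent, so throughput scales with core count.
        for done in pool.imap_unordered(parse_module, modules):
            print("finished", done)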
43 |
|
44 |
I'm open to suggestions, but rather than fully re-inventing the wheel |
45 |
I'm thinking that fixing issues #1 and #5 above might be as far as I |
46 |
go with this. They're easy to fix, and #1 is resulting in huge gaps. |
47 |
What that will tell us is that nothing is getting missed in the |
48 |
migration. |
49 |
|
50 |
Others are of course welcome to pitch in as well, but I still think |
51 |
we're re-inventing the wheel. I'm trying to focus my efforts on doing |
52 |
analysis that is likely to spot actual problems, and not just |
53 |
re-running the same functions on the same data to get the same answer. |
54 |
|
55 |
Code review of ferringb's work might be more productive in terms of |
56 |
spotting problems. So might be publishing his bundles and letting |
57 |
people spot check their favorite packages. |
58 |
|
59 |
If we were doing this at work we'd probably spot check data with |
60 |
formal comparison scripts (involving human comparison), and then |
61 |
preserve a copy of the cvs repo for its retention period just in case. |
62 |
|
63 |
What are the general thoughts here? I don't want to hold up moving |
64 |
forward with the migration to continuously refine a second |
65 |
implementation of something that is already implemented. |
66 |
|
67 |
Rich |