Dedication and Paranoia

One of our main projects had been sitting in CVS for a long time. Everything else under active development either started in SVN or was moved there some time ago. But this particular project always has something on the go, usually with several developers, so it never got moved because of time constraints and hassle. We recently set aside the time and migrated it over to SVN (using cvs2svn).

Everything seemed to go smoothly at first. Examining the code and comparing the fresh SVN checkout with my old CVS working copy didn’t show anything amiss. It compiled fine as well. However, it became clear there was a problem when, after starting up, nothing would load and the logs filled with exceptions, including a lot of “class def not found” exceptions.

A bit of hunting and we eventually determined that one of the included libraries was corrupted (confirmed by the md5sum not matching). We’re not sure how that had happened in the migration (we checked and it was indeed committed as a binary file in CVS), but we replaced it with the pre-migration jar and things seemed to start up fine. We conducted some tests, and couldn’t find any further issues, but my coworker was not feeling entirely comfortable calling it done and successful after our little jar issue.

Checksums of the other jars didn’t show anything wrong, but unfortunately we couldn’t compare the code itself: most of our files had CVS keyword tags in them, which had expanded differently after the migration, so every source file showed up as different. A little stumped, we decided that since the testing didn’t find anything, and since it was getting late, we’d call it done.
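
(An aside on those keyword tags: a line like "$Id: Foo.java,v 1.12 ... $" carries the revision and date, so it comes out different once the file lives in a new repository. One way around that, sketched below, is to collapse every expanded keyword back to its bare form before comparing. This is only an illustration and not what we ran that night; the function name normalize_keywords and the keyword list are my own assumptions, and it relies on a sed that supports -E.)

```shell
# Hypothetical helper (not part of the original scripts): collapse expanded
# RCS-style keywords such as "$Id: Foo.java,v 1.12 ... $" back to "$Id$".
# Reads the named files, or standard input when no file is given.
normalize_keywords() {
    sed -E 's/\$(Id|Header|Revision|Date|Source|RCSfile|Author)(:[^$]*)?\$/$\1$/g' "$@"
}

# Example use (the paths are placeholders):
#   normalize_keywords cvs/src/Foo.java > /tmp/old.txt
#   normalize_keywords svn/src/Foo.java > /tmp/new.txt
#   diff -w /tmp/old.txt /tmp/new.txt
```

Compared with deleting whole lines, this keeps the rest of each line in the comparison, so a real change sitting next to a keyword would still show up.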

Later that night, I got this email:

Hey, So I decided to figure out how to compare the files due to being really paranoid about it.

I got it pretty completely compared, and the only difference I can spot is a version number in a property. I got a list of every unique file extension in the tree, then added them all to the find command. I ended up regexing out all of the various $ID, $Author, etc before the compare, and ignoring whitespace in the compare.

After that I got the sha1sum (because being in paranoid mode, I was worried about a possible MD5 collision or something) of every jar in the whole tree, and diffed those, with no differences at all.

So, with all of the text based files compared, all of the JARs, and having identical file counts in both working copies, I’m much much more confident that everything is ok. Tomorrow I’ll do the same thing I did with the jars against all the other types of binary files, and then I’ll be completely satisfied. If you wouldn’t mind then, I’ll also ask you to look over what I did to make sure I’m sane.

Have a good night.

Dedication and paranoia! I’ve got a lot of respect for that. We didn’t find any additional problems, but it was good to check. You don’t want to take risks with your source repository!

Anyhow, if you’re at all interested, or ever need to do a similar sort of thing, here’s the scripting:

# List every unique file extension under the specified directory.
# Caveat: a file whose path contains no dot has its full path printed instead of an extension.
find <project dir> -type f | awk -F'.' '{print $NF}' | sort | uniq
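
If you would rather have counts as well, with extensionless files grouped together instead of printed as paths, something like the following might do. It is a sketch rather than what my coworker ran: the function name list_extensions and the "(none)" label are made up for illustration, it skips CVS and .svn metadata directories, and it assumes file names contain no newlines.

```shell
# Hypothetical helper: count files per extension under a directory.
# Only the base name is examined, so dots in directory names are ignored.
list_extensions() {
    find "$1" \( -name CVS -o -name .svn \) -prune -o -type f -print |
    awk -F/ '{
        n = split($NF, parts, ".")          # $NF is the file name without its path
        if (n < 2 || (n == 2 && parts[1] == ""))
            ext = "(none)"                  # no dot, or a dotfile such as .classpath
        else
            ext = parts[n]                  # text after the last dot
        count[ext]++
    }
    END { for (e in count) print count[e], e }' |
    LC_ALL=C sort -k2
}

# Example use (the path is a placeholder):
#   list_extensions /path/to/project
```

Each output line is a count followed by the extension, which makes it easier to spot the one-off file types worth a closer look.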

# Diff (-w ignores whitespace) the concatenated contents of every file with one of the listed
# extensions in the two directories. Lines containing "@author", "@version", "$Header:",
# "$Id:", "$Revision:", "$Date:", "$Source:" or "$RCSfile:" are removed first.
# Needs bash for the <( ) process substitution; the trailing backslashes keep it one command.
# Both find commands have to visit the files in the same order for the comparison to line up.
## No output means no differences were found
diff -w \
  <(find <svn project dir> -type f \( -name "*.java" -o -name "*.properties" -o -name "*.xml" \
      -o -name "*.dtd" -o -name "*.xsl" -o -name "*.xsd" -o -name "*.wsdd" -o -name "*.tld" \
      -o -name "*.sql" -o -name "*.sh" -o -name "*.prefs" -o -name "*.jsp" -o -name "*.js" \
      -o -name "*.html" -o -name "*.htm" -o -name "*.groovy" -o -name "*.dml" -o -name "*.cgi" \
      -o -name "*.conf" -o -name "*.csv" -o -name "*.css" -o -name "*.bat" -o -name "*.aj" \) \
      -exec cat {} \; \
    | sed -e '/\$Header:/d' -e '/@author/d' -e '/@version/d' -e '/\$Id:/d' \
          -e '/\$Revision:/d' -e '/\$Date:/d' -e '/\$Source:/d' -e '/\$RCSfile:/d') \
  <(find <cvs project dir> -type f \( -name "*.java" -o -name "*.properties" -o -name "*.xml" \
      -o -name "*.dtd" -o -name "*.xsl" -o -name "*.xsd" -o -name "*.wsdd" -o -name "*.tld" \
      -o -name "*.sql" -o -name "*.sh" -o -name "*.prefs" -o -name "*.jsp" -o -name "*.js" \
      -o -name "*.html" -o -name "*.htm" -o -name "*.groovy" -o -name "*.dml" -o -name "*.cgi" \
      -o -name "*.conf" -o -name "*.csv" -o -name "*.css" -o -name "*.bat" -o -name "*.aj" \) \
      -exec cat {} \; \
    | sed -e '/\$Header:/d' -e '/@author/d' -e '/@version/d' -e '/\$Id:/d' \
          -e '/\$Revision:/d' -e '/\$Date:/d' -e '/\$Source:/d' -e '/\$RCSfile:/d')
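
One drawback of concatenating everything is that a difference shows up without telling you which file it came from. A possible variation, sketched here under my own assumptions, compares the trees one file at a time and prints the relative path of anything that differs or is missing. The names strip_keyword_lines and compare_trees are hypothetical, the pattern argument defaults to "*.java" purely for the example, and file names containing newlines are not handled. It only walks the first tree, so files that exist only in the second tree are not reported.

```shell
# Hypothetical helper: print a file with keyword and javadoc tag lines removed
# (the same lines the one-liner above deletes).
strip_keyword_lines() {
    sed -e '/\$Header:/d' -e '/\$Id:/d' -e '/\$Revision:/d' -e '/\$Date:/d' \
        -e '/\$Source:/d' -e '/\$RCSfile:/d' -e '/@author/d' -e '/@version/d' "$1"
}

# Hypothetical helper: compare_trees OLD_ROOT NEW_ROOT [NAME_PATTERN]
# Prints "DIFFERS: path" or "MISSING: path"; no output means no differences.
compare_trees() {
    old_root=${1%/}
    new_root=${2%/}
    pattern=${3:-*.java}
    tmp_old=$(mktemp) && tmp_new=$(mktemp) || return 2
    # Walk the old tree, skipping version-control metadata directories.
    find "$old_root" \( -name CVS -o -name .svn \) -prune -o \
         -type f -name "$pattern" -print |
    while IFS= read -r old_file; do
        rel=${old_file#"$old_root"/}
        new_file=$new_root/$rel
        if [ ! -f "$new_file" ]; then
            echo "MISSING: $rel"
            continue
        fi
        strip_keyword_lines "$old_file" > "$tmp_old"
        strip_keyword_lines "$new_file" > "$tmp_new"
        # -w ignores whitespace, as in the one-liner above.
        diff -w "$tmp_old" "$tmp_new" > /dev/null || echo "DIFFERS: $rel"
    done
    rm -f "$tmp_old" "$tmp_new"
}

# Example use (the paths are placeholders):
#   compare_trees /path/to/cvs-copy /path/to/svn-copy '*.xml'
```

Running it once per extension is slower than the single diff, but the output points straight at the file to open.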

# Diff the SHA-1 sum of every binary file in the two directories. The sed strips the
# directory prefix so the paths compare as relative, and the sort (by path) makes the
# two listings line up regardless of the order find visits the files in.
## No output means no differences were found
diff \
  <(find <svn project dir> -type f \( -name "*.jar" -o -name "*.bmp" -o -name "*.dll" \
      -o -name "*.doc" -o -name "*.exe" -o -name "*.GIF" -o -name "*.gif" -o -name "*.ico" \
      -o -name "*.iml" -o -name "*.ipr" -o -name "*.JPG" -o -name "*.jpg" -o -name "*.pdf" \
      -o -name "*.png" -o -name "*.rpt" -o -name "*.xls" \) \
      -exec sha1sum {} \; | sed 's|<svn project dir>/||' | sort -k2) \
  <(find <cvs project dir> -type f \( -name "*.jar" -o -name "*.bmp" -o -name "*.dll" \
      -o -name "*.doc" -o -name "*.exe" -o -name "*.GIF" -o -name "*.gif" -o -name "*.ico" \
      -o -name "*.iml" -o -name "*.ipr" -o -name "*.JPG" -o -name "*.jpg" -o -name "*.pdf" \
      -o -name "*.png" -o -name "*.rpt" -o -name "*.xls" \) \
      -exec sha1sum {} \; | sed 's|<cvs project dir>/||' | sort -k2)

# Diff the number of files in the two directories
## No output means no differences were found
diff <(find <svn project dir> -type f | wc -l) <(find <cvs project dir> -type f | wc -l)
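
Matching counts don't strictly prove that the same files are present, since one tree could be missing a file and have an extra one somewhere else. If you want to close that gap, a sketch along these lines compares the sorted lists of relative paths instead. The function names list_files and compare_file_lists are hypothetical, and the CVS and .svn metadata directories are skipped on the assumption that you are comparing working copies (a plain find would include the files inside them).

```shell
# Hypothetical helper: print every file under a root as a sorted relative path,
# skipping version-control metadata directories.
list_files() {
    ( cd "$1" && find . \( -name CVS -o -name .svn \) -prune -o -type f -print |
      LC_ALL=C sort )
}

# Hypothetical helper: compare_file_lists ROOT_A ROOT_B
# Prints diff output for paths present in only one tree; no output means the
# two trees contain the same set of file paths.
compare_file_lists() {
    list_a=$(mktemp) || return 2
    list_b=$(mktemp) || return 2
    list_files "$1" > "$list_a"
    list_files "$2" > "$list_b"
    diff "$list_a" "$list_b"
    rc=$?
    rm -f "$list_a" "$list_b"
    return $rc
}

# Example use (the paths are placeholders):
#   compare_file_lists /path/to/cvs-copy /path/to/svn-copy
```

Lines starting with "<" are paths found only in the first tree, and lines starting with ">" are paths found only in the second.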