These two programs give a quantitative measure of how similar two files are. similarity_by_diff counts the difference lines reported by diff(1), while similarity_by_zlib compresses the two files separately and then together: if they are similar, compressing the concatenation should take less space than compressing each file on its own.
The output of each program is a single line giving the two filenames followed by a similarity rating from 0 to 1 inclusive. The higher the rating, the more similar the files: 1 means they are identical or very nearly so.
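The compression idea can be sketched in a few lines of Python. This is only an illustration, not the code of similarity_by_zlib: the function names and the exact formula (space saved by compressing together, divided by the smaller of the two individual compressed sizes, clamped to the range 0 to 1) are assumptions made for the example.

```python
import zlib


def similarity_by_compression(a: bytes, b: bytes) -> float:
    """Rate two byte strings from 0 (unrelated) to 1 (identical or nearly).

    Assumed formula, not taken from similarity_by_zlib itself: the space
    saved by compressing a + b together, relative to the smaller of the
    two individually compressed sizes.
    """
    size_a = len(zlib.compress(a))
    size_b = len(zlib.compress(b))
    size_ab = len(zlib.compress(a + b))
    saving = size_a + size_b - size_ab
    # zlib output is never empty, so the divisor is always positive.
    rating = saving / min(size_a, size_b)
    # Clamp, since header overhead can push the raw value slightly out of range.
    return max(0.0, min(1.0, rating))


def report(path_a: str, path_b: str) -> str:
    """Format one output line: both filenames, then the rating."""
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        rating = similarity_by_compression(file_a.read(), file_b.read())
    return f"{path_a} {path_b} {rating:.3f}"
```

Calling report("foo", "foo.modified") would give a line in the same three-field shape described above, so it could be sorted on the third field in the same way.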
The tool 'pairs' is just a quick way to run these programs on all pairs taken from a list of files.
For example, to find duplicate or almost-duplicate files in the current directory:
    pairs similarity_by_diff * | sort -k 3 -rn | head
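What 'pairs' does in that pipeline can be sketched roughly as follows. The names all_pairs and run_on_pairs are made up for this example, and it assumes that each unordered pair is visited once; the real tool may order or repeat pairs differently.

```python
from itertools import combinations
import subprocess


def all_pairs(files):
    """Return every unordered pair of distinct entries, keeping list order."""
    return list(combinations(files, 2))


def run_on_pairs(command, files):
    """Run `command file_a file_b` for each pair and collect the output.

    `command` is a list such as ["similarity_by_diff"]. Arguments are
    passed as a list, so no shell is involved in building the command.
    """
    output = []
    for file_a, file_b in all_pairs(files):
        result = subprocess.run(
            [*command, file_a, file_b],
            capture_output=True, text=True, check=True,
        )
        output.append(result.stdout)
    return "".join(output)
```

With n files this runs the command n(n-1)/2 times, which is worth keeping in mind before pointing it at a large directory.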
One application is in keeping track of diverging source code. If you
have a modified version 'foo.modified' of a source file, you may wish
to diff it to see what changes it introduces. But if 'foo' has
changed in the meantime, it might not be obvious what version
'foo.modified' was forked from. If you use CVS you can say something
like
    for i in `seq 1 10`; do cvs update -p -r 1.$i foo >foo.$i; done
to check out revisions 1.1 to 1.10 of foo into the current directory
(adjust the range to cover every revision), and then
    (for i in `seq 1 10`; do similarity_by_diff foo.$i foo.modified; done) \
      | sort -k 3 -rn | head
to see which version of foo is most similar, and thus the best
starting point for making a diff.
If you have an even messier situation with several different versions of the same source file to be merged, you could start by finding the most similar pair using the first example above.
The tools might also work for detecting plagiarism, though it is easy to fool diff(1) into thinking every line is different. Perhaps similarity_by_zlib would do a better job. In general, _by_diff works better with text files and _by_zlib with binary files (that are not themselves compressed).
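To see how easily a line-based measure is fooled, here is a small sketch that uses Python's difflib as a stand-in for diff(1). The function name and the rating formula (one minus the fraction of lines that take part in any difference) are assumptions for illustration, not the formula used by similarity_by_diff.

```python
from difflib import SequenceMatcher


def similarity_by_lines(a: str, b: str) -> float:
    """Rate two texts from 0 to 1 by the share of lines left unchanged.

    Assumed formula: 1 - (lines involved in a difference) / (total lines).
    """
    lines_a = a.splitlines()
    lines_b = b.splitlines()
    total = len(lines_a) + len(lines_b)
    if total == 0:
        return 1.0  # two empty texts are treated as identical
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return 1.0 - changed / total


original = "a = 1\nb = 2\nc = a + b\n"
# Append one trailing space to every line: invisible, but each line now differs.
disguised = "".join(line + " \n" for line in original.splitlines())

print(similarity_by_lines(original, original))   # 1.0
print(similarity_by_lines(original, disguised))  # 0.0
```

A trivial edit to every line drops the rating from 1 to 0 even though the content is unchanged, which is why a measure that looks inside lines may be more robust for this purpose.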
similarity_by_diff should allow options like --ignore-blank-lines to be passed along to diff.
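Because of the way zlib works, similarity_by_zlib isn't too great for files larger than 32 KiB. The compressor can only refer back to the previous 32 KiB of input when looking for repeated data, so on the concatenated files it effectively compares just the last 32 KiB of one file with the first 32 KiB of the other.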
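The effect of the 32 KiB limit is easy to observe directly. The sketch below (the helper name joined_ratio is made up for the example) compresses a block of random bytes followed by an exact copy of itself and compares the result with compressing the block once.

```python
import os
import zlib


def joined_ratio(data: bytes) -> float:
    """Size of compress(data + data) relative to compress(data)."""
    return len(zlib.compress(data + data)) / len(zlib.compress(data))


small = os.urandom(8 * 1024)    # fits well inside the 32 KiB window
large = os.urandom(128 * 1024)  # far larger than the window

print(f"8 KiB block:   {joined_ratio(small):.2f}")
print(f"128 KiB block: {joined_ratio(large):.2f}")
```

For the small block the ratio comes out close to 1, because the second copy is encoded almost entirely as references to the first. For the large block it comes out close to 2: by the time the copy starts, the matching data is already out of reach, so two identical files look unrelated.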
Because similarity_by_diff is written as a shell script, it's likely to get tripped up by filenames containing odd characters.
By Ed Avis <ed@membled.com>. This software is in the public domain, which means there are no copyright restrictions. Giving credit is polite but not required.