similarity-utils

Two programs that give a quantitative measure of how similar two files are. similarity_by_diff counts the difference lines reported by diff(1), while similarity_by_zlib compresses the two files separately and then concatenated together: if they are similar, compressing the concatenation should take less space than compressing each file on its own.

Each program prints one line giving the two filenames followed by a similarity rating from 0 to 1 inclusive. The higher the rating, the more similar the files; 1 means identical or very nearly so.
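
To make the diff-based idea concrete, here is a minimal sketch of how such a rating could be computed. It is not the shipped similarity_by_diff script: the function name diff_similarity and the formula (one minus the fraction of lines that diff marks as changed) are assumptions chosen for illustration, and the real program may weight things differently.

```shell
# Hypothetical sketch, not the shipped similarity_by_diff script.
# Rating = 1 - changed_lines / total_lines, clamped to the range 0..1.
diff_similarity() {
    # Count the "<" and ">" lines that plain diff(1) output uses for differences.
    # "|| true" keeps the assignment from failing when grep finds nothing.
    changed=$(diff -- "$1" "$2" | grep -c '^[<>]' || true)
    # Total number of lines in both files together.
    total=$(cat -- "$1" "$2" | wc -l)
    rating=$(awk -v c="$changed" -v t="$total" 'BEGIN {
        r = (t > 0) ? 1 - c / t : 1    # two empty files count as identical
        if (r < 0) r = 0
        printf "%.3f", r
    }')
    # Same output shape as described above: two filenames, then the rating.
    printf '%s %s %s\n' "$1" "$2" "$rating"
}
```

Called as `diff_similarity old.txt new.txt`, this prints the two names and a rating on one line, so the result can be sorted on the third field just like the output of the real programs.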

The tool 'pairs' is just a quick way to run these programs on all pairs taken from a list of files.
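
For readers curious what "all pairs" amounts to, the loop below is a rough stand-in. The name all_pairs is made up here, and it assumes each unordered pair is visited once; the real 'pairs' tool may order or repeat pairs differently.

```shell
# Hypothetical stand-in for 'pairs': run a command on every unordered pair.
# Usage: all_pairs COMMAND FILE...
# Note: plain sh has no local variables, so cmd, first and second are global.
all_pairs() {
    cmd=$1
    shift
    # Take the first remaining file and pair it with each file after it.
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        for second in "$@"; do
            "$cmd" "$first" "$second"
        done
    done
}
```

For example, `all_pairs echo a b c` runs echo three times, once each for the pairs a b, a c and b c.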

Examples

As a way to find duplicate or almost-duplicate files:

pairs similarity_by_diff * | sort -k 3 -rn | head

One application is in keeping track of diverging source code. If you have a modified version 'foo.modified' of a source file, you may wish to diff it to see what changes it introduces. But if 'foo' has changed in the meantime, it might not be obvious which version 'foo.modified' was forked from. If you use CVS you can say something like

for i in `seq 1 10`; do cvs update -p -r 1.$i foo >foo.$i; done

to check out revisions 1.1 to 1.10 of foo into the current directory, and then

(for i in `seq 1 10`; do similarity_by_diff foo.$i foo.modified; done) \
| sort -k 3 -rn | head

to see which version of foo is most similar, and thus the best starting point for making a diff.

If you have an even messier situation with several different versions of the same source file to be merged, you could start by finding the most similar pair using the first example above.

The tools might also work for detecting plagiarism, though it is easy to fool diff(1) into thinking every line is different. Perhaps similarity_by_zlib would do a better job. In general, _by_diff works better with text files and _by_zlib with binary files (that are not themselves compressed).

Download

The latest version is 0.2.2. This fixes similarity_by_diff for filenames containing spaces.

Installation

./install --prefix=/usr/local
or whatever prefix you prefer.

Bugs

similarity_by_diff should allow options like --ignore-blank-lines to be passed along to diff.

Because zlib's deflate compression only looks back over a 32 KiB window, similarity_by_zlib isn't too great for files longer than 32 KiB. In effect it compares only the last 32 KiB of one file with the first 32 KiB of the other.
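
The window limit is easy to see with a small experiment. The sketch below is only an approximation of the approach: it uses gzip(1), which shares the deflate format and its 32 KiB window, in place of zlib, and the helper names (csize, gzip_similarity, noise) and the rating formula are assumptions made up for this illustration rather than what similarity_by_zlib actually does.

```shell
# Hypothetical sketch of a compression-based rating; gzip(1) stands in for zlib.
csize() { gzip -c | wc -c; }    # compressed size in bytes of standard input

gzip_similarity() {
    a=$(csize < "$1")
    b=$(csize < "$2")
    ab=$(cat -- "$1" "$2" | csize)
    # Bytes saved by compressing together, as a fraction of the smaller file.
    awk -v a="$a" -v b="$b" -v ab="$ab" 'BEGIN {
        m = (a < b) ? a : b
        r = (a + b - ab) / m
        if (r < 0) r = 0
        if (r > 1) r = 1
        printf "%.3f\n", r
    }'
}

# Pseudo-random text with no long-range repetition: $1 lines of about 9 bytes.
noise() {
    awk -v n="$1" 'BEGIN { srand(1); for (i = 0; i < n; i++) print rand() }'
}

work=$(mktemp -d)
noise 1000  > "$work/small.txt"    # roughly 9 KB, fits inside the window
noise 30000 > "$work/large.txt"    # roughly 270 KB, far larger than the window

# Each file is compared with itself, so a perfect measure would print 1 twice.
gzip_similarity "$work/small.txt" "$work/small.txt"    # close to 1
gzip_similarity "$work/large.txt" "$work/large.txt"    # close to 0
rm -rf "$work"
```

The large file scores near zero against itself because, by the time the compressor reaches the second copy, the matching data in the first copy is more than 32 KiB behind it and can no longer be referenced.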

Because similarity_by_diff is written as a shell script, it's likely to get tripped up by filenames containing odd characters.

Author and copying

By Ed Avis <ed@membled.com>. This software is in the public domain, which means there are no copyright restrictions. Giving credit is polite but not required.


Ed Avis
Last modified: Fri Nov 7 15:05:11 GMT 2008