Ok, today's pop quiz question - who can offer me an efficient file comparison algorithm?
I've tried a first pass that finds books with the same file size, then a second pass that compares their sha256 hashes. However, this has two problems - (a) it is still pretty darn slow for large libraries (around 4.5 minutes to scan a 40,000 book library with a fair few formats), and (b) after all that it still isn't "accurate" enough: it returns a bunch of "duplicates" which really aren't duplicates at all; they just "hash" together.
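For reference, here is a rough sketch of the sort of thing I mean, not the actual plugin code. It assumes a plain list of file paths rather than a real library database, and every name in it (`sha256_of`, `group_by`, `find_duplicates`, `prefix_bytes`) is made up for illustration. It adds two stages to the size-then-hash idea: a cheap hash of only the first 64 KB to thin out same-size groups before hashing whole files, and a final byte-for-byte check so that nothing is reported unless the files really are identical on disk.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def sha256_of(path, limit=None, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, or of its first `limit` bytes."""
    digest = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            to_read = chunk_size if remaining is None else min(chunk_size, remaining)
            block = handle.read(to_read)
            if not block:
                break
            digest.update(block)
            if remaining is not None:
                remaining -= len(block)
    return digest.hexdigest()


def group_by(paths, key):
    """Bucket paths by key(path), keeping only buckets with two or more members."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return [group for group in buckets.values() if len(group) > 1]


def find_duplicates(paths, prefix_bytes=64 * 1024):
    """Return lists of paths whose contents are byte-for-byte identical."""
    results = []
    # Stage 1: same size. Stage 2: same hash of the first prefix_bytes.
    # Stage 3: same full-file hash. Stage 4: direct byte comparison.
    for same_size in group_by(paths, os.path.getsize):
        for same_prefix in group_by(same_size, lambda p: sha256_of(p, prefix_bytes)):
            for same_hash in group_by(same_prefix, sha256_of):
                first = same_hash[0]
                confirmed = [first] + [
                    p for p in same_hash[1:]
                    if filecmp.cmp(first, p, shallow=False)
                ]
                if len(confirmed) > 1:
                    results.append(confirmed)
    return results
```

Calling `find_duplicates(list_of_paths)` gives back a list of groups, each group being the paths that matched. Whether the prefix stage actually saves time depends on how many same-size files differ early on, so the 64 KB figure is a guess rather than a tuned value.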
Suggestions on a postcard, please.