MobileRead Forums - View Single Post

kiwidude · 04-25-2011, 02:15 PM

Changes in this release:

New GUI allowing any combination of title and author algorithms
Changed similar author algorithm to be more conservative (all initials must match)
Added a fuzzy author algorithm which compares using the last name and the initial of the first name. Ignores common suffixes like Jr, Sr, etc.
Added a fuzzy title algorithm which strips any subtitles, and anything after the keywords "and", "or" and "aka", provided they are not the first word in the title. Pretty aggressive, but it catches a lot of the cases mentioned on this thread
Added a soundex algorithm for both titles and authors. This may need some tuning for the length of the soundex but is pretty handy for catching common misspellings in titles
Tweaked the "identical" title/author comparisons to ensure casing differences are treated as a match.
Added option to sort the results by groups with the most candidates first
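
For anyone curious what the fuzzy author and soundex comparisons in the list above boil down to, here is a rough sketch. It is not the code from "algorithms.py": the function names, the suffix list and the default soundex length of 6 are all assumptions made for illustration, and the soundex shown is the textbook American Soundex rather than whatever variant the plugin ships.

```python
import re

# Assumed suffix list; the post only names "Jr, Sr etc".
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def fuzzy_author_key(author):
    """Reduce an author to 'lastname initial' (hypothetical helper)."""
    name = author.lower()
    # Turn "Last, First" into "First Last", unless the part after the
    # comma is just a suffix such as "Jr.".
    parts = [p.strip() for p in name.split(",")]
    if len(parts) == 2 and re.sub(r"[^a-z]", "", parts[1]) not in SUFFIXES:
        name = parts[1] + " " + parts[0]
    # Replace punctuation with spaces, then drop suffix tokens.
    tokens = [t for t in re.sub(r"[^\w\s]", " ", name).split()
              if t not in SUFFIXES]
    if not tokens:
        return ""
    if len(tokens) == 1:
        return tokens[0]
    return tokens[-1] + " " + tokens[0][0]

# Letter-to-digit table for classic Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(text, length=6):
    """Classic Soundex, padded or truncated to `length` characters."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return result[:length].ljust(length, "0")
```

Two books would then be candidates for the same group when their keys are equal, for example when "Asimov, Isaac" and "I. Asimov" both reduce to the same author key, or when a misspelt title produces the same soundex code as the correct one. A longer `length` makes the soundex match stricter, a shorter one makes it looser.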

For the more technically minded (or interested), you can now find all the algorithms and the test code/cases for them in "algorithms.py" in the zip file. You can run this yourself with "calibre-debug -e algorithms.py", so you can see the range of permutations I currently test for and those that I still expect not to be caught.

In terms of the examples posted earlier on this thread, I think all of them can now be found by one algorithm or another, with the exception of this one:
Foundation 5 - Foundation and Earth
Foundation and Earth

It will however find this:
Foundation and Earth - Foundation 5
Foundation and Earth
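
To show why the order matters in the two Foundation examples above, here is a minimal sketch of a fuzzy title key along the lines described in the release notes. It is only an illustration and not the plugin's actual code: the function name is hypothetical, and treating " - ", ":" and "(" as the start of a subtitle is an assumption.

```python
import re

# Keywords named in the release notes.
KEYWORDS = {"and", "or", "aka"}

def fuzzy_title_key(title):
    """Build a comparison key for a title (hypothetical helper)."""
    text = title.lower()
    # Assumption: a subtitle starts at " - ", ":" or "(".
    text = re.split(r"\s-\s|:|\(", text, maxsplit=1)[0]
    # Replace punctuation with spaces and split into words.
    words = re.sub(r"[^\w\s]", " ", text).split()
    # Cut at the first keyword, unless it is the first word of the title.
    for i, word in enumerate(words):
        if i > 0 and word in KEYWORDS:
            words = words[:i]
            break
    return " ".join(words)

print(fuzzy_title_key("Foundation and Earth - Foundation 5"))
print(fuzzy_title_key("Foundation and Earth"))
print(fuzzy_title_key("Foundation 5 - Foundation and Earth"))
```

Under these assumptions the first two titles both reduce to "foundation", so they land in the same group, while the third reduces to "foundation 5" because everything after the " - " is discarded as a subtitle before the real title is ever seen.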

Of course it is pretty easy to do a sanity check on your library using "title:-" or the Quality Check plugin to detect such cases and fix them before you do your duplicate run.

Look forward to hearing what you think. My todo list with this is now done - with the possible exception of some slightly improved tag browser