find_identical_books in database2.py is used by the autosort/automerge code to find books that have identical author(s) and nearly identical titles (see fuzzy_title). If the autosort/automerge option is on, incoming books are compared to existing book records with find_identical_books, and when a match is found the incoming format is added to the existing record.
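To make the merge flow concrete, here is a rough sketch of the idea. It is not calibre's code: normalize_title is a hypothetical stand-in for fuzzy_title (whose real rules differ), and the list-of-dicts "library" is just an assumption for illustration.

```python
def normalize_title(title):
    # Hypothetical stand-in for fuzzy_title: lowercase and collapse whitespace.
    return " ".join(title.lower().split())


def match_key(title, authors):
    # Near-identical title plus identical author(s), compared case-insensitively.
    return (normalize_title(title), tuple(a.lower() for a in authors))


def auto_merge(library, incoming):
    """Add the incoming format to a matching record, or create a new record.

    library: list of dicts with "title", "authors" (list) and "formats" (set).
    incoming: dict with "title", "authors" (list) and "format" (str).
    """
    key = match_key(incoming["title"], incoming["authors"])
    for record in library:
        if match_key(record["title"], record["authors"]) == key:
            # Match found: merge the format into the existing record.
            record["formats"].add(incoming["format"])
            return record
    # No match: the incoming book becomes a new record.
    record = {
        "title": incoming["title"],
        "authors": list(incoming["authors"]),
        "formats": {incoming["format"]},
    }
    library.append(record)
    return record
```

Because the merge happens without anyone looking at it, the key has to be strict. Any false match silently puts a format on the wrong book.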
I considered doing fuzzy matching on authors, and more aggressive fuzzy matching on titles, but for automatic merging there were too many errors. If you're going to write a duplicate finder, you can be aggressive, provided you only display the results and merging is done manually.
I don't know whether you want to compare titles only or also consider authors. I was thinking about multiple types of dupe finding, selectable during the search: 1) match on title (fuzzy) only; 2) match on title (fuzzy) and exact author (this is what find_identical_books produces); and 3) match on title (fuzzy but aggressive, ignoring plurals) and author (fuzzy, ignoring initials, Jr., second authors, etc.).
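The three modes could be sketched roughly like this. Everything here is an assumption for illustration: the normalizers are simplistic, hypothetical helpers (not calibre's fuzzy_title), and the plural handling in particular is deliberately crude.

```python
import re
from collections import defaultdict


def loose_title(title):
    # Hypothetical fuzzy title: lowercase, drop punctuation and a leading article.
    t = re.sub(r"[^\w\s]", " ", title.lower()).strip()
    t = re.sub(r"^(a|an|the)\s+", "", t)
    return " ".join(t.split())


def aggressive_title(title):
    # Also strip a trailing "s" from longer words (very crude plural handling).
    words = loose_title(title).split()
    return " ".join(
        w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
        for w in words
    )


def exact_authors(authors):
    return tuple(sorted(a.lower() for a in authors))


def loose_author(authors):
    # First author only; drop initials and Jr./Sr. suffixes.
    first = re.sub(r"[^\w\s]", " ", authors[0].lower())
    return " ".join(p for p in first.split() if len(p) > 1 and p not in {"jr", "sr"})


KEY_FUNCS = {
    "title": lambda b: loose_title(b["title"]),
    "title_author": lambda b: (loose_title(b["title"]), exact_authors(b["authors"])),
    "aggressive": lambda b: (aggressive_title(b["title"]), loose_author(b["authors"])),
}


def find_duplicates(books, mode):
    """Return lists of book ids that share a key under the chosen mode."""
    groups = defaultdict(list)
    for book in books:
        groups[KEY_FUNCS[mode](book)].append(book["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

Since the results are only displayed and any merging is done by hand, the aggressive mode can afford false positives that would be unacceptable for automerge.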
You will find a couple of threads on duplicate finding, if you search here, that provide SQL searches to be run with the calibre-debug tool or with an SQL database browser like SQLiteSpy. I've found that most of my "duplicates" on title only are not really duplicates, just similar or identical titles.
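For reference, a title-only search of that kind might look something like the sketch below. It runs against a simplified, hypothetical books table (just id and title) in an in-memory SQLite database; the real library schema has more columns and tables, so treat the table layout as an assumption and adapt the query before running it on an actual database.

```python
import sqlite3


def title_duplicates(conn):
    """Return (lowercased title, sorted ids) for titles that occur more than once."""
    rows = conn.execute(
        "SELECT lower(title), group_concat(id) "
        "FROM books "
        "GROUP BY lower(title) "
        "HAVING count(*) > 1 "
        "ORDER BY lower(title)"
    ).fetchall()
    return [(title, sorted(int(i) for i in ids.split(","))) for title, ids in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema, for illustration only.
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO books (id, title) VALUES (?, ?)",
        [(1, "Collected Poems"), (2, "collected poems"), (3, "Emma")],
    )
    for title, ids in title_duplicates(conn):
        print(title, ids)
```

Two different poets' "Collected Poems" would both show up here, which is exactly the kind of title-only "duplicate" that turns out not to be one.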