MobileRead Forums - View Single Post

Starson17 · 12-10-2010, 10:42 AM

find_identical_books in database2.py is used in the autosort/automerge code to find books that have identical author(s) and nearly identical titles (see fuzzy_title). If the autosort/automerge option is on, incoming books are compared to existing book records with find_identical_books, and when a match is found the incoming format is added to the existing record.
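To make the matching rule concrete, here is a rough sketch of the idea, not the actual calibre code. The names simple_fuzzy_title and find_identical are made up for illustration, the normalization rules (lowercase, strip punctuation, drop a leading article) are only my assumption of what "nearly identical" might mean, and the library is a plain dict rather than the real database.

```python
import re

def simple_fuzzy_title(title):
    """Hypothetical stand-in for fuzzy_title: lowercase, replace
    punctuation with spaces, drop a leading article, collapse spaces."""
    t = re.sub(r"[^\w\s]", " ", title.lower()).strip()
    t = re.sub(r"^(a|an|the)\s+", "", t)
    return re.sub(r"\s+", " ", t).strip()

def find_identical(incoming, library):
    """Return ids of library records whose author set is exactly the same
    and whose normalized title equals the incoming book's."""
    key_title = simple_fuzzy_title(incoming["title"])
    key_authors = frozenset(incoming["authors"])
    return {
        book_id
        for book_id, rec in library.items()
        if frozenset(rec["authors"]) == key_authors
        and simple_fuzzy_title(rec["title"]) == key_title
    }

# Toy library: id -> record (assumed shape, for illustration only)
library = {
    1: {"title": "The Hobbit", "authors": ["J. R. R. Tolkien"]},
    2: {"title": "Hobbit!", "authors": ["Someone Else"]},
}
incoming = {"title": "hobbit", "authors": ["J. R. R. Tolkien"]}
matches = find_identical(incoming, library)
```

With automerge on, the incoming format would be attached to each record in matches. Because the author comparison is exact, record 2 is left alone even though its title normalizes to the same string.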

I considered doing fuzzy matching on authors, and more aggressive fuzzy matching on titles, but for automatic merging there were too many errors. If you're going to write a duplicate finder, you can be aggressive, provided you only display the results and merging is done manually.

I don't know whether you want to compare only titles or also consider authors. I was thinking about multiple types of dupe finding, selectable during the search: 1) match on title (fuzzy) only; 2) match on title (fuzzy) and exact author (this is what find_identical_books produces); and 3) match on title (fuzzy but aggressive, ignoring plurals) and author (fuzzy: ignore initials, Jr., second authors, etc.).
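The three modes could be sketched roughly like this. Everything here is hypothetical: the function names, the dict-based library, the crude plural folding (dropping a trailing "s") and the author rule (first author only, single-letter initials and Jr./Sr. removed) are just one way to read the options above. The result is only a list of candidate groups to display, with no merging.

```python
import re
from collections import defaultdict

def fuzzy_title(title, aggressive=False):
    """Lowercase, strip punctuation, drop a leading article.
    aggressive=True also folds simple plurals (assumed rule)."""
    words = re.sub(r"[^\w\s]", " ", title.lower()).split()
    if len(words) > 1 and words[0] in ("a", "an", "the"):
        words = words[1:]
    if aggressive:
        # Crude plural folding: drop a trailing "s" from longer words.
        words = [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]
    return " ".join(words)

def fuzzy_author(authors):
    """First author only; drop initials and Jr./Sr. suffixes."""
    if not authors:
        return ""
    tokens = re.sub(r"[^\w\s]", " ", authors[0].lower()).split()
    return " ".join(t for t in tokens if len(t) > 1 and t not in ("jr", "sr"))

def duplicate_key(book, mode):
    if mode == 1:   # fuzzy title only
        return (fuzzy_title(book["title"]),)
    if mode == 2:   # fuzzy title + exact author(s)
        return (fuzzy_title(book["title"]), tuple(sorted(book["authors"])))
    if mode == 3:   # aggressive fuzzy title + fuzzy author
        return (fuzzy_title(book["title"], aggressive=True),
                fuzzy_author(book["authors"]))
    raise ValueError("mode must be 1, 2 or 3")

def find_duplicate_groups(books, mode):
    """Group book ids by key; return only groups with more than one id."""
    groups = defaultdict(list)
    for book_id, book in books.items():
        groups[duplicate_key(book, mode)].append(book_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

books = {
    1: {"title": "Rendezvous with Rama", "authors": ["Arthur C. Clarke"]},
    2: {"title": "Rendezvous With Rama", "authors": ["Arthur Clarke", "Gentry Lee"]},
    3: {"title": "The Robots of Dawn", "authors": ["Isaac Asimov"]},
    4: {"title": "Robot of Dawn", "authors": ["Isaac Asimov, Jr."]},
}
```

On this toy data, mode 1 groups only the two Rama records, mode 2 groups nothing, and mode 3 groups both pairs. That spread is the reason the aggressive mode is only suitable for showing candidates, not for automatic merging.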

If you search here, you will find a couple of threads on duplicate finding that provide SQL searches to be run with the calibre-debug tool or with an SQL database browser like SQLiteSpy. I've found that most of my "duplicates" on title only are not really duplicates, just different books with similar or identical titles.
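For a feel of what a title-only SQL search looks like, here is a small sketch using Python's sqlite3 module against a toy in-memory table. The books table with id and title columns is an assumption made for this example; check the schema of your own library database before adapting the query, and work on a copy.

```python
import sqlite3

def title_only_duplicates(conn):
    """Return [(normalized_title, [book ids])] for titles that occur
    more than once, ignoring case and surrounding whitespace."""
    rows = conn.execute(
        """
        SELECT lower(trim(title)), group_concat(id)
        FROM books
        GROUP BY lower(trim(title))
        HAVING count(*) > 1
        ORDER BY lower(trim(title))
        """
    ).fetchall()
    return [(t, sorted(int(i) for i in ids.split(","))) for t, ids in rows]

# Toy in-memory table standing in for a real library (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (id, title) VALUES (?, ?)",
    [(1, "Selected Poems"), (2, "selected poems"),
     (3, "Dune"), (4, "Selected Poems ")],
)
dupes = title_only_duplicates(conn)
```

A title like "Selected Poems" shows the problem with title-only matching: all three records are reported together, yet they could easily be three different books by three different authors.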
