(This is a long, rambling, thinking-aloud post. Briefly: it's about cleaning up an existing Calibre library by finding "duplicate" records and displaying them so the user can merge them.)
I chose a thread title that's identical to an existing thread title because it's perfect for my question. The previous thread asked about duplicates and got several answers about duplicates on devices, but not much about duplicates in the main library.
I've been asked several times how to find "duplicate" book records in the main library, and there's a thread where I suggested using an SQL browser (I use SQLiteSpy) with a suggested SQL query. I also gave a Calibre-only command-line version:
Code:
calibre-debug -c "from calibre.library.database2 import LibraryDatabase2; db = LibraryDatabase2('/path/to/library/folder'); dupes = db.conn.get('select title from books group by title having count(*) > 1;'); print(dupes)" > dupes.txt
I also put together a find_duplicates.py file containing SQL code, which is executed with "calibre-debug -e find_duplicates.py".
All of these options produce a list of "duplicates" outside of Calibre. That's inconvenient, as you usually want to merge or delete them. A partial solution is to redirect the output into a text file with "calibre-debug -e find_duplicates.py >output.txt" and have the script emit a search expression that can be cut and pasted into Calibre's searchbar. Still, it's not perfect.
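To make the "emit a search expression" idea concrete, here is a minimal, self-contained sketch. It does not touch a real Calibre library: it builds a throwaway in-memory SQLite table called books with only id and title columns (an assumption for the example; the real database has far more), runs the same GROUP BY query as above, and joins the hits into one search string. The helper names are made up, and the title:"=..." exact-match form is what I'd expect the searchbar to accept, so test it before relying on it.

```python
import sqlite3


def find_duplicate_titles(conn):
    """Return titles that appear on more than one book record."""
    rows = conn.execute(
        "SELECT title FROM books GROUP BY title HAVING COUNT(*) > 1 ORDER BY title"
    )
    return [row[0] for row in rows]


def build_search(titles):
    """Join titles into one 'or' search expression (assumed searchbar syntax)."""
    # Titles containing double quotes would need escaping; ignored in this sketch.
    return " or ".join('title:"=%s"' % t for t in titles)


# Throwaway demo data, not a real Calibre library.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (title) VALUES (?)",
    [("Dune",), ("Emma",), ("Dune",), ("Ulysses",), ("Emma",)],
)

dupes = find_duplicate_titles(conn)
search = build_search(dupes)
print(search)
```

Pasting the printed line into the searchbar would then show every record sharing one of those titles, though all the groups still land in one undivided result list.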
I've thought a bit about how to do this inside Calibre. I was recently motivated by a bug/enhancement request from a user who wanted the merge command to provide a list of "duplicates" based on duplicate ISBNs. That raises the question: what is a "duplicate"? (You will note that I've been putting that word in quotes.)
Calibre already has multiple definitions of a duplicate:
The first definition is the one used by Kovid's original Add Book command. It considers a duplicate to be any book that has the same title.
Then there's the "duplicate" of my autosort/automerge code. It considers a duplicate to be any book that has the same author, and a title that's very similar (fuzzy matched with punctuation, capitalization, leading indefinite articles, etc. ignored).
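A rough sketch of that kind of fuzzy match, for discussion only. This is not the actual autosort/automerge code: the function names, the exact normalization rules and the demo records are all assumptions. It lowercases the title, strips punctuation, drops a leading "a" or "an", and then groups books whose author and normalized title agree.

```python
import re
from collections import defaultdict

# Only the indefinite articles are handled here; other words could be added.
LEADING_ARTICLES = ("a ", "an ")


def fuzzy_title(title):
    """Reduce a title to a comparison key."""
    key = title.lower()
    key = re.sub(r"[^\w\s]", "", key)       # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()  # collapse runs of whitespace
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return key


def group_by_author_and_title(books):
    """Map (author, fuzzy title) to book ids, keeping only keys with 2+ ids."""
    groups = defaultdict(list)
    for book_id, author, title in books:
        groups[(author.lower(), fuzzy_title(title))].append(book_id)
    return {k: ids for k, ids in groups.items() if len(ids) > 1}


# Made-up records: (id, author, title)
books = [
    (1, "Pat Example", "A Tale: Part One"),
    (2, "Pat Example", "tale, part one!"),
    (3, "Sam Sample", "Tale Part One"),
    (4, "Pat Example", "Another Tale"),
]
fuzzy_groups = group_by_author_and_title(books)
print(fuzzy_groups)
```

Note that book 3 is left out even though its title matches, because this definition also requires the same author. That is exactly the case where a title-only search would be more useful.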
Finally, there's Charles's code for duplicates on devices. I won't try to summarize it beyond saying it considers the UUID and the Calibre book ID as well as author/title matches.
One could come up with many other definitions of a "duplicate", including the ISBN duplicate of the enhancement request.
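One way to think about "multiple definitions" is as interchangeable key functions: two records are duplicates under a given definition if the function returns the same key for both. The sketch below is hypothetical (the record layout, the names and the obviously fake ISBNs are all invented for the example) and only shows that the same grouping routine can serve a title definition and an ISBN definition.

```python
from collections import defaultdict


def find_duplicates(records, key):
    """Group record ids by key(record); return only groups with 2+ ids.

    Records whose key is empty or None are skipped (e.g. books with no ISBN).
    """
    groups = defaultdict(list)
    for rec in records:
        k = key(rec)
        if k:
            groups[k].append(rec["id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}


def isbn_key(rec):
    """Normalize an ISBN by dropping hyphens and spaces; None if missing."""
    isbn = rec.get("isbn") or ""
    return isbn.replace("-", "").replace(" ", "").upper() or None


# Each "definition of a duplicate" is just a named key function.
DEFINITIONS = {
    "title": lambda rec: rec["title"],
    "isbn": isbn_key,
}

# Made-up records with placeholder ISBNs.
records = [
    {"id": 1, "title": "Book One", "isbn": "000-0-00-000001-0"},
    {"id": 2, "title": "Book 1 (reissue)", "isbn": "000 0 00 000001 0"},
    {"id": 3, "title": "Book One", "isbn": None},
    {"id": 4, "title": "Other Book", "isbn": "000-0-00-000002-0"},
]

by_title = find_duplicates(records, DEFINITIONS["title"])
by_isbn = find_duplicates(records, DEFINITIONS["isbn"])
print(by_title)
print(by_isbn)
```

Here the title definition pairs books 1 and 3, while the ISBN definition pairs books 1 and 2, so the two definitions find different "duplicates" in the same library.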
Each of the above is useful in different ways. Sometimes one wants to find identical titles (or fuzzy-matched titles) because the authors are not correct. Maybe only one of multiple authors is on each duplicate record. Maybe there's a typo, etc.
I've played with duplicate detection for a while. Even after you've decided what kind of "duplicates" you want to search for, you have a problem with displaying the results. If you use the searchbar, there's no good way to separate the duplicate groups you've found. If you exact-match on title or author, you can sort on that column, but fuzzy-matched titles/authors don't always sort together, and ISBNs don't display at all.
I've occasionally ended up with search results showing duplicate books where I'm not sure why the search thought they were duplicates. The first book might be a duplicate of the second and third, while the fourth was found because the search considered it to be a duplicate of the fifth, while the sixth through ninth were all duplicates of each other, etc. The breaks between groups are often hard to find.
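One possible way to make the breaks visible, sketched under my own assumptions (nothing like this exists in Calibre as far as this post is concerned): take the list of matching pairs that a search produced, merge them into groups with a small union-find, and print each group under its own separator. The function names and the output layout are invented for illustration.

```python
def group_matches(pairs):
    """Merge matching pairs of book ids into groups (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for book_id in list(parent):
        groups.setdefault(find(book_id), []).append(book_id)
    return sorted(sorted(ids) for ids in groups.values())


def format_groups(groups, titles):
    """Render each group under its own separator line."""
    lines = []
    for number, ids in enumerate(groups, start=1):
        lines.append("--- group %d ---" % number)
        for book_id in ids:
            lines.append("  [%d] %s" % (book_id, titles[book_id]))
    return "\n".join(lines)


# The situation described above: 1 matches 2 and 3, 4 matches 5, 6 to 9 match each other.
pairs = [(1, 2), (1, 3), (4, 5), (6, 7), (7, 8), (8, 9)]
titles = {i: "Title %d" % i for i in range(1, 10)}

groups = group_matches(pairs)
report = format_groups(groups, titles)
print(report)
```

The same grouping could drive a marker column or alternating row colours in the book list instead of a text report; the point is only that each record carries a group number.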
IOW:
Do you have any need for a duplicates search?
How do we define duplicates? Build your own search? List of predefined searches?
How does one display the "duplicates" (even if one finds them) in a way that makes it clear what's a duplicate of what?
Where would one put this search - in the Similar Books menu option? Under Advanced Search options?
@Charles - do you have any thoughts on whether an SQL search could be integrated into the search query system? (I'm not even sure if SQL will always be used within the library.)
Feel free to ramble back at me. I'm not convinced I'll write anything, but I'd be interested in the comments of others.