Calibre Duplicates

Starson17 · 10-18-2010, 03:11 PM

(This is a long, rambling, thinking-aloud post. Briefly: it's about cleaning up an existing Calibre library by finding "duplicate" records and displaying them so the user can merge them.)

I chose a thread title that's identical to an existing thread title because it's perfect for my question. The previous thread asked about duplicates and got several answers about duplicates on devices, but not much about duplicates in the main library.

I've been asked several times how to find "duplicate" book records in the main library, and there's a thread where I suggested using an SQL browser (I use SQLiteSpy) with a suggested SQL query. I also gave a Calibre-only SQL command line version:

Code:

calibre-debug -c "from calibre.library.database2 import LibraryDatabase2; db = LibraryDatabase2('/path/to/library/folder'); dupes = db.conn.get('select title from books group by title having count(*) > 1;'); print(dupes)" > dupes.txt

I also put together a find_duplicates.py file containing SQL code that's executed with "calibre-debug -e find_duplicates.py".
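As a rough illustration of the kind of query involved (not the actual find_duplicates.py), here is a minimal sketch using Python's standard sqlite3 module. The in-memory `books` table is a toy stand-in for the library database, and the function names are made up for the example.

```python
import sqlite3


def find_duplicate_titles(conn):
    """Return (title, count) rows for every title that occurs more than once."""
    query = (
        "SELECT title, COUNT(*) FROM books "
        "GROUP BY title HAVING COUNT(*) > 1 ORDER BY title"
    )
    return conn.execute(query).fetchall()


def make_demo_db():
    """Build a toy in-memory table (hypothetical data, not a real library)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
    titles = ["Dune", "Emma", "Dune", "Ulysses", "Emma", "Dune"]
    conn.executemany("INSERT INTO books (title) VALUES (?)",
                     [(t,) for t in titles])
    return conn


if __name__ == "__main__":
    for title, count in find_duplicate_titles(make_demo_db()):
        print(title, count)
```

Returning the count alongside the title makes it easier to see how large each duplicate group is before deciding what to merge.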

All of these options produce a list of "duplicates" outside of Calibre. That's inconvenient, as usually you want to merge or delete them. A partial solution is the "calibre-debug -e find_duplicates.py >output.txt" approach to echo output into a text file, and have the output create a search command that can be cut/pasted into Calibre's searchbar. Still, it's not perfect.
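A minimal sketch of the "output a search command" idea. It assumes that the search bar accepts exact-match terms of the form title:"=..." joined with "or", and that a backslash escapes a quote inside a term; both should be checked against the search syntax of the Calibre version in use. The function name is hypothetical.

```python
def build_title_search(titles):
    """Build one search expression that matches any of the given titles exactly."""
    terms = []
    for title in titles:
        # Assumption: a backslash escapes a double quote inside a search term.
        escaped = title.replace('"', '\\"')
        terms.append('title:"=%s"' % escaped)
    return " or ".join(terms)


if __name__ == "__main__":
    print(build_title_search(["Dune", "Emma"]))
```

The printed line can then be pasted into the search bar, although very long expressions may become unwieldy for large duplicate lists.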

I've thought a bit about how to do this inside Calibre. I was recently motivated by a bug/enhancement request from a user who wanted the merge command to provide a list of "duplicates" based on duplicate ISBNs. That raises the question: what is a "duplicate"? (You will note that I've been putting that word in quotes.)

Calibre already has multiple definitions of a duplicate:

The first definition is the one used by Kovid's original Add Books command. It considers a duplicate to be any book that has the same title.

Then there's the "duplicate" of my autosort/automerge code. It considers a duplicate to be any book that has the same author, and a title that's very similar (fuzzy matched with punctuation, capitalization, leading indefinite articles, etc. ignored).
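To make the fuzzy-title idea concrete, here is a minimal sketch of one way such a comparison could work. It is not the autosort/automerge code itself; the article list and the dictionary keys (`authors`, `title`) are assumptions made for the example.

```python
import re

# Assumption: which leading articles to ignore; the real list may differ.
LEADING_ARTICLES = ("the ", "a ", "an ")


def fuzzy_title(title):
    """Reduce a title to a comparison key: lowercase, no punctuation, no leading article."""
    key = title.lower()
    key = re.sub(r"[^\w\s]", "", key)        # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()   # collapse runs of whitespace
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return key


def is_duplicate(book_a, book_b):
    """Same author string and fuzzy-equal title (hypothetical record layout)."""
    return (book_a["authors"] == book_b["authors"]
            and fuzzy_title(book_a["title"]) == fuzzy_title(book_b["title"]))
```

With this key, "The Hobbit!" and "hobbit" compare equal, while titles that differ by a typo still do not.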

Finally, there's Charles's code for duplicates on devices. I won't try to summarize it beyond saying it considers the UUID and the Calibre book ID as well as author/title matches.

One could come up with multiple other definitions of a "duplicate", including the ISBN duplicate of the enhancement request.

Each of the above is useful in different ways. Sometimes one wants to find identical titles (or fuzzy-matched titles) because the authors are not correct. Maybe only one of multiple authors is on each duplicate record. Maybe there's a typo, etc.

I've played with duplicate detection for a while. Even after you've decided what kind of "duplicates" you want to search for, you have a problem with displaying the results. If you use the searchbar, there's no good way to separate out the duplicate groups you've found. If you exact-match title or author, you can sort on that, but fuzzy-matched titles/authors don't always sort together and ISBNs don't display at all.

I've occasionally ended up with search results showing duplicate books where I'm not sure why the search thought they were duplicates. The first book might be a duplicate of the second and third, while the fourth was found because the search considered it to be a duplicate of the fifth, while the sixth through ninth were all duplicates of each other, etc. The breaks are often hard to find.
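One way to make the breaks visible is to carry an explicit group number with every result. A minimal sketch, assuming each book is a dictionary with `id` and `title` keys and that the caller supplies whatever matching key it likes; all names are hypothetical.

```python
from collections import defaultdict


def group_duplicates(books, key):
    """Group books by key(book) and keep only groups with more than one member."""
    groups = defaultdict(list)
    for book in books:
        groups[key(book)].append(book)
    return [group for group in groups.values() if len(group) > 1]


def format_groups(groups):
    """Label each book with its group number so group boundaries are explicit."""
    lines = []
    for number, group in enumerate(groups, start=1):
        for book in group:
            lines.append("group %d: [%d] %s" % (number, book["id"], book["title"]))
    return lines
```

Sorting the display on the group number would then keep fuzzy matches together even when their titles do not sort next to each other.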

IOW:

Do you have any need for a duplicates search?

How do we define duplicates? Build your own search? List of predefined searches?

How does one display the "duplicates" (even if one finds them) in a way that makes it clear what's a duplicate of what?

Where would one put this search - in the Similar Books menu option? Under Advanced Search options?

@Charles - do you have any thoughts on whether an SQL search could be integrated into the search query system? (I'm not even sure if SQL will always be used within the library.)

Feel free to ramble back at me. I'm not convinced I'll write anything, but I'd be interested in the comments of others.

chaley · 10-18-2010, 03:58 PM

Quote:

Originally Posted by Starson17

@Charles - do you have any thoughts on whether an SQL search could be integrated into the search query system? (I'm not even sure if SQL will always be used within the library.)

This is an interesting question.

First, your "I'm not even sure" is valid. However, it would be close to impossible to implement current calibre functionality without a usable query language, so the question really becomes 'Do you have any thoughts on providing support for semi-arbitrary queries?'

The easy area is selection: field relop value, etc. That is so close to what we do today that I am not worried.

More interesting is the implicit use of functions. Your 'count' query is a good example, and it immediately leads us to the notion of functions (count, average, exists, etc.). Any reasonable query system must support these kinds of questions, but perhaps with different syntax.

The third level would be the unraveling of joins, permitting selection by individual multiple items. For example, inside calibre, there are selections that look for a book referencing a particular author, where the query checks against the n-ary connection tables. The same query can be expressed against the book view where the authors are joined together, but there is the possibility of both substring errors and performance problems.
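A minimal sketch of the two ways of asking "which books reference this author", run against a toy in-memory database. The table and column names (`books`, `authors`, `books_authors_link`) are assumptions modelled on a typical link-table layout, and the data is invented to show the substring problem.

```python
import sqlite3


def make_demo_db():
    """Toy schema with a link table between books and authors (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE books_authors_link (book INTEGER, author INTEGER);
        INSERT INTO books VALUES (1, 'First'), (2, 'Second');
        INSERT INTO authors VALUES (1, 'Ann Lee'), (2, 'Joann Lee');
        INSERT INTO books_authors_link VALUES (1, 1), (2, 2);
    """)
    return conn


def books_by_author_link(conn, name):
    """Exact match on the individual author rows via the link table."""
    rows = conn.execute(
        "SELECT b.id FROM books b "
        "JOIN books_authors_link l ON l.book = b.id "
        "JOIN authors a ON a.id = l.author "
        "WHERE a.name = ? ORDER BY b.id", (name,))
    return [row[0] for row in rows]


def books_by_author_substring(conn, name):
    """Substring match on the authors joined into one string per book."""
    rows = conn.execute(
        "SELECT b.id FROM books b "
        "JOIN books_authors_link l ON l.book = b.id "
        "JOIN authors a ON a.id = l.author "
        "GROUP BY b.id "
        "HAVING group_concat(a.name, ' & ') LIKE ? ORDER BY b.id",
        ("%" + name + "%",))
    return [row[0] for row in rows]
```

Searching for 'Ann Lee' through the link table finds only book 1, while the substring version also returns book 2 because 'Joann Lee' contains the search text.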

Final level: testing for null using left or outer joins. These are hard. A good example would be a query that finds books with no formats (left join with NULL as the result).
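The "books with no formats" case can be sketched as a left join that keeps only the rows where nothing matched. The `data` table with a `book` column is an assumption about how formats are stored, and the demo rows are invented.

```python
import sqlite3


def books_without_formats(conn):
    """Return (id, title) for books that have no row in the formats table."""
    rows = conn.execute(
        "SELECT b.id, b.title FROM books b "
        "LEFT JOIN data d ON d.book = b.id "
        "WHERE d.book IS NULL ORDER BY b.id")
    return rows.fetchall()


def make_demo_db():
    """Toy schema: 'data' holds one row per format file (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE data (book INTEGER, format TEXT);
        INSERT INTO books VALUES (1, 'Dune'), (2, 'Emma'), (3, 'Ulysses');
        INSERT INTO data VALUES (1, 'EPUB'), (1, 'PDF'), (3, 'LIT');
    """)
    return conn
```

The difficulty for a general query language is that the condition is about the absence of a related row, which a simple "field relop value" form cannot express.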

Bottom line: if an arbitrary query language can be defined in terms of a set of well-understood functions, fields, and relops, then I can see building it. This is especially true if the current search capabilities are expressible in the 'language'. This is even more true if the calibre functions like 'set_author' are expressible, but this may be silly. Going beyond that to, for example, raw SQL will require Kovid to make a commitment that I suspect he is unwilling to make.

kovidgoyal · 10-18-2010, 04:00 PM

Raw SQL is not going to happen. However, one of my side projects is providing a python console in calibre that should allow you to execute SQL queries and manipulate the results very easily.

Evilwarning · 10-19-2010, 04:40 PM

Hi

I filed the request based on the ISBN because of my scenario: I have some books in PDF format and others in LIT format. I have found that records can differ in the title and author fields but share the same ISBN, most probably from the add process in the first place. Since I have a large library, looking for duplicates and merging or deleting them is very time-consuming. It would be good to have a search where the system identifies records with the same ISBN and allows them to be merged in bulk, or the duplicates simply removed. This happens when we load different formats at different times.

I have also had the case where, after loading documents, I have lots of books (PDF articles) with an Unknown author and a common title, so a merge by title and author would merge different books.

The ideal solution would be a window with a table where the user could see the list of duplicated items, grouped, with the possibility of selecting a bulk option to remove or merge the items, or of selecting case by case. If we could choose the fields used to create the rule for the query, like search by ISBN or by a combination of Author/Title/Series, that would be even better.
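A minimal sketch of a user-chosen duplicate rule, assuming each record is a dictionary and the rule is just a list of field names. Records with an empty or "Unknown" value in any chosen field are skipped, which avoids grouping unrelated Unknown-author articles; the field names and the ignore list are assumptions for the example.

```python
from collections import defaultdict


def duplicates_by(books, fields, ignore=("", None, "Unknown")):
    """Map each shared key (tuple of field values) to the ids of the books sharing it."""
    groups = defaultdict(list)
    for book in books:
        key = tuple(book.get(field) for field in fields)
        if any(value in ignore for value in key):
            continue  # an incomplete key is not evidence of a duplicate
        groups[key].append(book["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

For example, `duplicates_by(books, ["isbn"])` groups by ISBN only, while `duplicates_by(books, ["author", "title", "series"])` applies the combined rule.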

I know that this may be quite a lot of work and will only be of great use for people that have large libraries and have different document formats.

Thank you for all your effort, Calibre is a great tool.

speakingtohe · 10-19-2010, 09:53 PM

For me a duplicate would be same title+same author but different folder number, ignoring file extensions and The and A at the beginning of the title.

I have actually seen an ISBN with two different books.
The case(s) were when adding a book with a cover and ISBN but no tags.
Looking up the ISBN in Edit Metadata returned a totally different book, but the cover of the book I was adding had that ISBN printed on it.

Any format for the result would be good. CSV copiable to clipboard seems kind of familiar.
An indicator such as an * at the end of each book to determine whether it is an exact match or fuzzy might be of some help as well.

I have been meaning to do something like this in Excel with a catalogue but haven't got around to it yet.

theducks · 10-19-2010, 10:53 PM

Quote:

Originally Posted by speakingtohe

For me a duplicate would be same title+same author but different folder number, ignoring file extensions and The and A at the beginning of the title.

I have actually seen an ISBN with two different books.
The case(s) were when adding a book with a cover and ISBN but no tags.
Looking up the ISBN in Edit Metadata returned a totally different book, but the cover of the book I was adding had that ISBN printed on it.

Any format for the result would be good. CSV copiable to clipboard seems kind of familiar.
An indicator such as an * at the end of each book to determine whether it is an exact match or fuzzy might be of some help as well.

I have been meaning to do something like this in Excel with a catalogue but haven't got around to it yet.

That is not supposed to happen (but can, through a printing error).
A 10-digit ISBN is made up of the language group, then the publisher, then the book number, then a Mod 11 check digit.
The total length is always 10 characters (the check digit can be "X").
The language group is 1 digit for English (some other groups use more digits).
The check digit is always 1 character.
The publisher can be from 1 to 7 digits.
The book number is what's left over (a 7-digit publisher has a single digit for books, i.e. a really small press).

So, was your "duplicate" ISBN from the same publisher?
(It is not uncommon to see EAN numbers that have been "made up" rather than pay the fee for a manufacturer's number group.)
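The Mod 11 check digit mentioned above can be computed as follows. This is a minimal sketch of the standard ISBN-10 rule (weights 10 down to 2 on the first nine digits); the function names are made up for the example.

```python
def isbn10_check_digit(first_nine):
    """Return the check character ('0'-'9' or 'X') for the first nine ISBN-10 digits."""
    if len(first_nine) != 9 or any(c not in "0123456789" for c in first_nine):
        raise ValueError("expected exactly nine digits")
    # Weights run from 10 (first digit) down to 2 (ninth digit).
    total = sum((10 - i) * int(c) for i, c in enumerate(first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_isbn10(isbn):
    """Validate an ISBN-10, ignoring hyphens and spaces."""
    compact = isbn.replace("-", "").replace(" ", "").upper()
    if len(compact) != 10:
        return False
    try:
        return isbn10_check_digit(compact[:9]) == compact[9]
    except ValueError:
        return False
```

A check like this catches most typing and printing errors, but it cannot tell whether a valid number was assigned to the wrong book.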

speakingtohe · 10-20-2010, 12:58 AM

Quote:

So, was your "duplicate" ISBN from the same publisher?
(It is not uncommon to see EAN numbers that have been "made up" rather than pay the fee for a manufacturer's number group.)

No idea really, but I suspect not, because the books did not have any similarities that I could see. I remember one case where the original was a fiction book and the result returned was self-help, because I tried it again several times. It seems that I even removed the ISBN and got the same result, so there may be some database cross-linking somewhere. Pretty damn sure that I looked up the offending title on Fantasticfiction, clicked on Amazon, plugged in the Amazon number and got the same bizarre result. Probably a typo on someone's part somewhere.
Helen

Starson17 · 10-20-2010, 09:10 AM

Quote:

Originally Posted by speakingtohe

For me a duplicate would be same title+same author but different folder number, ignoring file extensions and The and A at the beginning of the title.

That's what I consider to be a duplicate in the autosort/automerge code. (Titles also ignore any capitalization and punctuation). The Copy to Library code was modified to use the same logic, so it's possible to copy your entire library to a new library and get "duplicates" (according to that definition) automatically merged during that copy operation by turning on autosort/automerge in Add Books.

When I get a chance, I'll post some simple duplicate_finder.py python code files here that can be run with "calibre-debug -e duplicate_finder.py"

theducks · 10-20-2010, 10:07 AM

Quote:

Originally Posted by speakingtohe

No idea really, but I suspect not, because the books did not have any similarities that I could see. I remember one case where the original was a fiction book and the result returned was self-help, because I tried it again several times. It seems that I even removed the ISBN and got the same result, so there may be some database cross-linking somewhere. Pretty damn sure that I looked up the offending title on Fantasticfiction, clicked on Amazon, plugged in the Amazon number and got the same bizarre result. Probably a typo on someone's part somewhere.
Helen

Probably a DB error

There does not seem to be any "Normalization" of data. Author names appear with spaces between initials, and sometimes not.

chaley · 10-20-2010, 10:54 AM

Quote:

Originally Posted by theducks

There does not seem to be any "Normalization" of data. Author names appear with spaces between initials, and sometimes not.

Authors, and indeed most items, are normalized in the DB sense. Each exists once in some table.

However, spelling variations are not automatically corrected. For example, outside of merge processing, calibre does not consider any of "Lawrence, D H", "Lawrence, DH", "Lawrence, D.H.", "D H Lawrence", or "Lawrence, D" to be the same author. The merge code may detect some of these because it strips punctuation before doing the compare.
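To show why stripping punctuation catches some of these variants but not all, here is a minimal sketch of a comparison key that lowercases the name and removes punctuation and spaces. It is an illustration only, not the merge code, and `author_key` is a made-up name.

```python
import re


def author_key(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[\W_]+", "", name.lower())


if __name__ == "__main__":
    variants = ["Lawrence, D H", "Lawrence, DH", "Lawrence, D.H.",
                "D H Lawrence", "Lawrence, D"]
    for name in variants:
        print("%-16s -> %s" % (name, author_key(name)))
```

The first three variants collapse to the same key, while "D H Lawrence" (different word order) and "Lawrence, D" (missing initial) still differ, so a key like this reduces the problem without solving it.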

Starson17 · 10-20-2010, 12:06 PM

Quote:

Originally Posted by chaley

Authors, and indeed most items, are normalized in the DB sense. Each exists once in some table.

However, spelling variations are not automatically corrected. For example, outside of merge processing, calibre does not consider any of "Lawrence, D H", "Lawrence, DH", "Lawrence, D.H.", "D H Lawrence", or "Lawrence, D" to be the same author. The merge code may detect some of these because it strips punctuation before doing the compare.

I'm not sure if the "DB" in question is the Calibre DB or one of the online metadata fetching source DBs (Last time I tried to count them there were 6 of them).

With respect to Calibre, the autosort/automerge code used when Adding Books does strip punctuation from titles before doing compares to find identical books already in the library, but it doesn't use that "normalized" title for anything other than the compare to existing book titles, and it doesn't strip punctuation from authors, only titles.

The Merge code that merges existing records (as compared to autosort/automerge) doesn't do anything to the author or title. The author/title of the first selected book is always kept and the others dropped, except when they're "Unknown," in which case the first not-Unknown author or title encountered in the merge selection list is used, if there are any.
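The rule described above for choosing the surviving author or title can be sketched as a small function. This is an illustration of the described behaviour, not the Merge code itself; the function name and the list-of-values input are assumptions.

```python
def merged_field(values, unknown="Unknown"):
    """Pick the surviving value: the first one, unless it is 'Unknown'.

    `values` holds the field (author or title) of each selected book,
    in selection order, and must not be empty.
    """
    first = values[0]
    if first != unknown:
        return first
    # First book is Unknown: fall back to the first real value, if any.
    for value in values[1:]:
        if value != unknown:
            return value
    return first
```

So selecting an "Unknown" record first does not throw away a real author or title from a later record in the selection.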

BTW, the reason I think of the automerge option as also being "autosort" is that when a large block of files is added, the code separates the incoming files into three distinct groups - files that are merged into an existing book record (that didn't have a matching format), files that are added as a new book record (no match on author/title) and files that are not added (author/title/format all matched so the file could not be merged to the matching book record).
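The three-way split can be sketched as follows. The library is modelled as a dictionary from (author, title) to a set of formats, and the matching is plain equality rather than the fuzzy title compare; both are simplifying assumptions, and the names are hypothetical.

```python
def classify_incoming(library, incoming):
    """Split incoming (author, title, format) files into merged, added and skipped.

    `library` maps (author, title) to the set of formats already present;
    it is copied, so the caller's dictionary is left unchanged.
    """
    known = {key: set(formats) for key, formats in library.items()}
    merged, added, skipped = [], [], []
    for author, title, fmt in incoming:
        key = (author, title)
        if key not in known:
            known[key] = {fmt}          # no author/title match: new record
            added.append((author, title, fmt))
        elif fmt in known[key]:
            skipped.append((author, title, fmt))   # same format already there
        else:
            known[key].add(fmt)         # new format for an existing record
            merged.append((author, title, fmt))
    return merged, added, skipped
```

Reporting the skipped list back to the user matters most, since those are the files that were not imported at all.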



