10-18-2010, 03:11 PM | #1 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Calibre Duplicates
(This is a long rambling thinking aloud post .... briefly - it's about cleaning up an existing Calibre library by finding "duplicate" records and displaying them so the user can merge them together.)
I chose a thread title that's identical to an existing thread title because it's perfect for my question. The previous thread asked about duplicates and got several answers about duplicates on devices, but not much about duplicates in the main library. I've been asked several times how to find "duplicate" book records in the main library, and there's a thread where I suggested using an SQL browser (I use SQLiteSpy) with a suggested SQL query. I also gave a Calibre-only SQL command line version: Code:
calibre-debug -c "from calibre.library.database2 import LibraryDatabase2; db = LibraryDatabase2('/path/to/library/folder');dupes = db.conn.get('select title from books group by title having count(*) > 1;');print dupes;">dupes.txt
All of these options produce a list of "duplicates" outside of Calibre. That's inconvenient, as you usually want to merge or delete them. A partial solution is the "calibre-debug -e find_duplicates.py >output.txt" approach: echo the output into a text file, and have that output form a search command that can be cut and pasted into Calibre's search bar. Still, it's not perfect.
I've thought a bit about how to do this inside Calibre. I was recently motivated by a bug/enhancement request from a user who wanted the merge command to provide a list of "duplicates" based on duplicate ISBNs. That raises the question: what is a "duplicate"? (You will note that I've been putting that word in quotes.)
Calibre already has multiple definitions of a duplicate:
The first definition is the one used by Kovid's original Add Book command. It considers a duplicate to be any book that has the same title.
Then there's the "duplicate" of my autosort/automerge code. It considers a duplicate to be any book that has the same author and a title that's very similar (fuzzy matched, with punctuation, capitalization, leading indefinite articles, etc. ignored).
Finally, there's Charles's code for duplicates on devices. I won't try to summarize it beyond saying it considers the UUID and the Calibre book ID as well as author/title matches.
One could come up with multiple other definitions of a "duplicate", including the ISBN duplicate of the enhancement request. Each of the above is useful in different ways. Sometimes one wants to find identical titles (or fuzzy-matched titles) because the authors are not correct. Maybe only one of multiple authors is on each duplicate record. Maybe there's a typo, etc. I've played with duplicate detection for a while.
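To make the difference between these definitions concrete, here is a minimal Python sketch that groups the same records under an exact-title rule and under a same-author, fuzzy-title rule. It is only an illustration: the record layout, the sample data and the helper names (`fuzzy_title`, `group_by`) are assumptions made for this example and are not calibre's actual code, and treating "the" as a strippable article is my own choice.

```python
import re
import string
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative, not calibre's schema.
BOOKS = [
    {"id": 1, "author": "Charles Dickens", "title": "A Tale of Two Cities"},
    {"id": 2, "author": "Charles Dickens", "title": "Tale of Two Cities!"},
    {"id": 3, "author": "Unknown", "title": "A Tale of Two Cities"},
    {"id": 4, "author": "Jane Austen", "title": "Emma"},
]

_LEADING_ARTICLE = re.compile(r"^(a|an|the)\s+", re.IGNORECASE)
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def fuzzy_title(title):
    """Drop a leading article, strip punctuation, lowercase, collapse spaces."""
    stripped = _LEADING_ARTICLE.sub("", title.strip())
    stripped = stripped.translate(_PUNCTUATION).lower()
    return " ".join(stripped.split())


def group_by(books, key):
    """Return lists of book ids that share the same key (groups of 2 or more)."""
    groups = defaultdict(list)
    for book in books:
        groups[key(book)].append(book["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def exact_title(book):
    # "Same title" rule: the raw title string must match exactly.
    return book["title"]


def author_and_fuzzy_title(book):
    # "Same author, similar title" rule, using the normalized title.
    return (book["author"].lower(), fuzzy_title(book["title"]))


by_title = group_by(BOOKS, exact_title)
by_author_fuzzy = group_by(BOOKS, author_and_fuzzy_title)
print("exact title:", by_title)
print("author + fuzzy title:", by_author_fuzzy)
```

The two rules disagree on the sample data: the exact-title rule pairs the record with the "Unknown" author, while the author-plus-fuzzy-title rule pairs the record with the stray punctuation. That is the reason more than one definition is useful.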
Even after you've decided what kind of "duplicates" you want to search for, you have a problem with displaying the results. If you use the search bar, there's no good way to separate out the duplicate groups you've found. If you exact-match on title or author, you can sort on that, but fuzzy-matched titles/authors don't always sort together, and ISBNs don't display at all. I've occasionally ended up with search results showing duplicate books where I'm not sure why the search thought they were duplicates. The first book might be a duplicate of the second and third, while the fourth was found because the search considered it to be a duplicate of the fifth, while the sixth through ninth were all duplicates of each other, etc. The breaks are often hard to find.
IOW:
Do you have any need for a duplicates search?
How do we define duplicates? Build your own search? List of predefined searches?
How does one display the "duplicates" (even if one finds them) in a way that makes it clear what's a duplicate of what?
Where would one put this search - in the Similar Books menu option? Under Advanced Search options?
@Charles - do you have any thoughts on whether an SQL search could be integrated into the search query system? (I'm not even sure if SQL will always be used within the library.)
Feel free to ramble back at me. I'm not convinced I'll write anything, but I'd be interested in the comments of others.
Last edited by Starson17; 10-18-2010 at 03:13 PM.
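One way to picture the display problem raised in this post is to return each set of duplicates as its own labelled group, so the breaks are explicit. The sketch below uses Python's standard `sqlite3` module against a throwaway in-memory table. The `books` table here is a toy with only `id` and `title` columns, and `duplicate_groups` is a hypothetical helper, so treat it as an assumption-laden illustration and not as a query against a real calibre library.

```python
import sqlite3

# Toy in-memory table; a real library database has a different, richer schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (id, title) VALUES (?, ?)",
    [(1, "Dune"), (2, "Emma"), (3, "Dune"), (4, "Emma"), (5, "Emma"), (6, "Ulysses")],
)


def duplicate_groups(conn):
    """Return {title: [book ids]} for every title that occurs more than once."""
    titles = conn.execute(
        "SELECT title FROM books GROUP BY title HAVING count(*) > 1 ORDER BY title"
    ).fetchall()
    groups = {}
    for (title,) in titles:
        rows = conn.execute(
            "SELECT id FROM books WHERE title = ? ORDER BY id", (title,)
        )
        groups[title] = [row[0] for row in rows]
    return groups


groups = duplicate_groups(conn)
for number, (title, ids) in enumerate(groups.items(), start=1):
    # One labelled line per group makes it clear what is a duplicate of what.
    print(f"Group {number}: {title!r} -> book ids {ids}")
```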
10-18-2010, 03:58 PM | #2 | |
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
First, your "I'm not even sure" is valid. However, it would be close to impossible to implement current calibre functionality without a usable query language, so the question really becomes 'Do you have any thoughts on providing support for semi-arbitrary queries?'
The easy area is selection: field relop value, etc. That is so close to what we do today that I am not worried.
More interesting is the implicit use of functions. Your 'count' example is a good one, and it immediately leads us to the notion of functions (count, average, exists, etc). Any reasonable query system must support these kinds of questions, but perhaps with different syntax.
The third level would be the unraveling of joins, permitting selection by individual multiple items. For example, inside calibre, there are selections that look for a book referencing a particular author, where the query checks against the n-ary connection tables. The same query can be expressed against the book view where the authors are joined together, but there is the possibility of both substring errors and performance problems.
Final level: testing for null using left or outer joins. These are hard. A good example would be a query that finds books with no formats (left join with NULL as the result).
Bottom line: if an arbitrary query language can be defined in terms of a set of well-understood functions, fields, and relops, then I can see building it. This is especially true if the current search capabilities are expressible in the 'language'. This is even more true if the calibre functions like 'set_author' are expressible, but this may be silly. Going beyond that to, for example, raw SQL will require Kovid to make a commitment that I suspect he is unwilling to make.
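For readers less familiar with the last two levels, here is a small sketch of an aggregate function and of the "books with no formats" left join, again using Python's `sqlite3` with toy tables. The `books` and `formats` tables and their columns are assumptions invented for this example. They are not calibre's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-table layout: one row per book, one row per stored format.
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE formats (book INTEGER, format TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [(1, "Dune"), (2, "Emma"), (3, "Ulysses")],
)
conn.executemany(
    "INSERT INTO formats VALUES (?, ?)",
    [(1, "EPUB"), (1, "PDF"), (3, "LIT")],
)

# Aggregate level: count the formats attached to each book.
format_counts = conn.execute(
    """
    SELECT b.id, count(f.format)
    FROM books b LEFT JOIN formats f ON f.book = b.id
    GROUP BY b.id ORDER BY b.id
    """
).fetchall()

# Null-test level: a left join keeps books with no matching format row,
# and those rows come back with NULL in the formats columns.
no_formats = conn.execute(
    """
    SELECT b.id, b.title
    FROM books b LEFT JOIN formats f ON f.book = b.id
    WHERE f.book IS NULL
    """
).fetchall()

print(format_counts)
print(no_formats)
```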
|
10-18-2010, 04:00 PM | #3 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Raw SQL is not going to happen. However, one of my side projects is providing a python console in calibre that should allow you to execute SQL queries and manipulate the results very easily.
|
10-19-2010, 04:40 PM | #4 |
Member
Posts: 13
Karma: 10
Join Date: Oct 2010
Device: iPad
|
Hi
I submitted the request based on the ISBN because in my scenario I have some books in PDF format and others in LIT format. I have found that records can differ in the title and author fields but share the same ISBN, most probably because of how they were added in the first place. Since I have a large library, looking for duplicates and merging or deleting them is very time consuming. It would be good to have a way of doing a search where the system identifies identical ISBNs and allows merging the items in bulk, or just removing the duplicates. This happens when we load different formats at different times.
When loading documents, I have also had the case of lots of books (PDF articles) with Unknown author and a common title, where a merge by title and author would merge different books.
The ideal solution would be a window with a table where the user could see the list of duplicated items, grouped, and could either select a bulk option to remove or merge the items or handle them case by case. If we could choose the fields used to create the rule for the query, like searching by ISBN or by an Author/Title/Series combination, that would be even better.
I know that this may be quite a lot of work and will only be of great use to people who have large libraries and different document formats. Thank you for all your effort, Calibre is a great tool.
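The "choose the fields for the rule" idea can be sketched in a few lines of Python. Everything here is hypothetical: the record layout, the made-up sample values and the `find_duplicates` helper are assumptions for illustration only, not an existing calibre feature.

```python
from collections import defaultdict

# Made-up sample records; ISBN values are placeholders, not real book data.
RECORDS = [
    {"id": 10, "title": "Sample Book", "author": "A. Writer", "isbn": "ISBN-1", "format": "PDF"},
    {"id": 11, "title": "Sample Book (2nd ed)", "author": "Writer, A.", "isbn": "ISBN-1", "format": "LIT"},
    {"id": 12, "title": "Article", "author": "Unknown", "isbn": "", "format": "PDF"},
    {"id": 13, "title": "Article", "author": "Unknown", "isbn": "", "format": "PDF"},
]


def find_duplicates(records, fields):
    """Group record ids whose values match on every field in `fields`.

    Records with an empty value in any chosen field are skipped, so that
    missing ISBNs are not all treated as duplicates of each other.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in fields)
        if all(key):
            groups[key].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


by_isbn = find_duplicates(RECORDS, ["isbn"])
by_author_title = find_duplicates(RECORDS, ["author", "title"])
print("by ISBN:", by_isbn)
print("by author/title:", by_author_title)
```

On this sample the ISBN rule finds the PDF/LIT pair whose titles and authors differ, while the author/title rule groups the two "Unknown" articles, which may well be different documents. That is the risk described above, and a reason to show groups for review instead of merging automatically.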
10-19-2010, 09:53 PM | #5 |
Wizard
Posts: 4,812
Karma: 26912940
Join Date: Apr 2010
Device: sony PRS-T1 and T3, Kobo Mini and Aura HD, Tablet
|
For me, a duplicate would be same title + same author but a different folder number, ignoring file extensions and "The" and "A" at the beginning of the title.
I have actually seen one ISBN attached to two different books. It happened when adding a book that had a cover and ISBN but no tags: looking up the ISBN from Edit Metadata returned a totally different book, even though the cover of the book I was adding had that ISBN printed on it. Any format for the result would be good. CSV copiable to the clipboard seems kind of familiar. An indicator such as an * at the end of each book, to show whether it is an exact match or a fuzzy one, might be of some help as well. I have been meaning to do something like this in Excel with a catalogue but haven't got around to it yet.
10-19-2010, 10:53 PM | #6 | |
Well trained by Cats
Posts: 29,800
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
The first digit of an ISBN is the language, then the publisher, then the book number, then a Mod 11 check digit. The total length is always 10 characters (the check digit can be "X"). The language is 1 digit (for English; some other language groups use more) and the check digit is always 1 character. The publisher can be from 1 to 7 digits, and the book number is whatever is left over (a 7-digit publisher has a single digit for books, i.e. a really small press). So, was your "duplicate" ISBN from the same publisher? (It is not uncommon to see EAN numbers that have been "made up" rather than paying the fee for a manufacturer's number group.)
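The Mod 11 check mentioned here is easy to test in code, which can help spot a mistyped or made-up 10-character ISBN before using it as a duplicate key. A minimal sketch follows. The function name `is_valid_isbn10` is just a label chosen for this example, and it covers only the 10-character form described in this post.

```python
def is_valid_isbn10(isbn):
    """Return True if `isbn` passes the ISBN-10 Mod 11 check.

    Each character is weighted 10, 9, ..., 1 from left to right and the
    weighted sum must be divisible by 11. "X" stands for 10 and is only
    allowed as the final (check) character.
    """
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


print(is_valid_isbn10("0-306-40615-2"))  # passes the check
print(is_valid_isbn10("0-306-40615-3"))  # wrong check digit
```

Passing the check only shows the number is well formed. It does not show that the number was issued to the book it is printed on.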
|
10-20-2010, 12:58 AM | #7 | |
Wizard
Posts: 4,812
Karma: 26912940
Join Date: Apr 2010
Device: sony PRS-T1 and T3, Kobo Mini and Aura HD, Tablet
|
Quote:
Helen |
|
10-20-2010, 09:10 AM | #8 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
When I get a chance, I'll post some simple duplicate_finder.py python code files here that can be run with "calibre-debug -e duplicate_finder.py" |
|
10-20-2010, 10:07 AM | #9 | |
Well trained by Cats
Posts: 29,800
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
There does not seem to be any "normalization" of data. Author names sometimes appear with spaces between the initials and sometimes without.
|
10-20-2010, 10:54 AM | #10 | |
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
However, spelling variations are not automatically corrected. For example, outside of merge processing, calibre does not consider any of "Lawrence, D H", "Lawrence, DH", "Lawrence, D.H.", "D H Lawrence", or "Lawrence, D" to be the same author. The merge code may detect some of these because it strips punctuation before doing the compare.
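To show why stripping characters before comparing catches only some of these variants, here is a small Python sketch. It is an illustration of the general idea only: `author_key` is a hypothetical helper, and the choice to strip whitespace as well as punctuation is an assumption made here, not a description of what calibre's merge code does.

```python
import string

# Assumption for this sketch: remove punctuation AND whitespace, then lowercase.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)


def author_key(name):
    """Build a crude comparison key from an author string."""
    return name.translate(_STRIP).lower()


names = ["Lawrence, D H", "Lawrence, DH", "Lawrence, D.H.", "D H Lawrence", "Lawrence, D"]
keys = {name: author_key(name) for name in names}
for name, key in keys.items():
    print(f"{name!r:20} -> {key}")
```

The first three spellings collapse to the same key, but the reordered name and the name with a missing initial still come out different, so a stripped comparison narrows the problem without solving it.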
|
10-20-2010, 12:06 PM | #11 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
With respect to Calibre, the autosort/automerge code used when Adding Books does strip punctuation from titles before doing compares to find identical books already in the library, but it doesn't use that "normalized" title for anything other than the compare to existing book titles, and it doesn't strip punctuation from authors, only titles.
The Merge code that merges existing records (as compared to autosort/automerge) doesn't do anything to the author or title. The author/title of the first selected book is always kept and the others dropped, except when they're "Unknown," in which case the first not-Unknown author or title encountered in the merge selection list is used, if there are any.
BTW, the reason I think of the automerge option as also being "autosort" is that when a large block of files is added, the code separates the incoming files into three distinct groups - files that are merged into an existing book record (that didn't have a matching format), files that are added as a new book record (no match on author/title) and files that are not added (author/title/format all matched so the file could not be merged to the matching book record).
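The two rules described in this post, the three-way split on adding and the "first not-Unknown value wins" rule on merging, can be sketched roughly as follows. This is a simplified illustration and not calibre's code: `classify_incoming` and `merged_field` are hypothetical names, the library is modelled as a plain dictionary, and the author/title match is exact here, without the punctuation stripping the real compare applies to titles.

```python
def classify_incoming(library, incoming):
    """Split incoming files into (merged, added, skipped).

    library:  dict mapping (author, title) -> set of formats already stored.
    incoming: list of (author, title, format) tuples.
    The library dict is updated in place as files are merged or added.
    """
    merged, added, skipped = [], [], []
    for author, title, fmt in incoming:
        key = (author, title)
        if key not in library:
            library[key] = {fmt}          # no author/title match: new record
            added.append((author, title, fmt))
        elif fmt not in library[key]:
            library[key].add(fmt)         # match, but format is new: merge in
            merged.append((author, title, fmt))
        else:
            skipped.append((author, title, fmt))  # author/title/format all match
    return merged, added, skipped


def merged_field(values, unknown="Unknown"):
    """Pick the author or title to keep when merging selected records.

    The first selected value is kept unless it is "Unknown", in which case
    the first non-Unknown value in selection order is used, if there is one.
    """
    for value in values:
        if value != unknown:
            return value
    return values[0]


library = {("Jane Austen", "Emma"): {"EPUB"}}
incoming = [
    ("Jane Austen", "Emma", "PDF"),
    ("Frank Herbert", "Dune", "EPUB"),
    ("Jane Austen", "Emma", "EPUB"),
]
merged, added, skipped = classify_incoming(library, incoming)
print(merged, added, skipped)
print(merged_field(["Unknown", "D. H. Lawrence", "Lawrence, DH"]))
```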
|
|