MobileRead Forums > E-Book Software > Calibre
Find duplicate books...

#1: silentguy (Connoisseur), 12-09-2010, 12:16 PM

Hi!
I got the feeling that the duplicate check calibre offers by default upon adding is not good enough, so I thought I'd try to understand how the search source code works and write a plugin that runs some searches to find duplicates and displays them in the book list.
Sadly, I haven't been able to figure out how to properly search from a plugin, nor how to tell it to display only the books my search returned. Could someone point me in the right direction?
Essentially I just want it to display all the books with the IDs returned by "select group_concat(id) from books group by UPPER(title) having count(*) > 1;"
That query could probably be tweaked, but I figured I'd get the basics working first :P
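As a sketch of what that query returns, here it is run against a throwaway in-memory database that mimics just the relevant slice of calibre's metadata.db (a books table with id and title; the real schema has many more columns):

```python
import sqlite3

# Throwaway in-memory database mimicking just the relevant slice of
# calibre's metadata.db (a books table with id and title columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (id, title) VALUES (?, ?)",
    [(1, "Dune"), (2, "dune"), (3, "Neuromancer")],
)

# The query from the post: bucket titles case-insensitively and keep
# only the buckets holding more than one book.
rows = conn.execute(
    "SELECT group_concat(id) FROM books "
    "GROUP BY UPPER(title) HAVING count(*) > 1"
).fetchall()
for (ids,) in rows:
    print(ids)  # one comma-separated id group per duplicated title
```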
#2: silentguy (Connoisseur), 12-09-2010, 01:22 PM
Okay, combining the hello world GUI plugin with
db = self.gui.library_view.model().db
dupes = db.conn.get('select group_concat(id) from books group by UPPER(title) having count(*) > 1;')
allowed me to build a message box displaying the IDs of duplicate books. Next step: messing with the view and trying to figure out whether my direct SQL access is a bad idea :P
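Each row from that query is a single comma-separated id string per duplicate group. A small stand-alone helper (the name is hypothetical, not part of calibre) to turn those rows into integer id lists for display:

```python
def parse_dupe_groups(rows):
    """Split rows like [('1,2',), ('5,9,12',)], the shape the
    group_concat query produces, into lists of integer book ids,
    one list per duplicate group. Helper name is hypothetical."""
    return [[int(i) for i in row[0].split(",")] for row in rows]

print(parse_dupe_groups([("1,2",), ("5,9,12",)]))  # [[1, 2], [5, 9, 12]]
```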
#3: kovidgoyal (creator of calibre), 12-09-2010, 01:28 PM
Direct SQL access is fine, if it is read-only. If you want to make changes, you should use the API methods in LibraryDatabase2, as they do various things to maintain consistency.
#4: silentguy (Connoisseur), 12-09-2010, 06:47 PM
Yay, my first version of the plugin is working. It searches for equal titles (both case-sensitive and case-insensitive) and then displays the books it found.
I started working on something like "similar" titles, but that was hard to put into a nice query...

http://bugs.calibre-ebook.com/ticket/4571
#5: kovidgoyal (creator of calibre), 12-09-2010, 09:06 PM
You'll get a lot more flexibility if you use the Python cache rather than SQL. Look at find_identical_books in database2.py.
#6: aceflor (Wizard), 12-10-2010, 05:00 AM
wrong thread, sorry

#7: Starson17 (Wizard), 12-10-2010, 09:42 AM
find_identical_books in database2.py is used in the autosort/automerge code to find books that have identical author(s) and nearly identical titles (see fuzzy_title). If the autosort/automerge option is on, incoming books are compared to existing book records with find_identical_books, and the incoming format is added to the existing record.

I considered doing fuzzy matching on authors, and more aggressive fuzzy matching on titles, but for automatic merging there were too many errors. If you're going to write a duplicate finder, you can be aggressive, provided you only display the results and merging is done manually.

I don't know if you want to compare only titles, or also consider authors. I was thinking about multiple types of dupe finding, selectable during the search:
1) match on title (fuzzy) only,
2) match on title (fuzzy) and exact author (this is what find_identical_books produces), and
3) match on title (fuzzy, but aggressive: ignoring plurals) and author (fuzzy: ignoring initials, Jr., second authors, etc.)

You will find a couple of threads on duplicate finding, if you search here, that provide SQL searches to be run with the calibre-debug tool or with an SQL database browser like SQLiteSpy. I've found that most of my "duplicates" on title only are not really duplicates, just similar or identical titles.
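A rough illustration of a "match on title (fuzzy) only" check, grouping books on a normalized title key. The normalization rules here are hypothetical, not calibre's actual fuzzy_title:

```python
import re
from collections import defaultdict

def fuzzy_title_key(title):
    """Hypothetical normalization key (NOT calibre's fuzzy_title):
    lowercase, drop a leading article, strip punctuation, collapse
    whitespace."""
    t = title.lower().strip()
    t = re.sub(r"^(a|an|the)\s+", "", t)
    t = re.sub(r"[^\w\s]", "", t)
    return re.sub(r"\s+", " ", t).strip()

def find_title_dupes(books):
    """Group (book_id, title) pairs by fuzzy key and return the
    groups of ids that contain more than one book."""
    groups = defaultdict(list)
    for book_id, title in books:
        groups[fuzzy_title_key(title)].append(book_id)
    return [ids for ids in groups.values() if len(ids) > 1]

print(find_title_dupes([(1, "The Hobbit"), (2, "Hobbit!"), (3, "Dune")]))
```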
#8: kiwidude (Calibre Plugins Developer), 12-10-2010, 10:28 AM
I ended up writing my own external tool to query the database and find duplicates based on quite a range of criteria. It is fairly "fuzzy": it strips off leading "A " and "The ", rips out characters like colons and apostrophes, and pumps out various sets of results. It also does "starts with" checks, since the same book can appear again with a longer version of the title, as many books do. Similar "starts with" checks are done on authors (taking into account the first initial). And since all my authors are supposed to be stored LN, FN, I also look for names stored as FN LN (no comma) or authors whose names were imported the wrong way around and so stored as FN, LN.
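A minimal sketch of the stripping and "starts with" checks described above. The normalization rules are assumptions for illustration, not the actual tool's:

```python
def normalize(title):
    """Hypothetical normalization along the lines described above:
    lowercase, drop a leading "A "/"The ", remove colons and
    apostrophes. The real tool's rules are more extensive."""
    t = title.lower().strip()
    for article in ("a ", "the "):
        if t.startswith(article):
            t = t[len(article):]
            break
    return t.replace(":", "").replace("'", "")

def starts_with_dupe(a, b):
    """The "starts with" check: flag two titles when, after
    normalization, one is a prefix of the other (same book, one
    copy carrying the longer title)."""
    na, nb = normalize(a), normalize(b)
    shorter, longer = sorted((na, nb), key=len)
    return longer.startswith(shorter)

print(starts_with_dupe("The Stand", "Stand: The Complete Edition"))
```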

It "does a job" and helps me eliminate many duplicates I would otherwise have. However, one of the issues, as Starson17 says, is that for certain types of checks there are genuine exceptions to a rule, and you can waste time re-verifying those exceptions every time you run the duplicate check, particularly when you have lots of books. The "fuzzier" the search, the better the chance of finding duplicates, but the more false positives you have to keep looking through.

My current solution to this is to:
- Run the duplicate check and process the results until I am happy with them.
- Run the check again. The output this time should be just the stuff I am happy to treat as exceptions. I save that output as a text file.
- The next time I run the check, open the previous output in Notepad++, paste the new output into another tab, then use its built-in diff functionality to highlight just the "new" duplicates detected.
- Once that is done, go back to the second step, thereby overwriting my baseline with the new output.

All that baseline/persistence/identify-only-new-stuff could be built into a tool, but the above was just a quick and dirty "get it done" approach I use. I will be interested to see what evolves from other ideas people have. Just food for thought.
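The baseline workflow above could be sketched in Python. The function name and JSON file format here are assumptions for illustration, not the poster's Notepad++ approach:

```python
import json
from pathlib import Path

def new_dupes_only(current_groups, baseline_path):
    """Sketch of the baseline idea above: persist the accepted
    duplicate groups to a JSON file and report only groups not seen
    on a previous run."""
    path = Path(baseline_path)
    baseline = set()
    if path.exists():
        baseline = {tuple(g) for g in json.loads(path.read_text())}
    # Sort each group so member order never causes a false "new" hit.
    current = {tuple(sorted(g)) for g in current_groups}
    fresh = sorted(current - baseline)
    path.write_text(json.dumps(sorted(current)))  # becomes the new baseline
    return fresh
```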
#9: Starson17 (Wizard), 12-10-2010, 10:45 AM
Quote:
Originally Posted by kiwidude
I ended up writing my own external tool to query the database and find duplicates based on quite a range of criteria. It is fairly "fuzzy": it strips off leading "A " and "The ", rips out characters like colons and apostrophes, and pumps out various sets of results.
This is built into find_identical_books.

Quote:
It also does "starts with" checks, since the same book can appear again with a longer version of the title, as many books do.
This is a good point. I've seen a few of this type of dupe that weren't caught with my tools.

Quote:
you can get a problem with wasting your time re-verifying that exception every time you run the duplicate check, particularly when you have lots of books.
Like you, I built my own dupe checker, and like you, I found myself rechecking the same exceptions a lot. One of the reasons I posted was to highlight the same issue you have highlighted: what you want or need the dupe checker to do seems to change as you use it. I found myself changing the search a lot to look for dupes in different ways and spending too much time looking at the exceptions. For a while I had a custom boolean column that meant "if all dupes found for this title have this column checked, we are not dupes of each other".
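That exemption column can be sketched as a post-filter: given the duplicate groups and the set of book ids with the boolean column checked, drop any group whose members are all flagged. Names are hypothetical:

```python
def filter_exempt_groups(groups, exempt_ids):
    """Sketch of the custom-column idea above: drop any duplicate
    group whose members are ALL flagged as "not really dupes of
    each other". A group with even one unflagged member still needs
    a human look."""
    return [g for g in groups if not all(i in exempt_ids for i in g)]

print(filter_exempt_groups([[1, 2], [3, 4]], {1, 2}))  # [[3, 4]]
```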
#10: kiwidude (Calibre Plugins Developer), 12-10-2010, 10:54 AM
Yes, you are right, it does evolve a bit, and probably like me you tune it to the way you store books. For instance, I strip off "(Omnibus)", since I know that is how I store those types of titles, among various other things.

One point I did not mention, which you brought up again with the find_identical_books comment: the built-in logic calibre applies at import time does you no good if the filename was not "close enough" at the time of import. For instance, a common thing I miss is a missing space between the series hyphen and the title. My title then gets imported with a name like "Series X-Title name", which the calibre logic cannot pick up. That is easy to spot in calibre when you review your newly added books; you fix the title/series up correctly and think the job is done. However, of course, that can now result in a duplicate.

My point being that regardless of how much cleverness goes into the "merge" logic, there will always be situations where, as the result of an edit, you end up with a duplicate that only some sort of post-check can pick up, replicating similar and more extensive checks.
#11: Starson17 (Wizard), 12-10-2010, 11:03 AM
Quote:
Originally Posted by kiwidude
One point I did not mention, which you brought up again with the find_identical_books comment: the built-in logic calibre applies at import time does you no good if the filename was not "close enough" at the time of import. For instance, a common thing I miss is a missing space between the series hyphen and the title. My title then gets imported with a name like "Series X-Title name", which the calibre logic cannot pick up. That is easy to spot in calibre when you review your newly added books; you fix the title/series up correctly and think the job is done. However, of course, that can now result in a duplicate.
Yes. This specific issue is an interaction between the regex used to identify the title and series and the autosort/automerge code that compares the title passed to it by the regex with the titles of existing book records. The missing space caused the regex to think the title was "Series X-Title name", and that did not match the book title of "Title name".
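To illustrate the failure mode with a hypothetical stand-in pattern (calibre's import regexes are user-configurable and differ from this):

```python
import re

# Hypothetical stand-in for a filename-parsing pattern that expects
# "Series - Title"; calibre's actual patterns are configurable.
pattern = re.compile(r"^(?P<series>.+?) - (?P<title>.+)$")

# With the spaces around the hyphen present, the title is extracted:
m = pattern.match("Series X - Title name")
print(m.group("series"), "/", m.group("title"))  # Series X / Title name

# With the space missing, the pattern fails entirely, so the whole
# string would be treated as the title, and after a later manual fix
# it becomes a duplicate of any existing "Title name" record.
assert pattern.match("Series X-Title name") is None
```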

Quote:
My point being that regardless of how much cleverness goes into the "merge" logic, there will always be situations where, as the result of an edit, you end up with a duplicate that only some sort of post-check can pick up, replicating similar and more extensive checks.
Agreed. I've started to put together a dupe checker a few times, but my motivation is low now that my library is in good shape.