Old 08-11-2010, 03:27 AM   #1
spacemonkey
Enthusiast
Posts: 47
Karma: 120
Join Date: Aug 2010
Device: kindle 3 wifi
Metadata download

I recently got started with Calibre, and one area of particular interest to me was the metadata and cover download function.

I had two issues with this:
1) If I use the context menu to download metadata and covers, I often get a bad selection back.
2) If I try to download for an ebook with a LOT of possible matches (especially the case with old books), it often hangs and fails to download.

I've started hacking around the code with some success and thought I'd post my findings/thoughts to see if anyone else had good suggestions for what else I should do. If I get to a good level of success I plan on giving back my code tweaks.

What I've found/done so far:
1) The method it uses when you right-click seems to be to just select the first matched result from the same results you'd see if you used the interactive download-and-pick function in the edit individual metadata screen.
This list is sorted on the following criteria, in priority order:
  • Title is an exact match (although code comments imply titles starting with a match)
  • books with coverart
  • books with longer descriptions
I've changed this a little to
  • Title is an exact match
  • Title begins with the search title
  • Author exists in authors collection
  • books with coverart
  • books with longer descriptions
This seems to be getting me some success; however, I think I might try to work it more towards an overall score rather than just a progression down the list.
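
For illustration, the modified priority order above could be expressed as a tuple sort key. This is only a rough sketch, not calibre's actual code: the `Result` record, its field names and the `rank` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    # Hypothetical record; calibre's real result objects differ.
    title: str
    authors: list = field(default_factory=list)
    has_cover: bool = False
    description: str = ""


def sort_key(result, query_title, query_author):
    """Build a tuple so that better matches sort first (smaller is better)."""
    title = result.title.strip().lower()
    wanted = query_title.strip().lower()
    authors = [a.strip().lower() for a in result.authors]
    return (
        title != wanted,                              # exact title match first
        not title.startswith(wanted),                 # then titles beginning with the query
        query_author.strip().lower() not in authors,  # then a matching author
        not result.has_cover,                         # then results with cover art
        -len(result.description),                     # then longer descriptions
    )


def rank(results, query_title, query_author):
    """Return the results ordered best match first."""
    return sorted(results, key=lambda r: sort_key(r, query_title, query_author))
```

Because Python compares tuples element by element, each criterion only breaks ties left by the one before it, which is exactly the "progression down the list" behaviour described.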

2) On this item, I did some network tracing and found that it would fetch a very large list, up to a maximum of 1000-odd titles (which wasn't too much of a problem), but then it would start gathering cover art and wider details book by book. If the list was big (an example I was working on had nearly 200 entries), it would time out during the cover art and additional data steps.

Obviously I could just increase the timeout, but that felt painful, so instead I modified the process to apply the aforementioned sorting (except for evaluating cover art) and then trim the list to 20 items before continuing with the book-by-book downloads.
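
Roughly, the trim-before-fetch idea looks like the sketch below. The function name and the `cheap_key` and `fetch_details` callables are made up for the example; they stand in for whatever sorting and per-book download calibre really does.

```python
def shortlist_then_enrich(results, cheap_key, fetch_details, limit=20):
    """Sort on cheap criteria, keep the best `limit`, then do per-book fetches.

    `cheap_key` must only use data already present in the search results
    (no cover art), and `fetch_details` is the expensive per-book call.
    """
    shortlist = sorted(results, key=cheap_key)[:limit]
    return [fetch_details(r) for r in shortlist]
```

With a list of 200 matches this makes 20 expensive requests instead of 200, at the cost of never looking at cover art for the discarded entries.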

Any thoughts or suggestions would be appreciated.

SM
Old 08-11-2010, 09:24 AM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by spacemonkey View Post
I've changed this a little to

* Title is an exact match
* Title begins with the search title
* Author exists in authors collection
* books with coverart
* books with longer descriptions


Any thoughts or suggestions would be appreciated.
Code tweaks are always welcome.

Fuzzy title/author matching is fraught with risks when doing automated downloading of metadata. By default, Calibre overwrites the author/title found during the search, so you have to be certain it's a pretty good match or risk losing the basic info Calibre uses to identify the book.

My personal bugaboo is when the search title has an apostrophe and the online data does not, or vice versa. I'm constantly re-fetching by adding or removing an apostrophe.

FYI, I can think of two other spots in Calibre's code that do fuzzy matching like this. One is the automatic sorting during Add Books with the "If books with similar author...." option. That does a fuzzy match on title if there is an exact match on author. The fuzzy match is not aggressive - it mostly ignores words like "a," "an" and "the" and non-alphanumeric characters. The other is in Chaley's code that identifies books on a device that are also in the Calibre database.
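
To give an idea of what a non-aggressive fuzzy match like that might look like, here is a small sketch. It is not the code calibre uses; the stop-word list and the function names are assumptions for the example.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # assumed list; calibre's actual list may differ


def normalize_title(title):
    """Lower-case the title, drop punctuation and drop common articles."""
    cleaned = re.sub(r"[^\w\s]", "", title.lower())
    return " ".join(w for w in cleaned.split() if w not in STOP_WORDS)


def titles_match(a, b):
    """True when two titles are equal after normalization."""
    return normalize_title(a) == normalize_title(b)
```

A match of this kind would also sidestep the apostrophe problem, since "Bridget Jones' Diary" and "Bridget Jones Diary" normalize to the same string.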
Old 08-11-2010, 11:18 AM   #3
kovidgoyal
creator of calibre
Posts: 43,775
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
A SpamAssassin-style, score-based system may be more successful than the current best-match algorithm, but it would take a lot of testing and tweaking to get the scores right.

Changing the algorithm to only check for covers of a subset of matched books is a good idea.
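
A very rough sketch of what such a score-based approach could look like: every rule a result satisfies adds its weight, and the highest total wins. The rule names, the weights and the dict layout are all invented for the example, and the weights are exactly the part that would need the testing and tweaking.

```python
# Hypothetical weights; real values would need testing and tuning.
WEIGHTS = {
    "exact_title": 5.0,
    "title_prefix": 2.0,
    "author_match": 3.0,
    "has_cover": 1.0,
    "has_description": 0.5,
}


def score(result, query_title, query_author):
    """Sum the weights of every rule the result satisfies (higher is better)."""
    title = result["title"].strip().lower()
    wanted = query_title.strip().lower()
    authors = {a.strip().lower() for a in result.get("authors", [])}
    rules = {
        "exact_title": title == wanted,
        "title_prefix": title != wanted and title.startswith(wanted),
        "author_match": query_author.strip().lower() in authors,
        "has_cover": bool(result.get("has_cover")),
        "has_description": bool(result.get("description")),
    }
    return sum(WEIGHTS[name] for name, hit in rules.items() if hit)


def best_match(results, query_title, query_author):
    """Return the single highest-scoring result."""
    return max(results, key=lambda r: score(r, query_title, query_author))
```

Unlike a strict priority sort, several weaker signals can outvote one strong one here: with these weights, a prefix title match with the right author, a cover and a description (6.5) beats an exact title match with the wrong author (5.0).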
Old 08-12-2010, 03:09 AM   #4
spacemonkey
Enthusiast
Posts: 47
Karma: 120
Join Date: Aug 2010
Device: kindle 3 wifi
Quote:
Originally Posted by Starson17 View Post
My personal bugaboo is when the search title has an apostrophe and the online data does not, or vice-a-versa. I'm constantly re-fetching by adding or removing an apostrophe.
Interesting. What I may do, then, before attempting any scoring system, is aim to automatically gather results with and without punctuation into a single list and then sort on the current basis. So if the title is "Bridget Jones' Diary", have it query isbndb for both "Bridget Jones Diary" and "Bridget Jones' Diary" and concatenate the result sets.

I've noticed similar things with initial-style authors, e.g. H G Wells vs H. G. Wells vs H. G Wells etc.
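
Something along these lines is what I have in mind; treat it as a sketch only. The `search` argument is a placeholder for whatever actually queries isbndb, and the result dicts with an "isbn" key are an assumption.

```python
import re


def title_variants(title):
    """Return the title as given plus a copy with punctuation stripped."""
    stripped = " ".join(re.sub(r"[^\w\s]", "", title).split())
    variants = [title]
    if stripped != title:
        variants.append(stripped)
    return variants


def search_all_variants(title, search):
    """Run `search` for each variant and concatenate results, skipping duplicates."""
    seen, merged = set(), []
    for variant in title_variants(title):
        for result in search(variant):
            key = result.get("isbn") or (result.get("title"), result.get("author"))
            if key not in seen:
                seen.add(key)
                merged.append(result)
    return merged


def normalize_initials(author):
    """Collapse 'H G Wells', 'H. G. Wells' and 'H. G Wells' to 'H. G. Wells'."""
    parts = author.replace(".", " ").split()
    return " ".join(p + "." if len(p) == 1 else p for p in parts)
```

Note that `normalize_initials` also strips the dot from things like "Jr.", so it would need more care before being used for real.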
Old 08-12-2010, 03:44 PM   #5
Sydney's Mom
Wizard
Posts: 2,895
Karma: 6995721
Join Date: Dec 2008
Location: Idaho, on the side of a mountain
Device: Kindle Oasis, Fire 3d Gen and 5th Gen and Samsung Tab S
If I can't find metadata, first I make sure there is no ISBN (always seems to be wrong, even though that is what CPL has). Then I delete everything except author or title. I usually find it.
Old 08-12-2010, 04:05 PM   #6
travger
Evangelist
Posts: 480
Karma: 270594
Join Date: Aug 2010
Device: palm tx, Windows7, Galaxy A5
I tried to get metadata for "The Hammer" by S. M. Stirling & David Drake. Nothing. It succeeded only when I changed the author field to David Drake & S. M. Stirling.
And of course, those apostrophes...
I'd like to be offered some selections even with just the author's last name, or no author at all.
Old 08-12-2010, 04:22 PM   #7
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by travger View Post
I tried to get metadata for "The Hammer" by S. M. Stirling & David Drake. Nothing. Succeeded only when I changed author field for David Drake & S. M. Stirling.
And of course, those apostrophes...
I'd like to be offered some selections even with just last name of the author, or no author at all.
After a bulk fetch, I go back through the remaining books to solve individual problems. Typically, for a problem book, I copy and paste the author and title into the comments section (so I don't lose them), then leave only the author's last name and one unique word from the title. That seems to produce decent results in most cases. For a multi-author book, I try each author. If I get a good result, I restore the full author and title either by allowing overwrite or by cutting and pasting from the comments field where I stored them originally.
Old 08-13-2010, 01:46 AM   #8
spacemonkey
Enthusiast
Posts: 47
Karma: 120
Join Date: Aug 2010
Device: kindle 3 wifi
A further thing I've found (I haven't done any actual work on this because my PC was broken; it's almost fixed now):

If you use the interactive human-based search on www.isbndb.com, it works amazingly well and gives you a good prioritised list of results.

You stick in "title by author(s)" in the text box and wahay. Pity their API does not implement the richer search. I'm almost tempted to build a screen scraper for the html version of the site.
Old 08-13-2010, 07:56 AM   #9
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by spacemonkey View Post
If you use the interactive human-based search on www.isbndb.com, it works amazingly well and gives you a good prioritised list of results.

You stick in "title by author(s)" in the text box and wahay. Pity their API does not implement the richer search. I'm almost tempted to build a screen scraper for the html version of the site.
Better search results when metadata fetching would be greatly appreciated by many users.
Old 08-13-2010, 09:44 AM   #10
kumbaja
Junior Member
Posts: 8
Karma: 58
Join Date: Aug 2010
Device: iPad
I've taken a look at the code today too, and I can more or less confirm your initial assessment, except that I think you are missing one step.
Quote:
* Results from isbndb and google books are merged by isbn. The first database queried takes precedence in case of conflict.
* Title is an exact match (although code comments imply titles starting with a match)
* books with coverart
* books with longer descriptions
IMO there is a big problem in the current implementation. For one, the merge is somewhat problematic, because potentially good data from a second data source gets dropped without any inspection.
Second, the title comparison considers one title better than another if and only if it is equal to the search query title (sans some common stop words, and case-insensitively). If two result titles both differ from the queried title, they are both considered "equally bad" and only the cover art and description length are then used for sorting.
In other words, if there are no results with an exact title match, then the remaining results are ordered by cover art and description regardless of title (and author and publisher, btw).
IMO we should consider some type of distance metric for title matching, such as Levenshtein distance, Jaccard similarity or even TF-IDF (see: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html)
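
To make the first two of those metrics concrete, here is a minimal pure-Python sketch. Nothing like it exists in calibre's matching code as far as I can tell; the function names are mine, and the Jaccard variant shown works on whole words rather than characters.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def jaccard(a, b):
    """Jaccard similarity of the two titles' word sets (1.0 means identical sets)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

With either metric, a near miss such as a missing apostrophe (edit distance 1) would rank well ahead of an unrelated title instead of being lumped in as "equally bad".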
Old 08-13-2010, 10:51 AM   #11
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
A quick comment - kumbaja and spacemonkey are both recent users of Calibre, if you go by their MR registration date. It's likely both are engaged in the time-consuming project of adding a pre-existing ebook collection into Calibre. When I was engaged in that project, I wrote a half dozen bits of code to make book entry and metadata fetching easier. Now my motivation is less. Like most other long-term users, all my old collection is entered, and the annoyances for adding the books I buy are far less annoying than when I was trying to add hundreds of books at a time.

Bottom line - it's not surprising that newer users have a bigger incentive to improve this part of the code than long-time users who have all of their library in good shape already.

Go to it boys - the part of the code you're looking at can definitely use some improvement. I look forward to seeing what you can come up with!
Old 08-13-2010, 11:15 AM   #12
kumbaja
Junior Member
Posts: 8
Karma: 58
Join Date: Aug 2010
Device: iPad
@Starson: Exactly. Add to that that I am German and like to read both German- and English-language books, which makes metadata fetching an even more difficult exercise.

@spacemonkey: Afaik the isbndb API uses the same algorithm and sorting as their website. At least that's what the API documentation says. I think some of the less-than-optimal results in calibre come from the merging and re-sorting inside calibre.
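
On the merging side, one possible alternative is a field-by-field merge: the first source still wins on conflict, but empty fields get filled from the second source instead of the whole record being dropped. This is only a sketch; the dict-based records and the `merge_by_isbn` name are assumptions, not calibre's data structures.

```python
def merge_by_isbn(primary, secondary):
    """Merge two result lists by ISBN, filling gaps instead of discarding data.

    Records are plain dicts with an "isbn" key. Values from `primary` take
    precedence; `secondary` only supplies fields that are missing or empty.
    """
    merged = {r["isbn"]: dict(r) for r in primary}
    for result in secondary:
        existing = merged.setdefault(result["isbn"], {})
        for field, value in result.items():
            if value and not existing.get(field):
                existing[field] = value
    return list(merged.values())
```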
Old 08-13-2010, 11:26 AM   #13
kovidgoyal
creator of calibre
Posts: 43,775
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Yeah, I, on the other hand, never added large numbers of books to calibre, which is why those bits of code have never received as much love as other parts.

I look forward to patches. This is how calibre improves: when motivated people work on the parts of calibre that are important to them.
Old 08-13-2010, 11:51 AM   #14
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kovidgoyal View Post
Yeah, I on the other hand, never added large numbers of books into calibre, which is why those bits of code have never received as much love as other parts.
The first bit of code I submitted was about one line long. It changed the way the checkbox "Swap author firstname and lastname" worked. IIRC, at the time "Smith, John" was being swapped to "John Smith," (leaving the comma) and "Smith, John Quincy" was being swapped to "Quincy Smith, John". It was pretty obvious that you weren't using that part of the code much and it could use some love. (Also, IIRC, you had to fix my one line of code.)
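
For what it's worth, the behaviour I was after can be sketched like this. It is not the line I submitted (I no longer remember it), and it only handles the comma form; the function name is made up.

```python
def swap_author_name(name):
    """Turn 'Last, First Middle' into 'First Middle Last'; leave other forms alone."""
    last, sep, first = name.partition(",")
    if not sep or not first.strip():
        return name.strip()
    return f"{first.strip()} {last.strip()}"
```

Splitting on the comma rather than on spaces is what keeps "John Quincy" together and avoids leaving a stray comma behind.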

I'm sure we'd all prefer you work on the big projects (client-server, pdf conversion, etc.) and let others find a few places in the code that they care about and want to polish. I've worked on both the add books and metadata fetching code, and it really helps to have done a lot of those operations to understand what problems can occur.