Old 04-16-2011, 11:10 PM   #1
kiwidude
calibre/Sigil Developer
Posts: 4,258
Karma: 1579358
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite, iPad Pro
New metadata API in 0.8 questions

Rather than hijack Idolse's thread I'll ask some questions here and it might make it easier for others to find too in future.

(1) Is there a way, in the identify() method of a Source-derived plugin, to know whether it has been called interactively from the fetch metadata button on the edit metadata dialog or non-interactively as part of a batch?

I ask because the search results Goodreads presents for a title/author match are a single rolled-up record per book, taken from their own "best match" across their metadata sources, with a link to all the individual editions. Sometimes their "best match" isn't the most useful edition (e.g. it is for an audio book, or has no cover). So when fetching metadata interactively it might be nice to take the extra query hop and grab the top few editions (they are sorted), to allow the user to make a choice.

However if the plugin is in the mode of being called as part of a background job, there is no point in slowing it down further by doing the extra query. If I could detect where I was being called from I could optimise for this.

(2) Is it permissible to return multiple covers in the result queue from download_cover()? I ask because it seems the plugins in Calibre are coded to return only a single cover?

Again I'm thinking that "if" I did the extra hop, I could cache the image urls of the top x editions and return different covers for them if available.

If the user has a specific ISBN or Goodreads Id already at the time they do the search, none of this extra edition/covers stuff applies, it is only for title/author searches.
Old 04-16-2011, 11:42 PM   #2
kovidgoyal
creator of calibre
Posts: 39,997
Karma: 17764952
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
1) No, there's no way to know how identify() has been called.

2) One cover per plugin is the rule.

I can change 1) quite easily, but (2) is rather more work than I'm willing to put in at this point. Still it should be a fairly easy patch if you are motivated enough to do it.

Though I think that even in the case of bulk metadata downloads you wouldn't want to return a match with no cover/poor metadata, so it is worth fetching all the editions and sorting the list. Run each individual fetch in a thread, so then the hit should only be a few seconds.

This is for example what I do with Amazon, even though their search engine returns fairly good relevance rankings.
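A rough sketch of the one-thread-per-edition idea, for anyone following along. fetch_edition and the 'has_cover' field are hypothetical placeholders for the real network fetch and parse, not anything from calibre; only the threading pattern is the point.

```python
import threading
from queue import Queue, Empty


def fetch_edition(url):
    # Hypothetical stand-in for fetching and parsing one edition page.
    # It fabricates a small record from the URL so the sketch runs offline.
    return {'url': url, 'has_cover': not url.endswith('audio')}


def fetch_all(urls, timeout=30):
    """Fetch every edition concurrently; return records in input order."""
    results = Queue()

    def worker(index, url):
        try:
            results.put((index, fetch_edition(url)))
        except Exception:
            # One failed edition should not sink the whole batch.
            results.put((index, None))

    threads = [threading.Thread(target=worker, args=(i, u), daemon=True)
               for i, u in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout)

    collected = []
    while True:
        try:
            collected.append(results.get_nowait())
        except Empty:
            break
    collected.sort(key=lambda item: item[0])
    return [record for _, record in collected if record is not None]


def rank(records):
    # Editions with a cover first; sorted() is stable, so the site's own
    # ordering is preserved within each group.
    return sorted(records, key=lambda r: not r['has_cover'])
```

Because the fetches overlap, the total wait is roughly that of the slowest edition page rather than the sum of all of them.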
Old 04-17-2011, 06:27 AM   #3
kiwidude
calibre/Sigil Developer
Thanks Kovid. I will try doing the extra hop thing by default and see what happens.

Re the extra covers. I was thinking I would have to have my own mapping cache to store a list of urls per id in the plugin. Is the issue in calibre related to that, or is it further downstream, like the cover results being keyed by plugin name or something? I'm not desperate to have this feature as I can always add it later, but now I'm curious as to where the limitation lies.
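For illustration, the kind of per-id mapping cache described here could be as small as the sketch below. CoverUrlCache is a made-up name and is not part of calibre; the lock is there only on the assumption that worker threads would fill the cache concurrently.

```python
import threading


class CoverUrlCache:
    """Thread-safe map from a book id to an ordered list of cover URLs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._urls = {}

    def add(self, book_id, url):
        # Keep insertion order and skip duplicates.
        with self._lock:
            urls = self._urls.setdefault(book_id, [])
            if url not in urls:
                urls.append(url)

    def get(self, book_id):
        # Return a copy so callers cannot mutate the cached list.
        with self._lock:
            return list(self._urls.get(book_id, ()))
```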

I don't know how google and amazon return results but I seem to have a more complex set of permutations to handle retrieval in identify() than the other plugins. I'm using amazon as the base, so a Worker thread class is similarly responsible for parsing the final detail page. However identify() can be called with a variety of different things and the Goodreads site responds differently to each.

- If I have a Goodreads ID, then I can immediately construct a URL to give directly to the Worker.
- If I have an ISBN and Goodreads has a match for it, then it will return the details page as the response, rather than the search results page.
- If I have an ISBN that there is no match for, then the search results page comes back with a no results message.
- If I do a title/author search it will always be the search results page. However sometimes it seems Goodreads would rather give you search results for a "similar" book than say there were no matches.
- If I have search results, as mentioned above they are rolled up for each book, with a link to the editions for each. Note that the editions page is another search-results type of page, so I would still need to grab book URLs from that to pass to the Worker.

I can tell the two ISBN responses apart from the 'location' response header. For code simplicity I can just hand the URL off to the Worker thread in the case of a match. However, that means the Worker fires another request at Goodreads for the same URL I just got the response for. So I think I should allow passing the response into the Worker as an alternative parameter, to avoid the extra fetch and bypass some of what the Worker does to fetch from a URL in this case?

The title/author "vaguely similar results" thing is the biggest problem. I handled this in my current plugin by doing a fuzzy match on the title and author of the search result versus what I was searching for, because in a bulk download in the background I did not want it to retrieve data for the wrong book just because it was the first result.

Do you have that same issue to handle at all with any of the calibre plugins? I know amazon has the relevance thing, but iirc that is just the order on the search results for which my equivalent would come from the editions page. Am I correct in thinking you assume that any search result will be fine and there are no sanity checks elsewhere in the calibre process?

It may just be a Goodreads thing that they try to be too helpful. For instance they will show a result by an author with the same surname. Now under no circumstances do I want that to be treated as 'good enough' I think. So I will still need to do my own fuzzy sanity checks, right?
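One possible shape for such a sanity check is sketched below. It is not the check from the existing Goodreads plugin: the function names and the 0.8 cutoff are assumptions, and it compares word by word so that a shared surname alone is not enough, while a query made of only a few words can still match.

```python
from difflib import get_close_matches


def tokens(text):
    # Lowercase and split on whitespace; punctuation handling is left out.
    return text.lower().split()


def query_matches(query, candidate, cutoff=0.8):
    """True if every word of the query has a close match in the candidate."""
    candidate_tokens = tokens(candidate)
    return all(get_close_matches(word, candidate_tokens, n=1, cutoff=cutoff)
               for word in tokens(query))


def is_plausible_result(query_title, query_author, result_title, result_author):
    # Both the title and the author must pass; either query may be partial.
    return (query_matches(query_title, result_title) and
            query_matches(query_author, result_author))
```

With these settings "Frank Herbert" does not match "Brian Herbert", but a small typo such as "Frank Herbet" still matches "Frank Herbert".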
Old 04-17-2011, 10:05 AM   #4
kovidgoyal
creator of calibre
The limitation is in the rest of the download system, it basically keeps only one cover per plugin and throws away the rest. The idea is that the plugin is best suited to figuring out which is the best cover and returning it.

Look at the class InternalMetadataCompareKeygen, it is used to sort results in order of relevance. You can implement an alternative algorithm in your plugin if you feel the need.
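As an illustration of what an alternative relevance sort could look like, here is a minimal key function. It is not calibre's InternalMetadataCompareKeygen; the dict fields ('isbn', 'cover_url', 'comments', 'source_rank') and the ordering of the criteria are assumptions made for the sketch.

```python
def relevance_key(result, query_isbn=None, query_title=None):
    # `result` is a plain dict standing in for a metadata object.
    # Each element is 0 for "good" and 1 for "bad", so an ascending
    # sort puts the best candidates first.
    isbn_miss = 0 if query_isbn and result.get('isbn') == query_isbn else 1
    title = (result.get('title') or '').lower()
    title_miss = 0 if query_title and title == query_title.lower() else 1
    no_cover = 0 if result.get('cover_url') else 1
    no_comments = 0 if result.get('comments') else 1
    # Final tie-break: the order in which the site returned the results.
    return (isbn_miss, title_miss, no_cover, no_comments,
            result.get('source_rank', 0))
```

A plugin would then sort with something like sorted(results, key=lambda r: relevance_key(r, query_title=title)).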

Then look at the isbndb plugin which further filters returned results based on the title/author query. When doing filtering remember that the user could just specify a single author name or a couple of words from the title.
Old 04-17-2011, 11:05 AM   #5
kiwidude
calibre/Sigil Developer
Thanks for the pointers Kovid, I will take a look into that title/author stuff. I can't directly override identify_results_keygen for this purpose because it is "too late" (I want to do it while deciding whether to fetch a result, not when sorting them after the fact) but hopefully there is some stuff I can steal between that and isbndb. It has always been a "weakness" of doing a title/author match with my Goodreads plugin that if you were too "different" from what they had then I would refuse the match, and I put it on the todo list for the rewrite. Can't put it off any more...

FYI, and maybe you expected this, but I tried using soupparser.fromstring like you did with Amazon and found that in one scenario it trashed the original html, making it unusable, so I went back to just using html.fromstring. Specifically it turned this:
Code:
<div ...><span...><p>Some text</p></span><span...><p>More text</p></span></div>

into

<div ...><span...></span><p>Some text</p><span...></span><p>More text</p></div>
So the closing span tags got moved and placed next to the opening ones. Filth. Things work properly using just fromstring though.
Old 04-17-2011, 11:10 AM   #6
kovidgoyal
creator of calibre
I've found soupparser to do a good job with Amazon markup; html.fromstring would choke on it. YMMV
Old 04-17-2011, 11:40 AM   #7
kiwidude
calibre/Sigil Developer
Could test.py be enhanced to actually say *which* test has failed? At the moment it just iterates the test functions and says "No results that passed all tests were found", with no hint as to whether it was title, authors etc, or indeed what values it was given versus what was expected, which would be most useful.

I know also that my addition of the swap_author_names function is probably affecting results. I'm guessing you will point that to me to do something about at some point, but if you felt the urge...
Old 04-17-2011, 12:06 PM   #8
kovidgoyal
creator of calibre
How do you see that working? The tests are run on every result returned by the plugin. Different tests could fail on different results, so there's no clear "test that failed".
Old 04-17-2011, 12:12 PM   #9
kiwidude
calibre/Sigil Developer
What I have done on my local version is put prints() calls inside the test functions, like this:
Code:
def isbn_test(isbn):
    isbn_ = check_isbn(isbn)

    def test(mi):
        misbn = check_isbn(mi.isbn)
        if misbn and misbn == isbn_:
            return True
        prints('ISBN test failed. Expected: %s found %s'%(isbn_, misbn))
        return False

    return test
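A variation on the same idea, sketched here with plain dicts instead of Metadata objects: each test returns a message alongside its verdict, and the runner tags every message with the index of the result it came from, so the log shows which test failed on which result. The names and the (ok, message) return shape are assumptions for the sketch and differ from test.py, where the tests return a bare bool.

```python
def title_test(expected):
    expected = expected.lower()

    def test(mi):
        found = (mi.get('title') or '').lower()
        if found == expected:
            return True, ''
        return False, 'Title test failed. Expected: %r found: %r' % (expected, found)

    return test


def first_passing(results, tests):
    """Return the first result passing every test, plus a failure log."""
    log = []
    for index, mi in enumerate(results):
        failures = [msg for ok, msg in (t(mi) for t in tests) if not ok]
        if not failures:
            return mi, log
        # Record which result each failure belongs to.
        log.extend('result %d: %s' % (index, msg) for msg in failures)
    return None, log
```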
Old 04-17-2011, 12:35 PM   #10
kovidgoyal
creator of calibre
That will generate rather a lot of log spam, but I suppose it's worth it. Send me a patch and I'll merge it.
Old 04-17-2011, 01:09 PM   #11
kiwidude
calibre/Sigil Developer
Ok, will do. By log spam you mean "build log" when you run your tests right? Or are these functions being called from elsewhere too?

I had a struggle with setting pubdate. With Goodreads they have some very variable date formats and I have a bespoke routine that parses them and converts into a year, month, day value. I was using this in my previous plugin which worked fine:
Code:
mi.pubdate = datetime.date(year, month, day)
However it is blowing up inside identify.py as per this:
Code:
calibre, version 0.7.55
ERROR: Download failed: Failed to download metadata. Click Show Details to see details

Traceback (most recent call last):
  File "D:\CalibreDev\latest\calibre\src\calibre\gui2\metadata\single_download.py", line 365, in run
  File "D:\CalibreDev\latest\calibre\src\calibre\ebooks\metadata\sources\identify.py", line 355, in identify
  File "D:\CalibreDev\latest\calibre\src\calibre\ebooks\metadata\sources\identify.py", line 250, in merge_identify_results
  File "D:\CalibreDev\latest\calibre\src\calibre\ebooks\metadata\sources\identify.py", line 99, in finalize
  File "D:\CalibreDev\latest\calibre\src\calibre\ebooks\metadata\sources\identify.py", line 144, in merge_isbn_results
  File "D:\CalibreDev\latest\calibre\src\calibre\ebooks\metadata\sources\identify.py", line 211, in merge
TypeError: can't compare datetime.datetime to datetime.date

So I changed it to this which seems to work:
Code:
from calibre.utils.date import utc_tz
return datetime.datetime(year, month, day, tzinfo=utc_tz)
I know the other plugins use parse_date, but it won't handle the parsing I have to do. Is my workaround above ok?
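A standalone sketch of why the first version fails and the second works: Python refuses to order a datetime.datetime against a plain datetime.date, so every pubdate has to be a datetime. Here datetime.timezone.utc stands in for calibre's utc_tz so the snippet runs without calibre, and defaulting the month and day to 1 is an assumption for sources that give only a year.

```python
import datetime


def make_pubdate(year, month=1, day=1):
    # Always build a timezone-aware datetime, never a bare date.
    return datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)


def compare_fails(a, b):
    """Return True if ordering a against b raises TypeError."""
    try:
        a < b
    except TypeError:
        return True
    return False
```

For example, compare_fails(make_pubdate(2011, 4, 17), datetime.date(2011, 4, 17)) is True, while comparing two values from make_pubdate works.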
Old 04-17-2011, 01:12 PM   #12
kovidgoyal
creator of calibre
Just spam in the test log.

That is fine, the fact that datetime.date worked previously was purely accidental.
Old 04-17-2011, 01:25 PM   #13
kiwidude
calibre/Sigil Developer
Cool, thx Kovid.

FYI I decided to override get_title_tokens as it created problems in a few too many edge cases with Goodreads search. The issue is the replacing of all those characters with a space. Some examples: "Charlotte's Web" becomes "charlotte+s+web", "1,000 places" becomes "1+000+places", "Catch-22" becomes "Catch 22" and so on. I also strip off anything in parentheses to get rid of things like "(Omnibus)" and "(2010)" in the title.

So at the moment my function has something like this:
Code:
title_patterns = [(re.compile(pat, re.IGNORECASE), repl) for pat, repl in
            [
                (r'(\(.*\))', ''),
                (r'\d+(,)\d+', ''),
                (r'(\s-)', ' '),
                (r'''[']''', ''),
                (r'''[:,;+!@#$%^&*(){}.`~"\s\[\]/]''', ' ')
            ]]
for pat, repl in title_patterns:
    title = pat.sub(repl, title)
It obviously isn't perfect but hopefully it does more good than harm
Old 04-17-2011, 02:01 PM   #14
kovidgoyal
creator of calibre
I'm a little leery of the first two regexes.

Deleting anything inside parentheses will break title searching for titles that have parentheses as part of their names.

Also why delete digits,digits?

EDIT: I'm guessing you meant the second regex to be: (r'(\d+),(\d+)', r'\1\2')

Last edited by kovidgoyal; 04-17-2011 at 02:07 PM.
Old 04-17-2011, 02:11 PM   #15
kovidgoyal
creator of calibre
Here's an (IMO) better set of regexes:

Code:
                (r'(?i)[({\[](\d{4}|omnibus|anthology|hardcover|paperback|mass\s*market|edition|ed\.)[\])}]', ''),
                (r'(\d+),(\d+)', r'\1\2'),
                (r'(\s-)', ' '),
                (r"'", ''),
                (r'''[:,;+!@#$%^&*(){}.`~"\s\[\]/]''', ' ')
You can add more words to the first regex if you think of any
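To see what this set does to the examples from earlier in the thread, the patterns can be dropped into a small standalone function. clean_title and the final whitespace collapse are additions for the sketch, and "(Un)Lucky" is a made-up title to show that parentheses that are part of a name are not swallowed whole.

```python
import re

title_patterns = [(re.compile(pat), repl) for pat, repl in [
    # Remove only known bracketed noise such as (2010) or (Omnibus).
    (r'(?i)[({\[](\d{4}|omnibus|anthology|hardcover|paperback|mass\s*market|edition|ed\.)[\])}]', ''),
    # Join digit groups: 1,000 -> 1000.
    (r'(\d+),(\d+)', r'\1\2'),
    # A hyphen preceded by whitespace becomes a space.
    (r'(\s-)', ' '),
    # Drop apostrophes: Charlotte's -> Charlottes.
    (r"'", ''),
    # Remaining punctuation and whitespace become a single space each.
    (r'''[:,;+!@#$%^&*(){}.`~"\s\[\]/]''', ' '),
]]


def clean_title(title):
    for pat, repl in title_patterns:
        title = pat.sub(repl, title)
    # Collapse runs of spaces left behind by the substitutions.
    return ' '.join(title.split())
```

Note that "Catch-22" passes through unchanged, since the hyphen rule only fires when the hyphen follows whitespace.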