Metadata API - prioritising results

kiwidude · 05-09-2011, 08:48 AM

Having spent about 3 hours adding print statements and tracing through this code, I am pulling my hair out, so perhaps someone (most likely Kovid) can enlighten me.

I have a book which returns three ISBN matches from B&N:
9781101135570
9780451229939
9780594256113 - has no cover

These results then get sorted by the InternalMetadataCompareKeyGen to:
9780451229939
9781101135570
9780594256113

All good so far - the results are sorted in order of "preferred match".

Then the lookup via XISBN takes place, and it finds the ISBNs ending in 939/570 are in the same "pool". So it takes the first result and discards the second. So now we have:
9780451229939
9780594256113
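The pool step described above can be sketched roughly as follows. This is only an illustration, not calibre's code: `isbn_to_pool` and `keep_first_per_pool` are hypothetical names, and the pool mapping is hard-coded here instead of coming from an XISBN lookup.

```python
# Hypothetical pool lookup: maps an ISBN to a pool id. The real data
# would come from the XISBN service; these values are made up.
isbn_to_pool = {
    "9780451229939": "pool-a",
    "9781101135570": "pool-a",
}


def keep_first_per_pool(sorted_isbns, isbn_to_pool):
    """Keep only the first (highest priority) ISBN seen for each pool."""
    seen_pools = set()
    kept = []
    for isbn in sorted_isbns:
        # An ISBN with no known pool ends up in a pool of its own.
        pool = isbn_to_pool.get(isbn, isbn)
        if pool in seen_pools:
            continue
        seen_pools.add(pool)
        kept.append(isbn)
    return kept


sorted_isbns = ["9780451229939", "9781101135570", "9780594256113"]
survivors = keep_first_per_pool(sorted_isbns, isbn_to_pool)
print(survivors)  # ['9780451229939', '9780594256113']
```

Because the input is already sorted by preference, the surviving member of each pool is the preferred one.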

Now the final merge takes place. HOWEVER the merge of identifiers in ISBNMerge.merge() is done by this code:

Code:

        for r in results:
            ans.identifiers.update(r.identifiers)

Which effectively says take the LAST isbn value from the results it is given, as each "update" will overwrite the ISBN set previously?

So now my final result being given back is given the ISBN of 9780594256113 - which is the ISBN that does NOT have a cover, and is my least preferred match.
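The overwrite can be reproduced with plain dicts standing in for each result's `identifiers` (an assumption for illustration; the real objects are calibre metadata results):

```python
# Stand-ins for r.identifiers, listed in priority order (best first).
results = [
    {"isbn": "9780451229939"},  # preferred match
    {"isbn": "9780594256113"},  # least preferred, no cover
]

merged = {}
for identifiers in results:
    # dict.update overwrites any key that is already present,
    # so the last result in the list wins for "isbn".
    merged.update(identifiers)

print(merged["isbn"])  # 9780594256113
```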

Is this a bug or am I missing something?

chaley · 05-09-2011, 09:17 AM

Just for grins, what happens if you change that code to be:

Code:

for r in reversed(results):
    ans.identifiers.update(r.identifiers)

My thought is that if results is sorted highest-priority first, then processing them in reverse will leave the one you want. However, if identifiers is not ordered, then this suggestion is bogus.
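A minimal sketch of that idea, assuming `results` is a list sorted best-first and each entry is a plain dict of identifiers. The keys `source_a` and `source_b` are made up for the example, and `merge_identifiers` is a hypothetical helper, not a calibre function.

```python
def merge_identifiers(results):
    """Merge identifier dicts so that earlier (higher priority) results win."""
    merged = {}
    # Walk from lowest to highest priority; each update overwrites
    # clashing keys, so the best result is applied last and wins.
    for identifiers in reversed(results):
        merged.update(identifiers)
    return merged


results = [
    {"isbn": "9780451229939", "source_a": "a-456"},  # best match
    {"isbn": "9780594256113", "source_b": "b-123"},  # lower priority
]

merged = merge_identifiers(results)
print(merged["isbn"])  # 9780451229939
```

Identifiers that only appear in a lower priority result (here `source_b`) are still kept; only clashing keys are decided by priority.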

EDIT: Never mind. 'results' is not a list.

kiwidude · 05-09-2011, 09:28 AM

Hi Charles - I'm not sure what you mean by "identifiers is not ordered"?

Certainly that now gives me the result I want. I guess the question is was this intentional or an accidental oversight by Kovid?

It seems "all bets are off" when it comes to merge time - it isn't the case of take the first result then merge the rest into it which is what I would have thought from how the "old code" used to work. It is a case of merging each field in isolation which could come from any result left at that point (after XISBN pool rationalising). So your net result could be a "mish-mash" of data from the results you have returned. Perhaps most of the time this isn't a problem, it just wasn't quite what I assumed would be the effect of prioritising results.

chaley · 05-09-2011, 09:36 AM

Never mind again. I thought that "results" was a dict, but it is instead a list, so reversing it may make sense.

The problem with reversing dicts is that the key traversal order is not defined. It isn't particularly useful reversing an undefined order.

Your original comment, "most likely Kovid" is probably right. It isn't clear to me that the order of items in 'result' is significant. It might be accidental that the one you want is or isn't first.

Edit: I edited, then you edited, then I edited. Is this a multi-threaded conversation?

kiwidude · 05-09-2011, 10:04 AM

Haha, yeah hard to keep track of isn't it?

My initial response would be automatically to say that yes, order is significant. However, that response rests on an assumption that, on inspection of the code (as per my edits), is incorrect. Right now it seems to me that the only place order is being "respected" is in the part of the code that creates the XISBN pools - anything but your first result for a pool will be discarded.

I would have "thought" that priority should continue to play a part when it came to merging identifiers as well - as per my example above of the two results I am left with, one is my "excellent" match with a high quality cover and lots of good metadata, the other is a lower quality match because it has no cover (it could also be the case that it has very little metadata as well). So as a user you would want the hyperlinked ids of ISBN/B&N/Goodreads or whatever in the book details panel to be going to your "best match"; likewise, if you did ctrl+d on it again to get fresh metadata, it will now use the ISBN as a lookup, so again you want your "best".

I never found this issue with my Goodreads plugin because I only return one result (well, unless you enable the option to search multiple editions, but as that is slower I turned it off by default, so I tested it less).

Perhaps it is just a simple Kovid oversight; it is not often I have the confidence in my understanding of the code to call it officially a bug.

kovidgoyal · 05-09-2011, 10:56 AM

I'm confused, I cannot look at the code right now, but IIRC, merging only happens for results in the same pool. Either ISBN pool or title/author pool.

In the first case what you describe cannot happen, since the results are in separate pools. Is it happening in the second case? If it is, then I'm not sure what can be done about it, since in general the results come from merging metadata from different metadata sources, and therefore comparing priorities is meaningless.

The only fix I can see for this is to have a pre-ISBN-merge filter that throws away lower priority results from each source when a result with the same title and author exists that has a higher priority.

kiwidude · 05-09-2011, 11:22 AM

It is a title/author search, with multiple results from the same source (I only have one source enabled atm).

That third result (which ends up being the "chosen result") has no matches in XISBN, so it ends up in a "pool of its own" rather than being merged with the other two. Quite why that is I don't know; it is just a hardback edition of the same book, so perhaps either B&N have the wrong ISBN or the XISBN database is out of date. Or maybe that is expected.

Here are the search results on B&N.

I confess to not entirely understanding all the voodoo going on underneath or its intentions. I can only tell you what my print statements say is getting merged at various points in the process.

And the net result is the "wrong" one in this case imho.

kovidgoyal · 05-09-2011, 11:31 AM

Well, like I said, the only solution is to throw away results with the same title/author and lower priority from the same source, before merging. Open a ticket for it and I will implement it when I return.
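One way such a pre-merge filter might look is sketched below. This is not the implementation that went into calibre: the `Result` class, its `rank` field (0 = best) and `drop_lower_priority` are hypothetical, and the title and author values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Result:
    source: str
    title: str
    authors: tuple
    isbn: str
    rank: int  # hypothetical priority: 0 is the best match


def drop_lower_priority(results):
    """Per source, keep only the best-ranked result for each title/author."""
    best = {}
    for r in results:
        # Compare title and authors case-insensitively within one source.
        key = (r.source, r.title.lower(), tuple(a.lower() for a in r.authors))
        if key not in best or r.rank < best[key].rank:
            best[key] = r
    return sorted(best.values(), key=lambda r: r.rank)


results = [
    Result("B&N", "Example Title", ("Example Author",), "9780451229939", 0),
    Result("B&N", "Example Title", ("Example Author",), "9781101135570", 1),
    Result("B&N", "example title", ("Example Author",), "9780594256113", 2),
]

filtered = drop_lower_priority(results)
print([r.isbn for r in filtered])  # ['9780451229939']
```

With the lower priority duplicates removed before pooling, the later identifier merge has only the preferred result to draw from for that source.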
