Old 06-26-2019, 04:38 AM   #1
Ubiquity
Member
Posts: 17
Karma: 10
Join Date: Apr 2019
Device: Android phone
Metadata download plugin help with text encoding disorder

Hello, I find this difficult to trace. I'm writing a metadata download plugin and got stuck extracting metadata from the book details page.
I'm testing with this book: Serhii Plokhy - Chernobyl: The History of a Nuclear Catastrophe

First, the author field seems to be extracted correctly into the authors string, and it prints to the log as 'Serhii Plokhy', but when I construct a Metadata structure with
Code:
mi = Metadata(title, authors)
the print(mi) at end of fetching prints
Code:
Author(s)           : S & e & r & h & i & i &   & P & l & o & k & h & y
This looks like a weird encoding, and I can't even guess what I should convert from or to. What makes it weirder is that the title, fetched in exactly the same way, is stored in the metadata properly. I'm still assuming that the web page is UTF-8 and that Python strings are interpreted as UTF-8 internally.

Second, and possibly related: I'm parsing other book details such as publisher, tags, etc. from a details table, where the data is stored as
Code:
<tr>
  <td>name</td>
  <td>value</td>
</tr>
I iterate through the table and fill in the mi fields, choosing the field by the extracted name literal. This works when the name is plain ASCII only; details whose names contain acute accents or other diacritics aren't matched by the corresponding names in the plugin code. This suggests that the name is in the wrong code page, but the fetched name literals print to the log in proper form.

Another difficulty with debugging: I can't figure out where the log.info(...), log.debug(...) and log.error(...) calls print. Calling calibre-debug opens a textual log after closing Calibre, but the log doesn't contain any debug info printed by any of these calls. The only thing that works for me is using print(...) instead, which appears in the %temp%/calibre_XXXXXX/*.log files. I need a clue on how to debug with the log properly.

Last edited by Ubiquity; 06-26-2019 at 04:47 AM.
Old 06-26-2019, 07:16 AM   #2
kovidgoyal
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
That indicates you are setting the value of authors to a string instead of a list of strings. The output of the log statements goes into the metadata download log, which you can view by clicking the View log button in the download dialog.
Old 06-27-2019, 10:15 AM   #3
eschwartz
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
Originally Posted by kovidgoyal View Post
That indicates you are setting the value of authors to a string instead of a list of strings.
As Kovid said. But also keep in mind that when you try to use a string in a place where a list is expected, the string will be turned into a list by splitting each character. Then each character is treated as a separate author name, joined by "&", and potentially gets sorted alphabetically.
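The effect can be reproduced without calibre. This is a minimal sketch, assuming only that the receiving code iterates over whatever it is given and joins the items with ' & ' for display; the display_authors helper is hypothetical and is not calibre's actual code.

```python
def display_authors(authors):
    # Hypothetical stand-in for code that expects a list of names:
    # it iterates over its argument and joins the items for display.
    return ' & '.join(list(authors))

# A plain string is iterated character by character.
as_string = display_authors('Serhii Plokhy')
# A list with one name is iterated name by name.
as_list = display_authors(['Serhii Plokhy'])

print(as_string)
print(as_list)
```

The first call reproduces the per-character output from the first post; the second prints the name intact.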
Old 06-27-2019, 02:42 PM   #4
Ubiquity
Member
I would not say that my authors string is in list form. I fetch it and join it into a result string, which initializes the Metadata structure:

Code:
authors = root.xpath('//h2[@class="authornames"]/span/a/text()')
if authors:
  authors = ' & '.join(authors).strip()
.
.
.
mi = Metadata(title, authors)
My issue seems to be related to an unknown code page in which Calibre runs the plugin.

If I parse the details page in a standalone test script, all accented string literals print in proper form. If I print the same string constants from the plugin source, they print in broken form, with the high (non-ASCII) characters shown as unreadable two-byte chunks. But both the plugin sources and the test script are saved in UTF-8 encoding.
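The two-byte symptom is consistent with UTF-8 bytes being compared against, or displayed as, decoded text. That is one plausible explanation only; nothing in this thread confirms it. For instance, if the plugin runs under Python 2, a plain 'literal' without a u prefix is a byte string, while text extracted from a parsed page is usually unicode. The sketch below uses Python 3 and a hypothetical table label to show the mismatch and the garbled rendering.

```python
# Hypothetical table label, held as decoded text.
label_text = 'Série'
# The same label as raw UTF-8 bytes, as a non-unicode literal would hold it.
label_bytes = label_text.encode('utf-8')

# Bytes and text never compare equal, even when they "look" the same.
bytes_match = (label_bytes == label_text)

# Decoding the bytes with the wrong codec turns each accented letter
# into two unrelated characters.
mojibake = label_bytes.decode('latin-1')

# Decoding with the right codec restores a value that matches.
decoded_match = (label_bytes.decode('utf-8') == label_text)

print(bytes_match, mojibake, decoded_match)
```

If this is the cause, making both sides of the comparison the same type (decoded text) would be the thing to try first.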
Old 06-27-2019, 05:20 PM   #5
thiago.eec
Wizard
Posts: 1,211
Karma: 1419583
Join Date: Dec 2016
Location: Goiânia - Brazil
Device: iPad, Kindle Paperwhite, Kindle Oasis
Quote:
Originally Posted by Ubiquity View Post
I would not say that my authors string is in list form. I fetch it and join it into a result string, which initializes the Metadata structure:

Code:
authors = root.xpath('//h2[@class="authornames"]/span/a/text()')
if authors:
  authors = ' & '.join(authors).strip()
.
.
.
mi = Metadata(title, authors)
What @eschwartz said is that calibre expects authors as a list, so you should format it as such. If you pass a string, then you will have this odd behavior you described.

So, you should initialize your variable as a list, and then append the values, like this example:

Code:
        authors = []
        for author_node in author_nodes:
            authors.append(author_node.text_content().strip())
Old 06-28-2019, 02:40 PM   #6
eschwartz
Ex-Helpdesk Junkie
Quote:
Originally Posted by Ubiquity View Post
I would not say that my authors string is in list form. I fetch it and join it into a result string, which initializes the Metadata structure:

Code:
authors = root.xpath('//h2[@class="authornames"]/span/a/text()')
if authors:
  authors = ' & '.join(authors).strip()
.
.
.
mi = Metadata(title, authors)
So that is exactly the problem. DO NOT JOIN THEM. Metadata() requires that you *not* join them yourself using ' & '.join(), so... you have explicitly converted your isinstance(authors, list) from a form that is useful and good, to a form that is useless and bad.
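A minimal sketch of the alternative, assuming the XPath call returns a list of text nodes with one entry per author; the clean_authors helper and the sample data are hypothetical:

```python
def clean_authors(raw_names):
    # Strip whitespace from each name and drop empty entries,
    # keeping the result as a list (one item per author).
    return [name.strip() for name in raw_names if name.strip()]

# Hypothetical XPath result: text nodes with stray whitespace.
raw = [' Serhii Plokhy ', '\n']
authors = clean_authors(raw)
print(authors)

# In the plugin, the list would then be passed unchanged:
# mi = Metadata(title, authors)
```

The only change from the quoted code is that the ' & '.join() step is gone, so authors stays a list.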

Quote:
Originally Posted by Ubiquity View Post
My issue seems to be related to an unknown code page in which Calibre runs the plugin.

If I parse the details page in a standalone test script, all accented string literals print in proper form. If I print the same string constants from the plugin source, they print in broken form, with the high (non-ASCII) characters shown as unreadable two-byte chunks. But both the plugin sources and the test script are saved in UTF-8 encoding.
You posted about multiple unrelated issues. I didn't say anything about this issue.