02-18-2011, 09:21 PM   #1
kiwidude
Calibre Plugins Developer
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
How to handle badly formed xml from web page?

The Goodreads API I use for my Calibre Goodreads sync plugin uses (mainly) xml responses to return the results. However I have found a situation where the xml being returned is "badly formed". It can't be displayed in a web browser, due to the error, nor can it be parsed using ElementTree.

I have traced the problem down to a particular field in the xml which seems to have corrupted content - it is missing the opening <![CDATA[ within the xml text (though it has the closing ]]>).
Code:
<?xml version="1.0" encoding="UTF-8"?>
<GoodreadsResponse>
        <Request>
                <authentication>true</authentication>
                                <key><![CDATA[UxvtOM3ogQWjfgiCnMleA]]></key>
                    <method><![CDATA[review_list]]></method>
        </Request>
        <reviews start="1" end="1" total="1">
    <review>
  <id>149091755</id>
      <book>
  <id type="integer">43912</id>
  <isbn>0006483569</isbn>
  <isbn13>9780006483564</isbn13>
  <publication_day type="integer">3</publication_day>
  <publication_month type="integer">12</publication_month>
  <publication_year type="integer">2001</publication_year>
  <publisher>Voyager</publisher>
  <text_reviews_count type="integer">2</text_reviews_count>
  <title>
    <![CDATA[Krondor: Tear of the Gods (The Riftwar Legacy, #3)]]>
  </title>
  <image_url>http://photo.goodreads.com/books/1249516742m/43912.jpg</image_url>
  <small_image_url>http://photo.goodreads.com/books/1249516742s/43912.jpg</small_image_url>
  <link>http://www.goodreads.com/book/show/43912.Krondor</link>
  <num_pages>384</num_pages>
  <average_rating>3.64</average_rating>
  <ratings_count>1379</ratings_count>
  <description>
<br/>A DROP IN THE OCEAN?A raid upon the high seas signals an attack of unprecedented magnitude by the 
forces of darkness. For the holiest of holies, the Tear of the Gods has been lost to the Temple of Ishap. 
After a raid planned by Bear, one of the most brutal pirates to sail the Bitter Sea, goes dramatically wrong, 
the colossal gems sink below the waves.So begins a story of the Tear of the Gods, the most powerful artifact 
known to the Temples of Midkemia. For it allows the temples to speak with their gods. Without it, they are 
lost for a decade, until another gem is formed in the distant mountains.Squire James, William, and Jazhara, 
new court magician, must seek out the location of this gem, with Brother Solon, a warrior priest of Ishap, and 
Kendaric, the sole member of the Wreckers’ Guild with the power to raise the ship. They are opposed by 
the minions of Sidi, servant of the Dark God, who seeks to possess the Tear for his own ends, or to destroy 
it, denying it to the forces of light.This third tale in The Riftwar Legacy is a breathless race for a priceless treasure. 
It’s a race against time, against the myriad sinister and competing evil forces desperate for the all-powerful 
prize, and ultimately against the fundamentals of nature, which in Midkemia can be as formidable as the 
Gods themselves]]>
  </description>
<authors>
    <author>
    <id>8588</id>
        <name><![CDATA[Raymond E. Feist]]></name>
    <image_url><![CDATA[http://photo.goodreads.com/authors/1190654917p5/8588.jpg]]></image_url>
    <small_image_url><![CDATA[http://photo.goodreads.com/authors/1190654917p2/8588.jpg]]></small_image_url>
    <link><![CDATA[http://www.goodreads.com/author/show/8588.Raymond_E_Feist]]></link>
    <average_rating>3.91</average_rating>
    <ratings_count>64963</ratings_count>
    <text_reviews_count>1751</text_reviews_count>
  </author>
  </authors>  <published>2000</published>
</book>
      <rating>0</rating>
  <votes>0</votes>
    <spoiler_flag>false</spoiler_flag>
  <shelves>
            <shelf name="currently-reading" />
      </shelves>
  <recommended_for><![CDATA[]]></recommended_for>
  <recommended_by><![CDATA[]]></recommended_by>
  <started_at>Fri Feb 18 17:38:12 -0800 2011</started_at>
  <read_at></read_at>
  <date_added>Fri Feb 18 17:38:12 -0800 2011</date_added>
  <date_updated>Fri Feb 18 17:38:12 -0800 2011</date_updated>
  <read_count></read_count>
    <body><![CDATA[]]></body>
    <comments_count>0</comments_count>
  <url><![CDATA[http://www.goodreads.com/review/show/149091755]]></url>
  <link><![CDATA[http://www.goodreads.com/review/show/149091755]]></link>
</review>
  </reviews>
</GoodreadsResponse>

I've raised this just now as a bug on the Goodreads API forums, but given they don't seem to be very actively responding to issues I want to try to handle this case myself. That particular description field doesn't happen to be one I need the values of.

Currently I am using ElementTree to load the http content and retrieve elements, but of course et.fromstring() blows up when the content is badly formed, as below:
Code:
            root = et.fromstring(content)
            reviews_node = root.find('reviews')
            if reviews_node is not None:
                total = int(reviews_node.attrib.get('total'))
                end = int(reviews_node.attrib.get('end'))
                book_nodes = reviews_node.findall('review/book')
                for book_node in book_nodes:
                    book = {}
                    goodreads_id = book_node.findtext('id')
                    book['goodreads_id'] = goodreads_id
                    isbn = book_node.findtext('isbn13')
                    book['goodreads_isbn'] = isbn
                    (title, series) = self.convert_goodreads_title_with_series(book_node.findtext('title').strip())
                    book['goodreads_title'] = title
                    book['goodreads_series'] = series
                    # Grab the first author only for now
                    book['goodreads_author'] = book_node.findtext('authors/author/name')
Any suggestions as to what I could do if anything?
02-18-2011, 09:25 PM   #2
kovidgoyal
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Use a recovering parser, grep the calibre source code for RECOVER_PARSER to see examples of its use.
02-18-2011, 09:38 PM   #3
kiwidude
Quote:
Originally Posted by kovidgoyal
Use a recovering parser, grep the calibre source code for RECOVER_PARSER to see examples of its use.
Thanks Kovid, I just gave it a go, but sadly it appears its "recovery" can't recover this one.

Perhaps I shall just "gracefully" handle the error with an error dialog and have to wait for Goodreads to pull finger and fix it their side. It has only occurred with one particular book so far but if it happens for one there are bound to be others.
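
For readers following along, the "graceful" fallback described above could look roughly like the sketch below. It uses the Python 3 standard library rather than calibre's environment, and `parse_reviews` plus the sample strings are hypothetical names made up for illustration, not code from the plugin.

```python
import xml.etree.ElementTree as et

def parse_reviews(content):
    """Return the <reviews> element, or None if the response is not well-formed XML."""
    try:
        root = et.fromstring(content)
    except et.ParseError as err:
        # Report the problem and carry on instead of letting the exception propagate.
        print('Could not parse Goodreads response: %s' % err)
        return None
    return root.find('reviews')

# Hypothetical sample responses: one well-formed, one with a mismatched tag.
good = '<GoodreadsResponse><reviews start="1" end="1" total="1"/></GoodreadsResponse>'
bad = '<GoodreadsResponse><reviews></GoodreadsResponse>'
```

A caller can then check for `None` and show an error dialog instead of crashing.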
02-18-2011, 10:19 PM   #4
kovidgoyal
You can also try using BeautifulStoneSoup, which may be more robust. If it parses successfully, you can use it to serialize back to XML, which should fix the problems for lxml.

But before doing so you will have to give it a list of the self-closing tags.
02-18-2011, 10:45 PM   #5
kiwidude
Ok, I decided to write out the offending http content to a file, and I discovered I was wrong about the cause being a missing CDATA opening element (it must have gone missing somehow when I printed to a debug window).

I have attached the xml file. I believe the problem is perhaps the "special characters" inside the description fields within CDATA. The parse error says line 32 column 25 which makes it look like there is some sort of encoding issue?
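
A rough way to inspect what sits at the reported line and column is sketched below. It assumes the standard library ElementTree on Python 3, whose `ParseError` carries a `position` tuple; `describe_parse_error` and the sample text (with a stray control character) are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as et

def describe_parse_error(text):
    """Return (line, column, offending_line) for the first parse error, or None."""
    try:
        et.fromstring(text)
    except et.ParseError as err:
        line, column = err.position  # (line, column) reported by the parser
        # Split on '\n' only: str.splitlines() would also break on some control characters.
        lines = text.split('\n')
        return line, column, lines[line - 1]
    return None

# Hypothetical sample: a control character hidden on the second line.
sample = '<root>\n  <description>bad \x01 char</description>\n</root>'
result = describe_parse_error(sample)
```

Printing `repr(result[2])` makes invisible characters show up as escapes such as `\x01`.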

It wouldn't be the first time with Goodreads, as chaley will attest: they have a habit of sending headers saying 'utf-8' and then putting non-UTF-8 characters in. I am already decoding using .decode('utf-8', errors='replace'). However, while that trick worked for my HTML web scraping issues, it still isn't sufficient for the XML parser to work as coded currently (or for the recovery parser).
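
One possible reason the decode trick is not enough (an assumption, since the actual bytes in the attachment are not shown here) is that `errors='replace'` only touches bytes that are invalid UTF-8. A control character is valid UTF-8 but not a legal XML 1.0 character, so it survives decoding and still trips the parser. A small sketch with made-up data, using the Python 3 standard library:

```python
import xml.etree.ElementTree as et

# Made-up bytes: 0xE9 is invalid as UTF-8 here, 0x0B is a vertical tab.
raw = b'<description>caf\xe9 \x0bmenu</description>'

# errors='replace' swaps the invalid byte 0xE9 for U+FFFD ...
text = raw.decode('utf-8', errors='replace')

# ... but 0x0B is valid UTF-8, so it passes through untouched,
# and XML 1.0 forbids it, so the parser still rejects the document.
try:
    et.fromstring(text)
    parsed = True
except et.ParseError:
    parsed = False
```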
Attached Files
File Type: xml GR_xml_fail_currently-reading.xml (3.9 KB, 988 views)
02-18-2011, 11:53 PM   #6
kovidgoyal
Code:
from lxml import etree
from calibre import browser
from calibre.utils.cleantext import clean_ascii_chars
br = browser()
raw = br.open(url).read()  # url is the API request URL
etree.fromstring(clean_ascii_chars(raw))
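
For anyone without calibre at hand, the general idea of stripping ASCII control characters before parsing can be sketched as follows. This is not calibre's implementation of clean_ascii_chars; `strip_control_chars` is a hypothetical stand-in, the sample string is made up, and the code targets the Python 3 standard library.

```python
import re
import xml.etree.ElementTree as et

# ASCII control characters that XML 1.0 forbids:
# everything below 0x20 except tab, newline and carriage return.
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_control_chars(text):
    """Hypothetical helper: drop ASCII control characters from text."""
    return _CONTROL_CHARS.sub('', text)

# Made-up sample with a vertical tab inside the description.
dirty = '<description>A DROP IN THE OCEAN\x0b A raid upon the high seas</description>'
node = et.fromstring(strip_control_chars(dirty))
```

Tabs and line breaks are left alone, so the layout of the response is preserved.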
02-19-2011, 12:05 AM   #7
kiwidude
Thanks Kovid - that has it working now. Brilliant.