How to handle badly formed xml from web page?
The Goodreads API I use for my Calibre Goodreads sync plugin uses (mainly) xml responses to return the results. However I have found a situation where the xml being returned is "badly formed". It can't be displayed in a web browser, due to the error, nor can it be parsed using ElementTree.
I have traced the problem down to a particular field in the xml which seems to have corrupted content - it is missing the opening <![CDATA[ within the xml text (though it has the closing ]]>).
I've raised this just now as a bug on the Goodreads API forums, but given they don't seem to be very active in responding to issues, I want to try to handle this case myself. That particular description field doesn't happen to be one I need the values of. Currently I am using ElementTree to load the http content and retrieve elements, but of course it blows up in et.fromstring() when the xml is badly formed, as below: Code:
root = et.fromstring(content)
Use a recovering parser; grep the calibre source code for RECOVER_PARSER to see examples of its use.
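For anyone finding this thread later, here is a minimal sketch of what a recovering parse looks like with lxml directly - calibre's RECOVER_PARSER is an lxml parser configured along these lines (the setup below is lxml's API, not calibre's exact code):

```python
# Minimal sketch: lxml's recovering parser tolerates badly formed XML
# instead of raising XMLSyntaxError. calibre's RECOVER_PARSER is an
# lxml XMLParser set up in a similar way.
from lxml import etree

RECOVER_PARSER = etree.XMLParser(recover=True)

# The <title> tag is never closed: a strict parser would raise here.
root = etree.fromstring(b'<book><title>broken', RECOVER_PARSER)
print(etree.tostring(root))  # the parser closes the dangling tags for you
```

The recovered tree can then be queried with the usual ElementTree-style API, so the rest of the plugin code needn't change.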
Quote:
Perhaps I shall just "gracefully" handle the error with an error dialog and have to wait for Goodreads to pull finger and fix it their side. It has only occurred with one particular book so far but if it happens for one there are bound to be others.
You can also try using BeautifulStoneSoup, which may be more robust. If it parses successfully, you can use it to serialize back to xml, which should fix the problems for lxml.
But before doing so you will have to give it a list of the self-closing tags.
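A rough sketch of that round-trip approach - note BeautifulStoneSoup itself belongs to the old BeautifulSoup 3 (where you passed selfClosingTags); in the current bs4 you request the "xml" features instead, which uses lxml under the hood and handles self-closing tags on its own:

```python
# Sketch: round-trip malformed XML through BeautifulSoup to repair it,
# then hand the cleaned serialization back to a strict parser.
# bs4's "xml" parser (lxml-backed) replaces BS3's BeautifulStoneSoup.
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

broken = '<reviews><review id="1">good'   # unclosed tags
soup = BeautifulSoup(broken, 'xml')
fixed = str(soup)                         # serialized back out, now well formed

# The repaired string parses cleanly with a strict parser.
# (fixed carries an encoding declaration, so feed ET bytes, not str.)
root = ET.fromstring(fixed.encode('utf-8'))
```

Serializing through the soup and re-parsing costs an extra pass over the document, but it keeps the strict parser in the main code path.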
Ok, I decided to write out the offending http content to a file, and I discovered I was wrong about the cause being a missing CDATA opening element (it must have gone missing somehow when I printed it to a debug window). :smack:
I have attached the xml file. I believe the problem is the "special characters" inside the description fields within the CDATA sections. The parse error says line 32 column 25, which makes it look like there is some sort of encoding issue. It wouldn't be the first time with Goodreads, as chaley will attest - they have a habit of sending headers saying 'utf-8' and then putting non-utf-8 characters in the body. I am already decoding using .decode('utf-8', errors='replace'). However, while that trick worked for my html web scraping issues, it still isn't sufficient for the xml parser to work as currently coded (or for the recovery parser).
Code:
from calibre.utils.cleantext import clean_ascii_chars
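For readers without the calibre source to hand: clean_ascii_chars strips the ASCII control characters that XML 1.0 forbids, which is exactly what trips the parser on content like this even after a utf-8 decode. A stdlib-only sketch of the same idea (strip_control_chars is my illustrative name, not calibre's):

```python
# Stdlib sketch of what calibre's clean_ascii_chars does: drop the C0
# control characters that XML 1.0 disallows (everything below 0x20
# except tab, newline and carriage return), then parse as usual.
import re
import xml.etree.ElementTree as ET

_ILLEGAL = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_control_chars(raw):
    return _ILLEGAL.sub('', raw)

# \x08 (backspace) would make ET.fromstring raise a ParseError.
bad = '<description>stray \x08 backspace</description>'
root = ET.fromstring(strip_control_chars(bad))
print(root.text)
```

Running the decoded response through a cleaner like this before et.fromstring() is cheap insurance against whatever Goodreads sends next.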
:thumbsup:
Thanks Kovid - that has it working now. Brilliant.