Old 02-18-2011, 10:45 PM   #5
kiwidude
calibre/Sigil Developer
Posts: 4,228
Karma: 1334002
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
OK, I decided to write the offending HTTP response content out to a file, and I discovered I was wrong about the cause being a missing CDATA opening marker (it must have gone missing somehow when I printed to a debug window).

I have attached the XML file. I believe the problem is perhaps the "special characters" inside the description fields within CDATA. The parse error points to line 32, column 25, which makes it look like there is some sort of encoding issue?

It wouldn't be the first time with Goodreads, as chaley will attest: they have a habit of sending headers saying 'utf-8' and then putting non-UTF-8 characters in. I am already decoding using .decode('utf-8', errors='replace'). However, while that trick worked for my HTML web scraping issues, it still isn't sufficient for the XML parser to work as currently coded (nor for the recovery parser).
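For anyone following along, here is a minimal sketch of one thing that might be worth trying. It assumes (this is a guess, not something confirmed from the attached file) that the remaining failure comes from characters that XML 1.0 forbids even inside CDATA, such as stray control characters, rather than from the invalid UTF-8 bytes themselves. The helper name parse_lenient is hypothetical, and it uses the standard library ElementTree rather than whatever parser the plugin actually uses.

```python
import re
import xml.etree.ElementTree as ET

# Characters XML 1.0 does not allow anywhere in a document, CDATA included.
# Assumption: these are what trips the parser; the post does not confirm it.
_INVALID_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def parse_lenient(raw_bytes):
    """Hypothetical helper: decode leniently, strip illegal chars, then parse."""
    # Invalid UTF-8 byte sequences become U+FFFD instead of raising an error.
    text = raw_bytes.decode('utf-8', errors='replace')
    # Drop characters that would make the document not well-formed.
    text = _INVALID_XML_CHARS.sub('', text)
    # Re-encode so the bytes match a 'utf-8' XML declaration, if one is present.
    return ET.fromstring(text.encode('utf-8'))
```

Calling parse_lenient(response_bytes) returns the root element if the cleaned text is well-formed. If it still fails at the same line and column, then the cause is presumably something else in the feed and this guess can be ruled out.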
Attached Files
File Type: xml GR_xml_fail_currently-reading.xml (3.9 KB, 173 views)