MobileRead Forums - View Single Post - How to handle badly formed xml from web page?

kiwidude · 02-18-2011, 10:45 PM

Ok, I decided to write out the offending http content to a file, and I discovered I was wrong about the cause being a missing CDATA opening element (it must have gone missing somehow when I printed to a debug window).

I have attached the xml file. I believe the problem is perhaps the "special characters" inside the description fields within CDATA. The parse error points to line 32, column 25, which makes it look like some sort of encoding issue.
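For anyone wanting to check what actually sits at the reported position, here is a minimal sketch. It assumes Python 3 and the standard-library ElementTree parser rather than the parser my code actually uses, and the helper name show_parse_error_context is made up for illustration. It returns the line, the column, and a repr() of the text around that spot, so invisible or odd characters show up as escapes. The column expat reports may not line up exactly with character offsets when the line has multi-byte characters, which is why it shows a window rather than a single character.

```python
import xml.etree.ElementTree as ET


def show_parse_error_context(xml_text, width=20):
    """Return (line, column, repr of nearby text) for the first parse
    error, or None if the text parses cleanly.

    Hypothetical helper for illustration only.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is (line, column); line is 1-based, column 0-based.
        line, col = err.position
        # Split on '\n' only; str.splitlines() also breaks on other
        # characters and could throw the line count off.
        lines = xml_text.split('\n')
        bad_line = lines[line - 1] if 0 < line <= len(lines) else ''
        start = max(col - width, 0)
        return line, col, repr(bad_line[start:col + width])
    return None
```

Calling it on the decoded response text and printing the result should make a stray control character or replacement character near line 32 easy to spot.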

It wouldn't be the first time with Goodreads, as chaley will attest: they have a habit of sending headers saying 'utf-8' and then including bytes that are not valid utf-8. I am already decoding using .decode('utf-8', errors='replace'). However, while that trick worked for my html web scraping issues, it still isn't sufficient for the xml parser to work as currently coded (or the recovery parser).
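One possible explanation, and it is only a guess until the attached file is inspected, is that the response also contains control characters that are valid utf-8 bytes but not legal in XML 1.0, which errors='replace' would leave untouched. A rough sketch of stripping those before parsing follows. It assumes Python 3 and the standard-library ElementTree parser, and it assumes the document either declares utf-8 or declares no encoding. The names parse_lenient and _ILLEGAL_XML_CHARS are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Characters XML 1.0 does not allow: control characters other than
# tab, LF and CR, plus U+FFFE and U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def parse_lenient(raw_bytes):
    """Decode leniently, drop characters XML forbids, then parse.

    Hypothetical helper; assumes the payload is meant to be utf-8.
    """
    # Invalid utf-8 sequences become U+FFFD instead of raising.
    text = raw_bytes.decode('utf-8', errors='replace')
    # Remove characters the parser would reject as invalid tokens.
    text = _ILLEGAL_XML_CHARS.sub('', text)
    # Re-encode so the parser sees bytes matching a utf-8 declaration.
    return ET.fromstring(text.encode('utf-8'))
```

This silently drops data, so it is better suited to salvaging descriptions than to anything where exact content matters.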