MobileRead Forums - View Single Post

cryzed · 11-21-2014, 05:29 AM

Did you try explicitly specifying the parser for the BeautifulSoup instance?

Code:

BeautifulSoup(markup, 'html5lib')

And if I remember correctly, the error occurred in the BaseAdapter.utf8FromSoup method. Is the BeautifulSoup instance that is passed to it a BeautifulSoup 3 instance or a BeautifulSoup 4 instance? That should depend entirely on the site adapter calling it.
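A rough sketch of how you could check which flavor of soup actually arrives there. The helper name soup_flavor is made up for illustration, and it assumes the class's module path contains "bs4" for BeautifulSoup 4 and "BeautifulSoup" for BeautifulSoup 3, which may not hold if the plugin bundles the libraries under other names:

```python
def soup_flavor(obj):
    """Guess whether obj comes from BeautifulSoup 3 or BeautifulSoup 4.

    Hypothetical helper: it only inspects the module path of the object's
    class, so it also copes with copies bundled inside another package.
    """
    parts = type(obj).__module__.split('.')
    if 'bs4' in parts:
        # BeautifulSoup 4 lives in the "bs4" package.
        return 'BeautifulSoup 4'
    if 'BeautifulSoup' in parts:
        # BeautifulSoup 3 is the single module "BeautifulSoup".
        return 'BeautifulSoup 3'
    return 'unknown'
```

Logging soup_flavor(soup) at the top of utf8FromSoup for the failing story should show whether the adapter hands over what you expect.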

If all of this checks out, the only thing I can think of is narrowing it down to the element that causes the error and extracting it (possibly via the soup instance, if that doesn't itself raise an error) before trying to turn the soup into a string, but I think you already tried something like that.
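Something along these lines is what I have in mind for the narrowing-down step. It is only a sketch: the function name is invented, and it assumes a BeautifulSoup 4 style object where find_all(True) returns all descendant tags and decode() serializes a tag to a string:

```python
def find_innermost_failures(soup):
    """Return the deepest tags that cannot be turned into a string.

    Hypothetical helper, assuming bs4-style find_all(True) and decode().
    """
    failing = []
    for tag in soup.find_all(True):
        try:
            tag.decode()  # serialize just this subtree
        except Exception:
            failing.append(tag)

    failing_ids = {id(tag) for tag in failing}
    # A parent fails whenever one of its descendants fails, so keep only
    # the failing tags that have no failing descendant of their own.
    return [
        tag for tag in failing
        if not any(id(child) in failing_ids for child in tag.find_all(True))
    ]
```

You could then call extract() on each returned tag and try to convert the soup to a string again. If the list comes back empty although the whole soup still fails, the problem is probably not tied to a single element.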

If none of this helps, I'm a bit stumped, since the html5lib library is supposed to parse HTML the same way a real browser does. I checked the code and couldn't find anything indicating that the BeautifulSoup instance is modified improperly (which can easily lead to such errors). Is it possible that the raw HTML modifications the adapter makes in places shortly beforehand are at fault?

#3517 · Last edited by cryzed; 11-21-2014 at 05:34 AM.