Quote:
Originally Posted by Starson17
Malformed html can be problematical. You may want to look at the soup output from preprocess_html and then use preprocess_regexps to delete material you need to get rid of.
Yeah - I've been trying to traverse the soup with this:
Code:
def preprocess_html(self, soup):
    # Walk the direct children of <body> and dump each one
    for item in soup.body:
        print 'MHEINZ: [[['
        print item
        print ']]] MHEINZ\n\n'
    return soup
but the output I'm getting is weird - as if it were processing multiple items at once (while I'm comfortable in various C dialects, I am not a Python coder). I'm seeing things like multiple "[[[" lines in a row before a "]]]" line.
Overall, though, it looks like soup is parsing to a particular depth and then stopping - there's a vast blob of html that it seems to be treating as a single blob of text.
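For what it's worth, a minimal sketch of what I think may be going on (this uses the standalone bs4 package for illustration - calibre bundles its own BeautifulSoup, where the deep-traversal call is spelled differently, but the shallow-iteration behaviour is the same):

Code:
# Iterating soup.body yields only the *direct* children of <body>,
# including bare NavigableStrings (e.g. whitespace between tags).
# A nested tag prints as one whole blob, which could look like the
# parser "stopping" at a certain depth.
from bs4 import BeautifulSoup, NavigableString

html = "<html><body><div><p>one</p><p>two</p></div>text</body></html>"
soup = BeautifulSoup(html, "html.parser")

# Shallow walk: only two items here - the whole <div> subtree as
# one blob, and the string "text".
for item in soup.body:
    print('MHEINZ: [[[')
    print(item)
    print(']]] MHEINZ')

# Deep walk: .descendants visits every nested tag and string.
for node in soup.body.descendants:
    kind = 'text' if isinstance(node, NavigableString) else node.name
    print(kind)

So if the goal is to visit every element no matter how deeply nested, a recursive walk (or findAll, which is recursive by default) would be the thing to use rather than iterating soup.body directly.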