Quote:
Originally Posted by Starson17
Malformed html can be problematical. You may want to look at the soup output from preprocess_html and then use preprocess_regexps to delete material you need to get rid of.
Yeah - I've been trying to traverse the soup with this:
Code:
def preprocess_html(self, soup):
    # Walk the direct children of <body> and dump each one
    for item in soup.body:
        print 'MHEINZ: [[['
        print item
        print ']]] MHEINZ\n\n'
    return soup
but the output I'm getting is weird - as if it were processing multiple items at once (while I'm comfortable in various C dialects, I am not a Python coder). I'm seeing things like multiple "[[[" lines in a row before a "]]]" line.
Overall, though, it looks like soup is parsing to a particular depth and then stopping - there's a vast blob of html that it seems to be treating as a single blob of text.
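For what it's worth, a minimal sketch of what I think may be going on (this uses the standalone bs4 package for illustration - calibre bundles its own BeautifulSoup, where the deep-traversal call is spelled differently, but the shallow-iteration behaviour is the same):

Code:
# Iterating soup.body yields only the *direct* children of <body>,
# including bare NavigableStrings (e.g. whitespace between tags).
# A nested tag prints as one whole blob, which could look like the
# parser "stopping" at a certain depth.
from bs4 import BeautifulSoup, NavigableString

html = "<html><body><div><p>one</p><p>two</p></div>text</body></html>"
soup = BeautifulSoup(html, "html.parser")

# Shallow walk: only two items here - the whole <div> subtree as
# one blob, and the string "text".
for item in soup.body:
    print('MHEINZ: [[[')
    print(item)
    print(']]] MHEINZ')

# Deep walk: .descendants visits every nested tag and string.
for node in soup.body.descendants:
    kind = 'text' if isinstance(node, NavigableString) else node.name
    print(kind)

So if the goal is to visit every element no matter how deeply nested, a recursive walk (or findAll, which is recursive by default) would be the thing to use rather than iterating soup.body directly.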