MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

Starson17 · 05-14-2010, 02:16 PM

Quote:

Originally Posted by mwheinz

Yeah - I've been trying to traverse the soup with this:

Code:

    def preprocess_html(self, soup):
        for item in soup.body:
            print 'MHEINZ: [[['
            print item
            print ']]] MHEINZ\n\n'
        return soup

I usually just do this:

Code:

    def preprocess_html(self, soup):
        # Dump the whole parsed page to the log so the unwanted parts can be picked out
        print 'The soup is: ', soup
        return soup

The purpose is just to see the HTML and pick out what I want to remove.

Quote:

Overall, though, it looks like soup is parsing to a particular depth and then stopping - it looks like there's a vast blob of html that it is treating as a blob of text.

That's why I suggested using preprocess_regexps. You can pick any chunk of the "vast blob" out and discard it. BeautifulSoup does a great job of handling malformed HTML, but it's not perfect. Discarding junk based on tags presumes that the part you want to discard can be identified by tags. If it can't, you can use a regexp to match the start and end of the text blob you want to remove, whether or not that blob is marked with tags.
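
Here's a rough sketch of the regexp approach. The list has the same shape as a recipe's preprocess_regexps attribute: pairs of a compiled pattern and a function that takes the match and returns the replacement text. The two marker strings and the sample page are made up for illustration, and apply_rules is only my stand-in for calibre running the substitutions on the downloaded HTML, not calibre's own code.

```python
import re

# Hypothetical markers: substitute whatever text brackets the junk on the real page.
JUNK_START = '<!-- begin sidebar -->'
JUNK_END = '<!-- end sidebar -->'

# Same shape as a recipe's preprocess_regexps attribute:
# a list of (compiled pattern, function(match) -> replacement string).
preprocess_regexps = [
    (re.compile(re.escape(JUNK_START) + r'.*?' + re.escape(JUNK_END),
                re.DOTALL | re.IGNORECASE),  # DOTALL lets .*? span line breaks
     lambda match: ''),                      # discard everything that matched
]


def apply_rules(raw_html, rules=preprocess_regexps):
    """Illustrative stand-in: run each substitution rule over the raw HTML text."""
    for pattern, replacement in rules:
        raw_html = pattern.sub(replacement, raw_html)
    return raw_html


# Made-up page with an unclosed tag inside the junk, the kind of thing a parser may choke on.
sample = ('<body><p>Article text</p><!-- begin sidebar --><div><b>unclosed junk\n'
          '<!-- end sidebar --><p>More text</p></body>')
cleaned = apply_rules(sample)
```

In a real recipe you would keep only the preprocess_regexps list (as a class attribute) plus the import. The non-greedy .*? matters: it stops at the first end marker, so anything after it survives.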