Old 05-14-2010, 01:58 PM   #1920
mwheinz
award-winning bozo
Posts: 258
Karma: 172703
Join Date: Sep 2009
Location: Philadelphia
Device: Kobo Libra 2
Quote:
Originally Posted by Starson17 View Post
Malformed html can be problematical. You may want to look at the soup output from preprocess_html and then use preprocess_regexps to delete material you need to get rid of.
Yeah - I've been trying to traverse the soup with this:

Code:
    def preprocess_html(self, soup):
        for item in soup.body:
            print 'MHEINZ: [[['
            print item
            print ']]] MHEINZ\n\n'
        return soup
but the output I'm getting is weird - as if it were processing multiple items at once (while I'm comfortable in various C dialects, I am not a Python coder). I'm seeing things like multiple "[[[" lines in a row before a "]]]" line.

Overall, though, it looks like soup is parsing to a particular depth and then stopping - it looks like there's a vast blob of html that it is treating as a blob of text.
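
One way to test the depth theory might be to print an outline of the tree instead of the raw items. Looping over soup.body only visits its direct children, and printing a child dumps its whole subtree, so the nesting is hard to see. The sketch below assumes that tags expose .name and .contents and that text nodes behave like plain strings, which is my understanding of BeautifulSoup. The outline helper and the FakeTag stand-in are hypothetical names of my own; FakeTag is only there so the snippet runs without calibre, and it is written in Python 3 syntax.

Code:
```python
def outline(node, depth=0, lines=None):
    """Collect one line per node, indented by nesting depth."""
    if lines is None:
        lines = []
    indent = '  ' * depth
    children = getattr(node, 'contents', None)
    if children is None:
        # No .contents: treat it as a text node and report only its size.
        text = node.strip()
        if text:
            lines.append('%stext (%d chars)' % (indent, len(text)))
    else:
        lines.append('%s<%s>' % (indent, node.name))
        for child in children:
            outline(child, depth + 1, lines)
    return lines


class FakeTag(object):
    """Hypothetical stand-in for a soup tag (assumption: real tags
    offer the same .name and .contents attributes)."""
    def __init__(self, name, *contents):
        self.name = name
        self.contents = list(contents)


# A nested <div><p> followed by one large, unparsed run of text.
body = FakeTag('body', '\n', FakeTag('div', FakeTag('p', 'Hello')), 'x' * 500)
for line in outline(body):
    print(line)
```
In a recipe, the idea would be to call outline(soup.body) inside preprocess_html and print the lines. If the parser really is giving up at some depth, I'd expect a single large "text (N chars)" entry at the point where nested tags should appear.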