Old 05-14-2010, 03:16 PM   #1921
Starson17
Quote:
Originally Posted by mwheinz
Yeah - I've been trying to traverse the soup with this:

Code:
    def preprocess_html(self, soup):
        for item in soup.body:
            print 'MHEINZ: [[['
            print item
            print ']]] MHEINZ\n\n'
        return soup
I usually just do this:
Code:
    def preprocess_html(self, soup):
        print 'The soup is: ', soup
        return soup
The purpose is to just see the html and pick out what I want to remove.
Quote:
Overall, though, it looks like soup is parsing to a particular depth and then stopping: there's a vast blob of HTML that it is treating as a blob of text.
That's why I suggested using preprocess_regexps. You can pick any chunk of the "vast blob" out and discard it. BeautifulSoup does a great job of handling malformed HTML, but it's not perfect. Discarding junk based on tags presumes that the part you want to discard can be identified by tags. If it can't, use regexp-based methods instead: match the text at the start and at the end of the blob you want to remove, without regard to whether that blob is marked with tags.
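Here is a rough sketch of the start/end matching idea, not taken from any real recipe. In a recipe, preprocess_regexps is a list of (compiled regex, callable) pairs, where the callable takes the match object and returns the replacement string. The two marker strings and the apply_rules helper are made up for illustration: the helper only stands in for what calibre does with the list, so the snippet can be tried outside calibre. Substitute whatever text actually brackets the blob in your page.

```python
import re

# Hypothetical markers: use whatever text reliably brackets the blob
# in the page you are actually fetching.
BLOB_START = '<!-- begin sidebar -->'
BLOB_END = '<!-- end sidebar -->'

# Same shape as a recipe's preprocess_regexps attribute: a list of
# (compiled pattern, callable taking the match and returning a string).
preprocess_regexps = [
    (re.compile(re.escape(BLOB_START) + r'.*?' + re.escape(BLOB_END),
                re.DOTALL | re.IGNORECASE),
     lambda match: ''),
]


def apply_rules(raw_html, rules):
    """Hypothetical stand-in: apply each rule in order to the raw HTML."""
    for pattern, func in rules:
        raw_html = pattern.sub(func, raw_html)
    return raw_html


# The junk between the markers is malformed (unclosed tags), which does
# not matter here because the match is on plain text, not on tags.
sample = ('<p>Keep this.</p><!-- begin sidebar --><div><b>unclosed junk\n'
          '<table><!-- end sidebar --><p>And this.</p>')
cleaned = apply_rules(sample, preprocess_regexps)
```

re.DOTALL lets the blob span multiple lines, and the non-greedy .*? stops at the first end marker, so two separate blobs on one page are removed individually and the good text between them is kept.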