Quote:
Originally Posted by mwheinz
Yeah - I've been trying to traverse the soup with this:
Code:
def preprocess_html(self, soup):
    for item in soup.body:
        print 'MHEINZ: [[['
        print item
        print ']]] MHEINZ\n\n'
    return soup
I usually just do this:
Code:
def preprocess_html(self, soup):
    print 'The soup is: ', soup
    return soup
The purpose is to just see the html and pick out what I want to remove.
Quote:
Overall, though, it looks like soup is parsing to a particular depth and then stopping - it looks like there's a vast blob of html that it is treating as a blob of text.
That's why I suggested using preprocess_regexps. You can pick any chunk of the "vast blob" out and discard it. BeautifulSoup does a great job of handling malformed html, but it's not perfect. Discarding junk by tag presumes that the part you want to discard can be identified by tags. If it can't, you can use a regexp to match the start and end of the text you want to remove, whether or not that text is marked with tags.
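Here's a rough sketch of what I mean. The two comment markers are made up for illustration, so substitute whatever strings actually bracket the blob in your feed. In a recipe you would only need the preprocess_regexps list as a class attribute; the apply_regexps helper and the sample string are just stand-ins so the snippet runs on its own outside calibre:

```python
import re

# Each entry is (compiled pattern, function that takes the match and returns
# the replacement text). The markers below are hypothetical placeholders.
preprocess_regexps = [
    (re.compile(r'<!-- junk starts -->.*?<!-- junk ends -->',
                re.DOTALL | re.IGNORECASE),
     lambda match: ''),
]


def apply_regexps(raw_html, rules):
    # Stand-in helper (not part of calibre): run each rule over the raw
    # html string, before any parsing into soup happens.
    for pattern, replacement in rules:
        raw_html = pattern.sub(replacement, raw_html)
    return raw_html


# Made-up sample with a malformed chunk between the two markers.
sample = ('<body><p>Keep me</p>'
          '<!-- junk starts --><div><b>unclosed ...<!-- junk ends -->'
          '<p>Keep me too</p></body>')

cleaned = apply_regexps(sample, preprocess_regexps)
```

re.DOTALL lets the pattern span line breaks, and the non-greedy .*? stops at the first end marker instead of swallowing everything up to the last one. Since this works on the raw text, it doesn't matter whether the junk in between is well-formed.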