Hey Starson17, help!
Quote:
Originally Posted by Starson17
Aren't recipes fun!
It's likely because of the order in which the various stages of the recipe are processed. I've certainly seen this. Once you get to the point where you're building your own pages from the soup (and that's what the multipage code does), you don't get the expected behavior.
I believe keep_only_tags throws away the unwanted tags during the initial page pull, but it doesn't apply to the extra pages you're pulling in with the soup2 = self.index_to_soup(nexturl) step.
There are lots of solutions; in fact, your recipe already uses one, extract(), to remove a tag. Just find the tags and extract them.
I usually do this at the postprocess_html stage with something like this:
Code:
for tag in soup.findAll('form', attrs={'name': 'comments_form'}):
    tag.extract()
for tag in soup.findAll('font', attrs={'id': 'cr-other-headlines'}):
    tag.extract()
extract() removes the tag entirely from the original soup, leaving you with two independent soups. In your recipe you keep the extracted tag, but it works just as well as a way to remove a tag from the original soup, much like remove_tags does.
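A minimal illustration of what extract() does, using BeautifulSoup directly outside of any recipe (the HTML snippet here is made up for the example; calibre bundles its own BeautifulSoup, but the extract() behavior is the same):

```python
from bs4 import BeautifulSoup

html = '<div><p id="keep">Article text</p><form name="comments_form">comments</form></div>'
soup = BeautifulSoup(html, 'html.parser')

# extract() detaches the tag from the tree and returns it,
# leaving two independent soups
removed = soup.find('form', attrs={'name': 'comments_form'}).extract()

print(soup)     # the form is gone from the original soup
print(removed)  # the extracted tag is its own separate tree
```

If you only want the tag gone (as with remove_tags), you can just call extract() and ignore the return value.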
I've been having trouble making this work; adding this to the end of the recipe just breaks it. Can I get a more detailed example? I did read something about first_fetch, but I'm not sure how to use it. Is there another recipe I could look at as an example?
Code:
def postprocess_html(self, soup, first_fetch):
    for tag in soup.findAll('div', attrs={'class': 'article-info clearfix'}):
        tag.extract()
    return soup