Old 06-04-2010, 07:23 PM   #2049
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted
I have a couple more questions
Aren't recipes fun!

Quote:
Using the keep_only_tags to isolate the article body then the remove_tags to pick out the bits I don't want works great for the 1st page but the tags removed come back on the 2nd page and the rest of the article.
The tag names are the same as the 1st page, not sure why they're not being removed after the 1st page.
It's likely because of the order in which the various stages of the recipe are processed. Once you get to the point where you are building your own pages from the soup (and that's what the multipage code does), you don't get the expected behavior.

I believe keep_only_tags throws away the unwanted tags during the initial page fetch, but it doesn't apply to the extra pages you fetch yourself with the soup2 = self.index_to_soup(nexturl) step.

There are lots of solutions; in fact, your recipe already uses one, extract(), to remove a tag. Just find the unwanted tags and extract them.

I usually do this at the postprocess_html stage with something like this:
Code:
        for tag in soup.findAll('form', attrs={'name': 'comments_form'}):
            tag.extract()
        for tag in soup.findAll('font', attrs={'id': 'cr-other-headlines'}):
            tag.extract()
extract() removes the tag entirely from the original soup, leaving you with two independent soups. In your recipe you keep the extracted tag and use it, but extract() works equally well purely as a removal tool, just like remove_tags.
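Here is a standalone sketch of that behavior. I'm using the modern bs4 package here (calibre recipes of this era bundled BeautifulSoup 3, but findAll() and extract() behave the same way), and the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

html = ('<div id="body"><p>Article text</p>'
        '<form name="comments_form"><input/></form></div>')
soup = BeautifulSoup(html, 'html.parser')

removed = None
for tag in soup.findAll('form', attrs={'name': 'comments_form'}):
    # extract() detaches the tag from the tree and returns it,
    # so afterwards you hold two independent soups
    removed = tag.extract()

# the form is now gone from the original soup, but still
# fully usable via the 'removed' reference
```

After this runs, str(soup) no longer contains the form, while `removed` is a complete, detached <form> tree of its own.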

Quote:
2nd question,
I've started the pcper.com recipe and managed to get the multi-page handling to work on it. The problem here is that after the last page of the article, they add a link that takes you back to the home page, under the same tag that the page links were scraped from. The links for the pages all start with "article.php?"; after the last page, the link changes to "content_home.php?".

So is there a way to make the soup only scrape the links that start with "article.php?"?

Thanks
Hmmm. It sounds like you are saying that:
Code:
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
the pager <a> tag on the last page has an href pointing to a content_home.php? link? If so, why not test whether the pager['href'] string contains 'article', instead of just testing if pager:? You can use the string's .find() method for that.
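Something along these lines should work. Again this uses bs4 for illustration, and the sample pager markup and the nexturl variable are just stand-ins for whatever your recipe actually does:

```python
from bs4 import BeautifulSoup

# last-page case: the "next" pager link loops back to the home page
last = BeautifulSoup('<a class="next" href="content_home.php?aid=1">Home</a>',
                     'html.parser')
pager = last.find('a', attrs={'class': 'next'})
nexturl = None
# only follow the link if it points at another article page
if pager and pager['href'].find('article.php') != -1:
    nexturl = pager['href']

# normal case: the pager points at the next article page
mid = BeautifulSoup('<a class="next" href="article.php?aid=1">Next</a>',
                    'html.parser')
pager2 = mid.find('a', attrs={'class': 'next'})
nexturl2 = None
if pager2 and pager2['href'].find('article.php') != -1:
    nexturl2 = pager2['href']
```

On the last page nexturl stays None, so the recipe stops appending pages instead of pulling in the home page.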