Old 06-04-2010, 07:23 PM   #2049
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted
I have a couple more questions
Aren't recipes fun!

Quote:
Using the keep_only_tags to isolate the article body then the remove_tags to pick out the bits I don't want works great for the 1st page but the tags removed come back on the 2nd page and the rest of the article.
The tag names are the same as the 1st page, not sure why they're not being removed after the 1st page.
It's likely because of the order in which the various stages of the recipe are processed. Once you get to the point where you are building your own pages from the soup (and that's what the multipage code does), you don't get the expected behavior.

I believe keep_only_tags throws away the unwanted tags during the initial page fetch, but it doesn't apply to the extra pages you fetch yourself with the soup2 = self.index_to_soup(nexturl) step.

There are lots of solutions; in fact, your recipe already uses one, extract(), to remove a tag. Just find the unwanted tags and extract them.

I usually do this at the postprocess_html stage with something like this:
Code:
        for tag in soup.findAll('form', attrs={'name': 'comments_form'}):
            tag.extract()
        for tag in soup.findAll('font', attrs={'id': 'cr-other-headlines'}):
            tag.extract()
extract() removes the tag entirely from the original soup, leaving you with two independent soups. In your recipe you keep the extracted tag and use it, but extract() works equally well purely as a removal tool, just like remove_tags.
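Here is a standalone sketch of that behavior. I'm using the modern bs4 package here (calibre recipes of this era bundled BeautifulSoup 3, but findAll() and extract() behave the same way), and the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

html = ('<div id="body"><p>Article text</p>'
        '<form name="comments_form"><input/></form></div>')
soup = BeautifulSoup(html, 'html.parser')

removed = None
for tag in soup.findAll('form', attrs={'name': 'comments_form'}):
    # extract() detaches the tag from the tree and returns it,
    # so afterwards you hold two independent soups
    removed = tag.extract()

# the form is now gone from the original soup, but still
# fully usable via the 'removed' reference
```

After this runs, str(soup) no longer contains the form, while `removed` is a complete, detached <form> tree of its own.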

Quote:
2nd question,
I've started the pcper.com recipe and managed to get the multi-page handling to work on it. The problem here is that after the last page of the article, they add a link that takes you back to the home page, under the same tag that the page links were scraped from. The links for the pages all start with "article.php?"; after the last page, the link changes to "content_home.php?".

So is there a way to make the soup only scrape the links that start with "article.php?"?

Thanks
Hmmm. It sounds like you are saying that:
Code:
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
the pager <a> tag on the last page has an href pointing to a content_home.php? link? If so, why not test whether the pager['href'] string contains 'article', instead of just testing if pager:? You can use the string's .find() method for that.
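Something along these lines should work. Again this uses bs4 for illustration, and the sample pager markup and the nexturl variable are just stand-ins for whatever your recipe actually does:

```python
from bs4 import BeautifulSoup

# last-page case: the "next" pager link loops back to the home page
last = BeautifulSoup('<a class="next" href="content_home.php?aid=1">Home</a>',
                     'html.parser')
pager = last.find('a', attrs={'class': 'next'})
nexturl = None
# only follow the link if it points at another article page
if pager and pager['href'].find('article.php') != -1:
    nexturl = pager['href']

# normal case: the pager points at the next article page
mid = BeautifulSoup('<a class="next" href="article.php?aid=1">Next</a>',
                    'html.parser')
pager2 = mid.find('a', attrs={'class': 'next'})
nexturl2 = None
if pager2 and pager2['href'].find('article.php') != -1:
    nexturl2 = pager2['href']
```

On the last page nexturl stays None, so the recipe stops appending pages instead of pulling in the home page.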