MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

nickredding · 01-17-2010, 02:43 PM

The standard recipe for the National Post does not accommodate articles that are continued on a second page (URL). Replace the preprocess_html method with the following code to ensure the complete article is downloaded in these cases:

Code:

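    def preprocess_html(self, soup):
        # Assumes the recipe already imports BeautifulSoup
        # (from calibre.ebooks.BeautifulSoup import BeautifulSoup).
        # Article body from the first page
        story = soup.find(name='div', attrs={'class':'triline'})
        # Paragraph holding the link to the second page, if there is one
        page2_link = soup.find('p','pagenav')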
        if page2_link:
            atag = page2_link.find('a',href=True)
            if atag:
                page2_url = atag['href']
                # Turn a relative link into an absolute URL
                if page2_url.startswith('story'):
                    page2_url = 'http://www.nationalpost.com/todays-paper/'+page2_url
                elif page2_url.startswith('/todays-paper/story.html'):
                    page2_url = 'http://www.nationalpost.com'+page2_url
                page2_soup = self.index_to_soup(page2_url)
                if page2_soup:
                    page2_content = page2_soup.find('div','story-content')
                    if page2_content:
                        full_story = BeautifulSoup('<div></div>')
                        full_story.insert(0,story)
                        full_story.insert(1,page2_content)
                        story = full_story
        # Rebuild a minimal page that contains only the (possibly merged) story
        soup = BeautifulSoup('<html><head><title>t</title></head><body></body></html>')
        body = soup.find(name='body')
        body.insert(0, story)
        return soup
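
As an aside, the two startswith branches above could probably be collapsed by letting the standard library resolve the link. This is only a sketch, not part of the recipe: the helper name absolute_page2_url and the BASE constant are my own, and it assumes Python 3 (on a Python 2 build of calibre the import would be "from urlparse import urljoin").

```python
from urllib.parse import urljoin

# Assumed base for the "today's paper" section; taken from the URLs
# used in the recipe code above.
BASE = 'http://www.nationalpost.com/todays-paper/'


def absolute_page2_url(href, base=BASE):
    """Resolve a page-2 link against the section base URL.

    Handles bare links ('story.html?...'), root-relative links
    ('/todays-paper/story.html?...') and links that are already absolute.
    """
    return urljoin(base, href)
```

With a helper like this, the if/elif pair could become a single line, page2_url = absolute_page2_url(atag['href']), and a root-relative link would not pick up a doubled slash.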