MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

Starson17 · 05-31-2010, 12:40 PM

Quote:

Originally Posted by kidtwisted

Hello everyone.

I need some help with a recipe for this feed:
http://www.pcper.com/rss/articles.rss

Most of the articles span several pages, I've cleaned it up a bit but I'm not sure how to scrape the complete article from the "Click here for the Detailed Review" links. Thanks!

You need to use multipage code. Here's an example from the adventuregamers.recipe builtin:

Code:

    def append_page(self, soup, appendtag, position):
        pager = soup.find('div',attrs={'class':'toolbar_fat_next'})
        if pager:
           nexturl = self.INDEX + pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'bodytext'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
        
    
    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)    
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div',attrs={'class':'toolbar_fat'})
        if pager:
           pager.extract()        
        return soup

append_page recursively looks for the next page tag ('div',attrs={'class':'toolbar_fat_next'}), gets the text and inserts it into the soup at the point where the tag was found until all pages have been inserted.

preprocess_html uses append_page to modify the html. You'll need to look for the next page tag on your site and adjust accordingly. This should get you started.

Do your testing with -vv and --test
as in:
ebook-convert pcper.recipe pcper --test -vv> pcper.txt