Quote:
Originally Posted by kidtwisted
Hello everyone.
I need some help with a recipe for this feed:
http://www.pcper.com/rss/articles.rss
Most of the articles span several pages. I've cleaned it up a bit, but I'm not sure how to scrape the complete article from the "Click here for the Detailed Review" links. Thanks!
You need to use multipage code. Here's an example from the adventuregamers.recipe builtin:
Code:
    # Both methods belong inside the recipe class.
    def append_page(self, soup, appendtag, position):
        # Look for the "next page" link on the current page.
        pager = soup.find('div',attrs={'class':'toolbar_fat_next'})
        if pager:
            nexturl = self.INDEX + pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            # Grab the article body of the next page and strip inline styles.
            texttag = soup2.find('div', attrs={'class':'bodytext'})
            for it in texttag.findAll(style=True):
                del it['style']
            # Recurse first, so any later pages are appended to the end of this one.
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)
        for item in soup.findAll(style=True):
            del item['style']
        # Pull in all following pages, then remove the pager from the first page.
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div',attrs={'class':'toolbar_fat'})
        if pager:
            pager.extract()
        return soup
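append_page looks for the next page tag ('div',attrs={'class':'toolbar_fat_next'}). If it finds one, it fetches that page, takes its body text ('div', attrs={'class':'bodytext'}), calls itself on the new page so that any further pages get appended to the end of that text, and then inserts the result into the tag it was given (appendtag) at the given position. The recursion stops on the page that has no next page tag.
preprocess_html calls append_page on each article's first page and then removes the pager div. You'll need to find the next page tag and the body text tag on your site and adjust both accordingly. Note that self.INDEX is something that recipe sets itself to build the full URL, so you'll need your own equivalent if your site's next links are relative. This should get you started.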
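If the recursion is hard to picture, here is a toy version that runs outside calibre. It is only a sketch: the PAGES dict, the URLs, the 'next' class name and the fetch() helper are all made up for illustration (fetch() stands in for self.index_to_soup), and it uses plain regexes instead of soup. It is not the markup pcper actually uses.

```python
import re

# Hypothetical stand-in for a multi-page article: URL -> HTML.
# The markup and class names are invented, not taken from any real site.
PAGES = {
    "/article?page=1": '<div class="bodytext">Page one.</div>'
                       '<div class="next"><a href="/article?page=2">Next</a></div>',
    "/article?page=2": '<div class="bodytext">Page two.</div>'
                       '<div class="next"><a href="/article?page=3">Next</a></div>',
    "/article?page=3": '<div class="bodytext">Page three.</div>',
}

NEXT_RE = re.compile(r'<div class="next"><a href="([^"]+)">')
BODY_RE = re.compile(r'<div class="bodytext">(.*?)</div>', re.S)


def fetch(url):
    """Stand-in for self.index_to_soup(url) in a real recipe."""
    return PAGES[url]


def collect_pages(html, parts):
    """Append this page's body text, then recurse into the next page, if any."""
    body = BODY_RE.search(html)
    if body:
        parts.append(body.group(1))
    pager = NEXT_RE.search(html)
    if pager:  # the last page has no next link, which ends the recursion
        collect_pages(fetch(pager.group(1)), parts)
    return parts


article = collect_pages(fetch("/article?page=1"), [])
print(" ".join(article))  # Page one. Page two. Page three.
```

The shape is the same as append_page: find the pager, fetch the next page, take its body, repeat until there is no pager left. In a real recipe you would keep working on the soup rather than on strings.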
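Do your testing with --test (which limits the download to a couple of articles per feed, so runs are quick) and -vv (verbose output), redirecting the output to a file you can read afterwards, as in:
ebook-convert pcper.recipe pcper --test -vv > pcper.txt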