Quote:
Originally Posted by Starson17
You're welcome and good luck. I prefer to help others figure out how to do it than to just write it. If you need help with pcper, let us know, and be sure to post your final results here so Kovid can add it to the code for use by others.
I have a couple more questions. I'm cleaning up the tweaktown.com output and ran into a problem: using keep_only_tags to isolate the article body and then remove_tags to strip out the bits I don't want works great for the first page, but the removed tags come back on the second page and the rest of the article.
The tag names are the same as on the first page, so I'm not sure why they stop being removed after page one.
tweaktown recipe code:
Spoiler:
Code:
class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher = 'TweakTown'
    category = 'PC Articles, Reviews and Guides'
    use_embedded_content = False
    max_articles_per_feed = 2
    oldest_article = 7
    cover_url = 'http://www.tweaktown.com/images/logo_white.gif'
    timefmt = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion = 10
    remove_javascript = True
    conversion_options = {'linearize_tables': True}
    # reverse_article_order = True
    html2lrf_options = [
        '--comment', description,
        '--category', category,
        '--publisher', publisher
    ]
    html2epub_options = ('publisher="' + publisher + '"\ncomments="' + description
                         + '"\ntags="' + category + '"')
    keep_only_tags = [dict(name='div', attrs={'id': ['article']})]
    remove_tags = [
        dict(name='html', attrs={'id': 'facebook'}),
        dict(name='div', attrs={'class': 'article-info clearfix'}),
        dict(name='select', attrs={'onchange': 'location.href=this.options[this.selectedIndex].value'}),
        dict(name='div', attrs={'class': 'price-grabber'}),
        dict(name=['h4'])
    ]
    feeds = [(u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml')]

    def append_page(self, soup, appendtag, position):
        # Follow the "next" pager link and splice each later page into the article.
        pager = soup.find('a', attrs={'class': 'next'})
        if pager:
            nexturl = pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'id': 'article'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0, mtag)
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a', attrs={'class': 'next'})
        if pager:
            pager.extract()
        return soup
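For context on the behavior above: append_page fetches later pages itself via index_to_soup, so those soups may not go through the same remove_tags cleanup that the first downloaded page gets, which would explain why the unwanted tags come back from page two onward. One option is to apply the same tag-matching yourself inside append_page. Here is a minimal stand-alone sketch of that matching logic; it uses stdlib ElementTree instead of BeautifulSoup so it runs outside calibre, and the page markup and spec names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical second-page markup (kept well-formed for ElementTree).
page2 = """<div id="article">
  <div class="article-info clearfix">byline and share widgets</div>
  <h4>sponsored header</h4>
  <p>real body text</p>
</div>"""

# Specs mirroring two of the recipe's remove_tags entries.
remove_specs = [('div', {'class': 'article-info clearfix'}), ('h4', {})]

def strip_tags(root, specs):
    """Remove every child element whose tag name and attributes match a spec."""
    for parent in root.iter():
        for child in list(parent):  # snapshot so removal is safe mid-loop
            for name, attrs in specs:
                if child.tag == name and all(child.get(k) == v for k, v in attrs.items()):
                    parent.remove(child)
                    break
    return root

root = strip_tags(ET.fromstring(page2), remove_specs)
remaining = [child.tag for child in root]
print(remaining)  # ['p']
```

In the real recipe, the equivalent step would loop over the remove_tags specs and call findAll/extract on each soup2 fetched in append_page, before splicing texttag into the article.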
Second question:
I've started the pcper.com recipe and managed to get the multi-page handling working. The problem here is that after the last page of an article, the site adds a link back to the home page under the same tag the page links are scraped from. The page links all start with "article.php?", but after the last page the link changes to "content_home.php?".
So is there a way to make the soup scrape only the links that start with "article.php?"?
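One way to do this is to match the href against a regular expression anchored at the start of the string. The hrefs below are made-up stand-ins for what the pcper pager might emit; the commented BeautifulSoup call shows how the same pattern would be used inside a recipe:

```python
import re

# Hypothetical hrefs as they might appear in the pager block.
hrefs = [
    'article.php?aid=123&type=expert&pid=2',
    'article.php?aid=123&type=expert&pid=3',
    'content_home.php?cat=Home',
]

# Inside a recipe, the same pattern can be passed to BeautifulSoup:
#   pager = soup.find('a', href=re.compile(r'^article\.php\?'))
pattern = re.compile(r'^article\.php\?')
page_links = [h for h in hrefs if pattern.match(h)]
print(page_links)  # the two article.php? links; content_home.php? is dropped
```

The `^` anchor and escaped `.`/`?` make sure only hrefs that literally begin with "article.php?" match, so the trailing home-page link is ignored.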
Thanks