Old 06-04-2010, 03:39 PM   #2048
kidtwisted
Quote:
Originally Posted by Starson17 View Post
You're welcome and good luck. I prefer to help others figure out how to do it than to just write it. If you need help with pcper, let us know, and be sure to post your final results here so Kovid can add it to the code for use by others.
I have a couple more questions. I'm cleaning up the tweaktown.com output and ran into a problem: using keep_only_tags to isolate the article body and then remove_tags to strip out the bits I don't want works great for the first page, but the removed tags come back on the second page and the rest of the article.
The tag names are the same as on the first page, so I'm not sure why they're not being removed after page one.
TweakTown recipe code:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher = 'TweakTown'
    category = 'PC Articles, Reviews and Guides'
    use_embedded_content = False
    max_articles_per_feed = 2
    oldest_article = 7
    cover_url = 'http://www.tweaktown.com/images/logo_white.gif'
    timefmt = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion             = 10
    remove_javascript = True
    conversion_options = { 'linearize_tables' : True}
   # reverse_article_order = True

    html2lrf_options = [
                          '--comment', description
                        , '--category', category
                        , '--publisher', publisher
                        ]
    
    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"'

    keep_only_tags = [dict(name='div', attrs={'id': ['article']})]

    remove_tags = [dict(name='html', attrs={'id': 'facebook'}),
                   dict(name='div', attrs={'class': 'article-info clearfix'}),
                   dict(name='select', attrs={'onchange': 'location.href=this.options[this.selectedIndex].value'}),
                   dict(name='div', attrs={'class': 'price-grabber'}),
                   dict(name=['h4'])]
    feeds =  [ (u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml') ]
    
    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class': 'next'})
        if pager:
            nexturl = pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'id': 'article'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)
        
    
    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0, mtag)
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a', attrs={'class': 'next'})
        if pager:
            pager.extract()
        return soup
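
In case it helps explain what I mean, the workaround I've been considering (completely untested, and `matches` below is just a name I made up, not a calibre API) is to re-apply the removals by hand inside append_page on each fetched page, on the theory that remove_tags is only being applied to the first page calibre downloads. The rule I'd try to mimic is: a tag matches a remove_tags entry when the name agrees (single name or list of names) and every attribute in the entry is present with the same value. A plain-Python sketch of that rule:

```python
# Untested sketch, no calibre imports: mimic the name/attrs matching of
# remove_tags entries like dict(name='div', attrs={'class': 'price-grabber'}).

def matches(tag_name, tag_attrs, spec):
    names = spec.get('name')
    if names is not None:
        # 'name' may be a single tag name or a list of names
        if isinstance(names, list):
            if tag_name not in names:
                return False
        elif tag_name != names:
            return False
    # every attribute in the spec must be present with the same value
    for key, val in spec.get('attrs', {}).items():
        if tag_attrs.get(key) != val:
            return False
    return True

remove_tags = [dict(name='div', attrs={'class': 'price-grabber'}),
               dict(name=['h4'])]

print(matches('div', {'class': 'price-grabber'}, remove_tags[0]))  # True
print(matches('h4', {}, remove_tags[1]))                           # True
print(matches('div', {}, remove_tags[1]))                          # False
```

Inside append_page the idea would be to walk texttag.findAll(True) and call extract() on anything matching one of the remove_tags entries, before inserting texttag into appendtag.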



Second question:
I've started on the pcper.com recipe and managed to get the multi-page handling working. The problem here is that after the last page of the article, the site adds a link back to the home page under the same tag the page links are scraped from. The page links all start with "article.php?", but after the last page the link changes to "content_home.php?".

So is there a way to make the soup scrape only the links that start with "article.php?"?
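
To show the kind of filter I'm after, here's a tiny standalone example (the hrefs are made up). In the recipe itself I'd guess passing a compiled regex as the attribute value, e.g. `soup.find('a', attrs={'href': re.compile(r'^article\.php')})`, might also work, since I've read BeautifulSoup accepts regexes for attribute values, but I haven't tested that:

```python
import re

# Keep only in-article page links ("article.php?...") and drop the
# "content_home.php?..." link pcper.com appends after the last page.
# The hrefs below are made-up examples.
ARTICLE_LINK = re.compile(r'^article\.php\?')

def article_links(hrefs):
    return [h for h in hrefs if ARTICLE_LINK.match(h)]

hrefs = ['article.php?aid=123&pid=2',
         'article.php?aid=123&pid=3',
         'content_home.php?cat=reviews']
print(article_links(hrefs))  # ['article.php?aid=123&pid=2', 'article.php?aid=123&pid=3']
```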

Thanks