Old 08-18-2013, 09:33 AM   #7
Divingduck
I made an update for this recipe. The recipe now includes the High Country News blog feeds, so there is no need to use two recipes for HCN feed content. In addition, I changed the method used to extract the data, so some of the articles have pictures again. As I haven't found an error in the last 8 weeks, here is the new version:

Spoiler:
Code:
# -*- coding: utf-8 -*-
##
## Written:      2012-01-28
## Last Edited:  2013-08-18
## Remark:       Version 1.2 
##               Integration of former separated Blog-News
##
__license__   = 'GPL v3'
__copyright__ = '2013, Armin Geller'

'''
Fetch High Country News
'''
from calibre.web.feeds.news import BasicNewsRecipe
class HighCountryNews(BasicNewsRecipe):

    title                 = u'High Country News'
    description           = u'High Country News (RSS Version)'
    __author__            = 'Armin Geller'
    publisher             = 'High Country News'
    category              = 'news, politics'
    timefmt               = ' [%a, %d %b %Y]'
    language              = 'en_US'
    encoding              = 'UTF-8'
    publication_type      = 'newspaper'
    oldest_article        = 14
    max_articles_per_feed = 100
    no_stylesheets        = True 
    auto_cleanup          = False
    remove_javascript     = True
    remove_empty_feeds    = True  # 2013-08-18 AGe add
    use_embedded_content  = False  
    
    masthead_url          = 'http://www.hcn.org/logo.jpg'
    cover_source          = 'http://www.hcn.org'
    
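    def get_cover_url(self):
        # Read the cover image URL from the HCN home page
        cover_source_soup = self.index_to_soup(self.cover_source)
        preview_image_div = cover_source_soup.find(attrs={'class':' portaltype-Plone Site content--hcn template-homepage_view'})
        try:
            return preview_image_div.div.img['src']
        except (AttributeError, KeyError, TypeError):
            # Layout changed or no cover image found: build without a cover
            return None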

    
    feeds = [
              (u'Most recent', u'http://feeds.feedburner.com/hcn/most-recent?format=xml'),
              (u'Current Issue', u'http://feeds.feedburner.com/hcn/current-issue?format=xml'),
              
              (u'From the Blogs', u'http://feeds.feedburner.com/hcn/FromTheBlogs?format=xml'), # 2013-07-23 AGe add
              (u'Heard around the West', u'http://feeds.feedburner.com/hcn/heard?format=xml'), # 2013-07-23 AGe add
              (u'The GOAT Blog', u'http://feeds.feedburner.com/hcn/goat?format=xml'),          # 2013-07-23 AGe add  
              (u'The Range', u'http://feeds.feedburner.com/hcn/range?format=xml'),             # 2013-07-23 AGe add

              (u'Writers on the Range', u'http://feeds.feedburner.com/hcn/wotr'),
              (u'High Country Views', u'http://feeds.feedburner.com/hcn/HighCountryViews'),
             ]
 
    # 2013-07-23 AGe: new code that no longer uses print_version
 
    keep_only_tags    = [
                          dict(name='div', attrs={'id':['content']}),
                        ]

    remove_tags = [
                    dict(name='div', attrs={'class':['documentActions supercedeDocumentActions editorialDocumentActions', 
                                                      'documentActions supercedeDocumentActions editorialDocumentActions editorialFooterDocumentActions',
                                                      'article-sidebar',
                                                      'image-viewer-controls nojs',
                                                      'protectedArticleWrapper',
                                                      'visualClear',
                                                     ]})
                  ]
 
    INDEX                 = ''

    def append_page(self, soup, appendtag, position):
        # Follow the "next" pager link and append the text of the following page
        pager = soup.find('span', attrs={'class':'next'})
        self.log.debug('AGE-append_page-------------->: ', pager)
        if pager is not None and pager.a is not None:
            nexturl = self.INDEX + pager.a['href']
            self.log.debug('AGE--------------->: ', nexturl)
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'article-text'})
            if texttag is None:
                # The next page has no article text, so there is nothing to append
                return
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        # Remove the pager bar once all pages have been merged
        pager = soup.find('div', attrs={'class':'listingBar listingBar-article'})
        if pager is not None:
            pager.extract()
        return self.adeify_images(soup)
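
For anyone wondering what append_page is doing: it follows the "next" link of a multi-page article and stitches the pages together. Here is a rough stand-alone sketch of that idea only, not the recipe code itself. The PAGES dictionary, the URLs, and the names fetch and collect_article are made up for illustration and replace the real download via index_to_soup.

```python
# Hypothetical in-memory "site": each page has some text and an optional
# link to the next page. Names and URLs are illustrative only.
PAGES = {
    "/story": {"text": "Part one.", "next": "/story?page=2"},
    "/story?page=2": {"text": "Part two.", "next": "/story?page=3"},
    "/story?page=3": {"text": "Part three.", "next": None},
}


def fetch(url):
    """Stand-in for downloading a page: look it up in PAGES."""
    return PAGES[url]


def collect_article(url, seen=None):
    """Return the text of `url` followed by the text of every linked next page."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against a pager that links back to itself
        return []
    seen.add(url)
    page = fetch(url)
    parts = [page["text"]]
    if page["next"]:
        # Recurse into the next page, as append_page does with soup2
        parts.extend(collect_article(page["next"], seen))
    return parts


if __name__ == "__main__":
    print(" ".join(collect_article("/story")))
```

The `seen` set is an extra safety net that the recipe does not have; it only matters if a site's pager ever loops.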


Some remarks on HCN and this recipe:

HCN doesn't update its content very often, especially the blogs, unfortunately. If you'd like to see more articles from the past, change the entry oldest_article = 14 in the recipe to something more appropriate for you. A value of 100 (days) results in an 8.3 MB EPUB with all of the feeds currently in use. I set it to 14 because that seems to match how often the content is updated. You will find the best setting for your needs.

The feed called “High Country Views” also contains entries starting with “West of 100: …”. These entries are podcasts, which HCN unfortunately decided to discontinue. They are still available in the feed and I didn’t remove them. So if you are sitting in front of a PC with the calibre viewer, you can use the article link to listen to the podcasts. Remember to increase oldest_article, because the oldest audio file is from February 28, 2011. There are 15 audio files available.
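
If you want a concrete number for oldest_article that still reaches back to the first podcast, a small calculation like the following may help. It is only a sketch: the function name days_to_cover and the one-day safety margin are my own assumptions, and the two dates are the ones mentioned in this post.

```python
from datetime import date


def days_to_cover(oldest_item, download_day, margin_days=1):
    """Days between the two dates plus a small margin (margin is an assumption)."""
    return (download_day - oldest_item).days + margin_days


# Dates taken from the post: the oldest audio file and the day of this post.
first_podcast = date(2011, 2, 28)
post_day = date(2013, 8, 18)

if __name__ == "__main__":
    # Use the printed value for oldest_article in the recipe
    print(days_to_cover(first_podcast, post_day))
```

The later you download, the larger the value needs to be, so replace post_day with date.today() when you run it yourself.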

Have a nice Sunday
DivingDuck
Attached Files
File Type: zip HighCountryNews_AGeV1.2.zip (1.5 KB, 193 views)