Old 09-18-2014, 08:58 AM   #9
Divingduck
Wizard
Please find attached a new version. HCN has a new web design, so I updated the cleanup for the new article layout. I also added some extra CSS to get rid of the ugly article design. Hope you like it.

Spoiler:
Code:
# -*- coding: utf-8 -*-
##
## Written:      2012-01-28
## Last Edited:  2014-09-18
## Remark:       Version 2.0 first check 
##               Update cleanup for new web article design and extra css
##
__license__   = 'GPL v3'
__copyright__ = '2013, Armin Geller'

'''
Fetch High Country News
'''
from calibre.web.feeds.news import BasicNewsRecipe
class HighCountryNews(BasicNewsRecipe):

    title                 = u'High Country News'
    description           = u'High Country News (RSS Version)'
    __author__            = 'Armin Geller'
    publisher             = 'High Country News'
    category              = 'news, politics'
    timefmt               = ' [%a, %d %b %Y]'
    language              = 'en_US'
    encoding              = 'UTF-8'
    publication_type      = 'newspaper'
    oldest_article        = 14
    max_articles_per_feed = 100
    no_stylesheets        = True 
    auto_cleanup          = False
    remove_javascript     = True
    remove_empty_feeds    = True
    use_embedded_content  = False  
    
    masthead_url          = 'http://www.hcn.org/logo.jpg'
    cover_source          = 'http://www.hcn.org/issues' # AGE 2014-09-18 new
    
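    def get_cover_url(self):
        # Fetch the issues page and use the first cover image listed there.
        cover_source_soup = self.index_to_soup(self.cover_source)
        preview_image_div = cover_source_soup.find(attrs={'class':'articles'}) # AGE 2014-09-18 new
        try:
            return preview_image_div.div.a.figure.img['src'] # AGE 2014-09-18 new, always take the first one (hopefully)
        except (AttributeError, TypeError, KeyError):
            return None # page layout changed; fall back to no cover URL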

    # AGe: new extra CSS to get rid of the ugly styling
    # li: removes the disc bullet style
    # .caption and .credit: description and author of pictures

    extra_css      =  '''
                      h1 {font-size: 1.6em; text-align: left}
                      h2 {font-size: 1em; font-style: italic; font-weight: normal}
                      h3 {font-size: 1.3em;text-align: left}
                      h4, h5, h6 {font-size: 1em;text-align: left}
                      li {list-style-type: none}
                      .caption, .credit {font-size: 0.9em; font-style: italic}
                      '''

    feeds = [
              (u'Most recent', u'http://feeds.feedburner.com/hcn/most-recent?format=xml'),
              (u'Current Issue', u'http://feeds.feedburner.com/hcn/current-issue?format=xml'),
              
              (u'From the Blogs', u'http://feeds.feedburner.com/hcn/FromTheBlogs?format=xml'),
              (u'Heard around the West', u'http://feeds.feedburner.com/hcn/heard?format=xml'),
              (u'The GOAT Blog', u'http://feeds.feedburner.com/hcn/goat?format=xml'),
              (u'The Range', u'http://feeds.feedburner.com/hcn/range?format=xml'),

              (u'Writers on the Range', u'http://feeds.feedburner.com/hcn/wotr'),
              (u'High Country Views', u'http://feeds.feedburner.com/hcn/HighCountryViews'),
             ]

    # 2014-09-18 AGe New coding related to design changes
 
    keep_only_tags    = [
                          dict(name='div', attrs={'id':'content'}),
                          dict(name='div', attrs={'class':'opaque'}),
                        ]

    remove_tags = [
                    dict(name='div', attrs={'class':[
                                                      'large-4 columns right-portlets',
                                                      'small-12 columns',
                                                      'pagination-share',
                                                      'tiny content f-dropdown',
                                                      'image-viewer-controls',
                                                     ]}),
                    dict(name='ul', attrs={'class':[
                                                     'document-actions',
                                                     'topics',
                                                    ]}),
                    dict(name='a', attrs={'name':[
                                                   'body',
                                                  ]}),

                  ]
 
    # AGE 2014-09-18: the multi-page handling below stays in for a while,
    # but it has no effect for now ...
    
    INDEX                 = ''
    def append_page(self, soup, appendtag, position):
        # Follow the "next" link, if there is one, and append that page's article text.
        pager = soup.find('span',attrs={'class':'next'})
        if pager and pager.a:
            nexturl = self.INDEX + pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'article-text'})
            if texttag is None:
                return # no article text on the next page, nothing to append
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        # Remove the pagination bar once the pages have been merged.
        pager = soup.find('div',attrs={'class':'listingBar listingBar-article'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)
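
The cover lookup in get_cover_url depends on the exact nesting of the issues page, so it can help to try the idea outside calibre first. The following is only a rough sketch using the Python standard library, not the recipe's own code: the class FirstImageInClass, the function first_cover_src, and the sample markup are all hypothetical, and the real HCN page may look different.

```python
from html.parser import HTMLParser


class FirstImageInClass(HTMLParser):
    """Record the src of the first <img> nested inside an element with a given class."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
            'link', 'meta', 'source', 'track', 'wbr'}

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while the parser is inside the wanted container
        self.src = None     # first image URL found, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Not inside the container yet: check whether this tag opens it.
            classes = (attrs.get('class') or '').split()
            if self.wanted_class in classes:
                self.depth = 1
            return
        if tag == 'img' and self.src is None:
            self.src = attrs.get('src')
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in self.VOID:
            self.depth -= 1


def first_cover_src(html, wanted_class='articles'):
    """Return the first image URL inside the container, or None if there is none."""
    parser = FirstImageInClass(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.src


# Hypothetical markup, loosely shaped like a div > a > figure > img chain.
SAMPLE = '''
<img src="/logo.jpg">
<div class="articles row">
  <div><a href="/issues/a"><figure><img src="/covers/first.jpg"></figure></a></div>
  <div><a href="/issues/b"><figure><img src="/covers/second.jpg"></figure></a></div>
</div>
'''

print(first_cover_src(SAMPLE))             # /covers/first.jpg
print(first_cover_src('<p>no cover</p>'))  # None
```

Returning None instead of raising an error when the container is missing is a deliberate choice here, so a future layout change would cost only the cover rather than the whole download.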
Attached Files
File Type: zip HighCountryNews_AGeV2.0.zip (1.7 KB, 201 views)
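
One thing worth checking in any extra_css string is a selector list that ends in a comma, because a standards-following CSS parser will typically discard that whole rule. Below is a minimal sketch of such a check, assuming simple rules without comments, nested braces, or @-rules. The helper name find_bad_selector_lists and the sample stylesheet are made up for illustration and are not part of calibre.

```python
import re


def find_bad_selector_lists(css):
    """Return selector lists that contain an empty entry (e.g. a trailing comma)."""
    bad = []
    # Each match is one simple rule; group 1 is the text before the '{'.
    for match in re.finditer(r'([^{}]+)\{[^{}]*\}', css):
        selector_list = match.group(1).strip()
        parts = [part.strip() for part in selector_list.split(',')]
        if any(part == '' for part in parts):
            bad.append(selector_list)
    return bad


# Hypothetical stylesheet with one broken rule in the middle.
SAMPLE_CSS = '''
h1 {font-size: 1.6em}
h4, h5, h6, {font-size: 1em}
.caption, .credit {font-style: italic}
'''

print(find_bad_selector_lists(SAMPLE_CSS))  # ['h4, h5, h6,']
```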