Updated recipe for ABC News Australia

PatStapleton · 05-15-2020, 01:03 AM

Hi,

A fellow named Vikas emailed me to say that the previous recipe had stopped working.

I've rewritten the recipe so it works again with the latest version of the website.

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
__license__ = 'GPL v3'
__copyright__ = '2020, Pat Stapleton <pat.stapleton at gmail.com>'
'''
Recipe for ABC News Australia (online)
'''
from calibre.web.feeds.news import BasicNewsRecipe

class ABCNews(BasicNewsRecipe):
    title          = 'ABC News'
    language       = 'en_AU'
    __author__     = 'Pat Stapleton'
    description = 'From the Australian Broadcasting Corporation. The ABC is owned and funded by the Australian Government, but is editorially independent.'
    masthead_url = 'https://www.abc.net.au/cm/lb/8212706/data/news-logo-2017---desktop-print-data.png'
    cover_url = 'https://www.abc.net.au/news/linkableblob/8413676/data/abc-news-og-data.jpg'
    cover_margins = (0,20,'#000000')
    scale_news_images_to_device = True
    oldest_article = 7 #days
    max_articles_per_feed = 100
    publication_type = 'newspaper'

#    auto_cleanup   = True # enable this as a backup option if recipe stops working

#    use_embedded_content = False # if set to true will assume that all the article content is within the feed (i.e. won't try to fetch more data)

    no_stylesheets = True
    remove_javascript = True
    
    keep_only_tags = [dict(id='content')] #the article content is contained in <main id="content" /> tag

    # ************************************
    # Regular expressions for remove_tags:
    # ************************************
    #remove aside tag - used for overlapping boxes within article
    #aside_reg_exp = '^.*aside.*$'

    # ************************************
    # Clear out all the unwanted html tags:
    # ************************************
    remove_tags = [
#        dict(name='aside', attrs={'name': re.compile(aside_reg_exp, re.IGNORECASE)})
        {
            'name': ['meta', 'link', 'noscript', 'aside']
        },
        {
            'attrs': {
                'data-component': ['Ticker', 'PublishedDate', 'Timestamp', 'Link', 'ShareLink', 'ShareUtility', 
                'RelatedStories', 'ArticleTopStories', 'ArticleTopStoriesCard', 'ArticleJustInStories', 
                'RelatedTopics', 'Player', 'ArticleSidebar', 'TopStoriesSidebar', 'UtilityBar']
            }
        }
    ]
    
    # ************************************
    # Tidy up the output to look neat for reading
    # ************************************
    remove_attributes = ['width', 'height', 'style']
    extra_css = '.byline{font-size:smaller;margin-bottom:10px;}.inline-caption{display:block;font-size:smaller;text-decoration: none;}'
 
    # ************************************
    # Fix images (dynamically generated by ABC news)
    # ************************************
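    def preprocess_html(self, soup):
        # These images carry their URL in data-src rather than src, so copy
        # the last token containing a '/' into src to get the image fetched.
        for img in soup.findAll('img', attrs={'data-src': True}):
            for x in img['data-src'].split():
                if '/' in x:
                    img['src'] = x
        return soup

    compress_news_images = True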
    
    feeds          = [
        ('Top Stories', 'https://www.abc.net.au/news/feed/45910/rss.xml'),
        ('Politics', 'https://www.abc.net.au/news/feed/1534/rss.xml'),
        ('World', 'https://www.abc.net.au/news/feed/4405318/rss.xml'),
        ('Business', 'https://www.abc.net.au/news/feed/51892/rss.xml'),
        ('Analysis', 'https://www.abc.net.au/news/feed/7571268/rss.xml'),
        ('Sport', 'https://www.abc.net.au/news/feed/2942460/rss.xml'),
        ('Science', 'https://www.abc.net.au/news/feed/8132426/rss.xml'),
        ('Health', 'https://www.abc.net.au/news/feed/9167762/rss.xml'),
        ('Arts and Entertainment', 'https://www.abc.net.au/news/feed/472/rss.xml'),
        ('Fact Check', 'https://www.abc.net.au/news/feed/5306468/rss.xml'),
        ('Adelaide', 'https://www.abc.net.au/news/feed/8057540/rss.xml'),
        ('Brisbane', 'https://www.abc.net.au/news/feed/8053540/rss.xml'),
        ('Canberra', 'https://www.abc.net.au/news/feed/8057234/rss.xml'),
        ('Darwin', 'https://www.abc.net.au/news/feed/8057648/rss.xml'),
        ('Hobart', 'https://www.abc.net.au/news/feed/8054562/rss.xml'),
        ('Melbourne', 'https://www.abc.net.au/news/feed/8057136/rss.xml'),
        ('Perth', 'https://www.abc.net.au/news/feed/8057096/rss.xml'),
        ('Sydney', 'https://www.abc.net.au/news/feed/8055316/rss.xml'),
    ]
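
For anyone adapting the image fix, here is a minimal standalone sketch of what the loop in preprocess_html does with each data-src value: it splits the attribute on whitespace and keeps the last token that contains a '/'. The helper name pick_src and the sample attribute values are hypothetical, made up for illustration; they are not taken from the ABC site or from calibre.

```python
def pick_src(data_src):
    """Return the last whitespace-separated token containing '/', or None.

    Mirrors the inner loop of preprocess_html in the recipe above.
    pick_src is a hypothetical helper name, not part of calibre.
    """
    chosen = None
    for token in data_src.split():
        if '/' in token:
            chosen = token  # later tokens overwrite earlier ones
    return chosen


# Made-up example values, not real ABC markup
print(pick_src('https://example.com/small.jpg 1x https://example.com/large.jpg 2x'))
print(pick_src('https://example.com/only.jpg'))
print(pick_src('no-url-here'))
```

Because later tokens overwrite earlier ones, a value listing several URLs ends up using the last one.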

PatStapleton · 05-15-2020, 08:00 PM

Vikas's feedback was that it takes a while to download all the articles.

The oldest_article setting is now 7 days, where previously it was 2, so it may be worth lowering it if the download takes too long for your liking.

You could also comment out the local city feeds (e.g. Sydney, Melbourne, Brisbane) if you don't want them.

Articles on the new website also contain more images than on the old one, which adds slightly to the download time.
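
If you would rather not comment lines out by hand, one option is to filter the feed list by title. This is only a sketch in plain Python, assuming the same (title, url) pairs as the recipe; all_feeds, LOCAL_FEEDS and national_feeds are made-up names, and only a handful of the feeds are repeated here to keep it short.

```python
# Subset of the (title, url) pairs from the recipe, for illustration only
all_feeds = [
    ('Top Stories', 'https://www.abc.net.au/news/feed/45910/rss.xml'),
    ('Politics', 'https://www.abc.net.au/news/feed/1534/rss.xml'),
    ('Brisbane', 'https://www.abc.net.au/news/feed/8053540/rss.xml'),
    ('Melbourne', 'https://www.abc.net.au/news/feed/8057136/rss.xml'),
    ('Sydney', 'https://www.abc.net.au/news/feed/8055316/rss.xml'),
]

# Titles of the local capital city feeds to skip (hypothetical name)
LOCAL_FEEDS = {'Adelaide', 'Brisbane', 'Canberra', 'Darwin',
               'Hobart', 'Melbourne', 'Perth', 'Sydney'}

# Keep only the feeds whose title is not in the local set
national_feeds = [(title, url) for title, url in all_feeds
                  if title not in LOCAL_FEEDS]

for title, url in national_feeds:
    print(title, url)
```

To use this in a recipe, these names would need to be defined at module level, above the class, with feeds = national_feeds set inside it; a list comprehension written directly in the class body cannot see other class attributes in Python 3.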

PatStapleton · 05-16-2020, 05:27 AM

Here's a modified version that limits articles to the last 2 days, as in the original recipe (instead of 7), and raises simultaneous_downloads to 10 (from the default of 5). I've also commented out the local capital city feeds, which can be re-enabled as desired.

Together, these changes should make the recipe quicker to download for the average user.

Code:

#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

__license__ = 'GPL v3'
__copyright__ = '2020, Pat Stapleton <pat.stapleton at gmail.com>'
'''
Recipe for ABC News Australia (online)
'''
from calibre.web.feeds.news import BasicNewsRecipe


class ABCNews(BasicNewsRecipe):
    title          = 'ABC News'
    language       = 'en_AU'
    __author__     = 'Pat Stapleton'
    description = 'From the Australian Broadcasting Corporation. The ABC is owned and funded by the Australian Government, but is editorially independent.'
    masthead_url = 'https://www.abc.net.au/cm/lb/8212706/data/news-logo-2017---desktop-print-data.png'
    cover_url = 'https://www.abc.net.au/news/linkableblob/8413676/data/abc-news-og-data.jpg'
    cover_margins = (0,20,'#000000')
    scale_news_images_to_device = True
    oldest_article = 2  # days
    simultaneous_downloads = 10
    max_articles_per_feed = 100
    publication_type = 'newspaper'

#    auto_cleanup   = True # enable this as a backup option if recipe stops working

#    use_embedded_content = False # if set to true will assume that all the article content is within the feed (i.e. won't try to fetch more data)

    no_stylesheets = True
    remove_javascript = True

    keep_only_tags = [dict(id='content')]  # the article content is contained in <main id="content" /> tag

    # ************************************
    # Regular expressions for remove_tags:
    # ************************************
    # remove aside tag - used for overlapping boxes within article
    # aside_reg_exp = '^.*aside.*$'

    # ************************************
    # Clear out all the unwanted html tags:
    # ************************************
    remove_tags = [
#        dict(name='aside', attrs={'name': re.compile(aside_reg_exp, re.IGNORECASE)})
        {
            'name': ['meta', 'link', 'noscript', 'aside']
        },
        {
            'attrs': {
                'data-component': ['Ticker', 'PublishedDate', 'Timestamp', 'Link', 'ShareLink', 'ShareUtility',
                'RelatedStories', 'ArticleTopStories', 'ArticleTopStoriesCard', 'ArticleJustInStories',
                'RelatedTopics', 'Player', 'ArticleSidebar', 'TopStoriesSidebar', 'UtilityBar']
            }
        }
    ]

    # ************************************
    # Tidy up the output to look neat for reading
    # ************************************
    remove_attributes = ['width', 'height', 'style']
    extra_css = '.byline{font-size:smaller;margin-bottom:10px;}.inline-caption{display:block;font-size:smaller;text-decoration: none;}'

    # ************************************
    # Fix images (dynamically generated by ABC news)
    # ************************************
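    def preprocess_html(self, soup):
        # These images carry their URL in data-src rather than src, so copy
        # the last token containing a '/' into src to get the image fetched.
        for img in soup.findAll('img', attrs={'data-src': True}):
            for x in img['data-src'].split():
                if '/' in x:
                    img['src'] = x
        return soup

    compress_news_images = True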

    feeds          = [
        ('Top Stories', 'https://www.abc.net.au/news/feed/45910/rss.xml'),
        ('Politics', 'https://www.abc.net.au/news/feed/1534/rss.xml'),
        ('World', 'https://www.abc.net.au/news/feed/4405318/rss.xml'),
        ('Business', 'https://www.abc.net.au/news/feed/51892/rss.xml'),
        ('Analysis', 'https://www.abc.net.au/news/feed/7571268/rss.xml'),
        ('Sport', 'https://www.abc.net.au/news/feed/2942460/rss.xml'),
        ('Science', 'https://www.abc.net.au/news/feed/8132426/rss.xml'),
        ('Health', 'https://www.abc.net.au/news/feed/9167762/rss.xml'),
        ('Arts and Entertainment', 'https://www.abc.net.au/news/feed/472/rss.xml'),
        ('Fact Check', 'https://www.abc.net.au/news/feed/5306468/rss.xml'),
#        ('Adelaide', 'https://www.abc.net.au/news/feed/8057540/rss.xml'), #enable by removing # at start of line
#        ('Brisbane', 'https://www.abc.net.au/news/feed/8053540/rss.xml'), #enable by removing # at start of line
#        ('Canberra', 'https://www.abc.net.au/news/feed/8057234/rss.xml'), #enable by removing # at start of line
#        ('Darwin', 'https://www.abc.net.au/news/feed/8057648/rss.xml'), #enable by removing # at start of line
#        ('Hobart', 'https://www.abc.net.au/news/feed/8054562/rss.xml'), #enable by removing # at start of line
#        ('Melbourne', 'https://www.abc.net.au/news/feed/8057136/rss.xml'), #enable by removing # at start of line
#        ('Perth', 'https://www.abc.net.au/news/feed/8057096/rss.xml'), #enable by removing # at start of line
#        ('Sydney', 'https://www.abc.net.au/news/feed/8055316/rss.xml'), #enable by removing # at start of line
    ]
