Old 05-15-2020, 12:03 AM   #1
PatStapleton
Member
PatStapleton began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Nov 2011
Location: Australia
Device: Kindle 4
Updated recipe for ABC News Australia

Hi,

A fellow named Vikas emailed me about the previous recipe which had apparently stopped working.

I've rewritten the recipe so it works again with the latest version of the website.

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
__license__ = 'GPL v3'
__copyright__ = '2020, Pat Stapleton <pat.stapleton at gmail.com>'
'''
Recipe for ABC News Australia (online)
'''
from calibre.web.feeds.news import BasicNewsRecipe

class ABCNews(BasicNewsRecipe):
    title          = 'ABC News'
    language       = 'en_AU'
    __author__     = 'Pat Stapleton'
    description = 'From the Australian Broadcasting Corporation. The ABC is owned and funded by the Australian Government, but is editorially independent.'
    masthead_url = 'https://www.abc.net.au/cm/lb/8212706/data/news-logo-2017---desktop-print-data.png'
    cover_url = 'https://www.abc.net.au/news/linkableblob/8413676/data/abc-news-og-data.jpg'
    cover_margins = (0,20,'#000000')
    scale_news_images_to_device = True
    oldest_article = 7 #days
    max_articles_per_feed = 100
    publication_type = 'newspaper'

#    auto_cleanup   = True # enable this as a backup option if recipe stops working

#    use_embedded_content = False # if set to true will assume that all the article content is within the feed (i.e. won't try to fetch more data)

    no_stylesheets = True
    remove_javascript = True
    
    keep_only_tags = [dict(id='content')] #the article content is contained in <main id="content" /> tag

    # ************************************
    # Regular expressions for remove_tags:
    # ************************************
    #remove aside tag - used for overlapping boxes within article
    #aside_reg_exp = '^.*aside.*$'

    # ************************************
    # Clear out all the unwanted html tags:
    # ************************************
    remove_tags = [
#        dict(name='aside', attrs={'name': re.compile(aside_reg_exp, re.IGNORECASE)})
        {
            'name': ['meta', 'link', 'noscript', 'aside']
        },
        {
            'attrs': {
                'data-component': ['Ticker', 'PublishedDate', 'Timestamp', 'Link', 'ShareLink', 'ShareUtility', 
                'RelatedStories', 'ArticleTopStories', 'ArticleTopStoriesCard', 'ArticleJustInStories', 
                'RelatedTopics', 'Player', 'ArticleSidebar', 'TopStoriesSidebar', 'UtilityBar']
            }
        }
    ]
    
    # ************************************
    # Tidy up the output to look neat for reading
    # ************************************
    remove_attributes = ['width', 'height', 'style']
    extra_css = '.byline{font-size:smaller;margin-bottom:10px;}.inline-caption{display:block;font-size:smaller;text-decoration: none;}'
 
    # ************************************
    # Fix images (dynamically generated by ABC news)
    # ************************************
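    def preprocess_html(self, soup):
        # copy the image URL from data-src into src so the image gets downloaded
        for img in soup.findAll('img', attrs={'data-src': True}):
            # data-src may hold several space-separated tokens; keep the last one that looks like a URL
            for x in img['data-src'].split():
                if '/' in x:
                    img['src'] = x
        return soup
    compress_news_images = True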
    
    feeds          = [
        ('Top Stories', 'https://www.abc.net.au/news/feed/45910/rss.xml'),
        ('Politics', 'https://www.abc.net.au/news/feed/1534/rss.xml'),
        ('World', 'https://www.abc.net.au/news/feed/4405318/rss.xml'),
        ('Business', 'https://www.abc.net.au/news/feed/51892/rss.xml'),
        ('Analysis', 'https://www.abc.net.au/news/feed/7571268/rss.xml'),
        ('Sport', 'https://www.abc.net.au/news/feed/2942460/rss.xml'),
        ('Science', 'https://www.abc.net.au/news/feed/8132426/rss.xml'),
        ('Health', 'https://www.abc.net.au/news/feed/9167762/rss.xml'),
        ('Arts and Entertainment', 'https://www.abc.net.au/news/feed/472/rss.xml'),
        ('Fact Check', 'https://www.abc.net.au/news/feed/5306468/rss.xml'),
        ('Adelaide', 'https://www.abc.net.au/news/feed/8057540/rss.xml'),
        ('Brisbane', 'https://www.abc.net.au/news/feed/8053540/rss.xml'),
        ('Canberra', 'https://www.abc.net.au/news/feed/8057234/rss.xml'),
        ('Darwin', 'https://www.abc.net.au/news/feed/8057648/rss.xml'),
        ('Hobart', 'https://www.abc.net.au/news/feed/8054562/rss.xml'),
        ('Melbourne', 'https://www.abc.net.au/news/feed/8057136/rss.xml'),
        ('Perth', 'https://www.abc.net.au/news/feed/8057096/rss.xml'),
        ('Sydney', 'https://www.abc.net.au/news/feed/8055316/rss.xml'),
    ]
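
For anyone wondering what the preprocess_html step in the recipe does, here is a minimal standalone sketch of the same token-picking logic, without calibre or BeautifulSoup. The function name pick_image_url and the example.invalid URLs are made up for illustration, and it assumes data-src can hold a URL followed by something like a size hint, which is what the split() in the recipe suggests.

```python
def pick_image_url(data_src):
    """Return the last whitespace-separated token containing '/', or None."""
    chosen = None
    for token in data_src.split():
        if '/' in token:  # same check the recipe uses to spot a URL-like token
            chosen = token
    return chosen


if __name__ == '__main__':
    # Hypothetical attribute values, not taken from the ABC website.
    print(pick_image_url('https://example.invalid/img/photo.jpg 700w'))
    print(pick_image_url('700w'))
```

If no token contains a slash, the sketch returns None, which matches the recipe leaving src untouched in that case.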
Old 05-15-2020, 07:00 PM   #2
PatStapleton
Some feedback from Vikas was that it takes a while to download all the articles.

The oldest_article setting is now 7 days, where the previous recipe used 2, so it may be worth lowering it if the download takes too long for your liking.
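
To give a rough idea of why that number matters, here is a small sketch of an age cutoff. It only illustrates the idea of dropping articles older than N days; it is not calibre's actual code, and the name is_recent and the dates are made up for the example.

```python
from datetime import datetime, timedelta


def is_recent(published, now, oldest_article_days):
    """True if the article was published within the last N days."""
    return now - published <= timedelta(days=oldest_article_days)


if __name__ == '__main__':
    now = datetime(2020, 5, 15, 0, 3)
    three_days_old = now - timedelta(days=3)
    print(is_recent(three_days_old, now, 7))  # kept with a 7 day window
    print(is_recent(three_days_old, now, 2))  # dropped with a 2 day window
```

A three-day-old article passes with a window of 7 and is skipped with a window of 2, so the smaller setting means fewer pages to fetch.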

You could also comment out the local city feeds (e.g. Sydney, Melbourne, Brisbane) if not wanted.
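
If you would rather not edit the list by hand, the same thing could be done by filtering it. This is only a sketch of an alternative: the names ALL_FEEDS, CITY_FEEDS and drop_city_feeds are invented here, and the list is shortened to a few of the feeds from the recipe.

```python
# Titles of the local capital city feeds in the recipe.
CITY_FEEDS = {'Adelaide', 'Brisbane', 'Canberra', 'Darwin',
              'Hobart', 'Melbourne', 'Perth', 'Sydney'}

# Shortened copy of the recipe's feed list, for illustration only.
ALL_FEEDS = [
    ('Top Stories', 'https://www.abc.net.au/news/feed/45910/rss.xml'),
    ('Politics', 'https://www.abc.net.au/news/feed/1534/rss.xml'),
    ('Melbourne', 'https://www.abc.net.au/news/feed/8057136/rss.xml'),
    ('Sydney', 'https://www.abc.net.au/news/feed/8055316/rss.xml'),
]


def drop_city_feeds(feeds, cities=CITY_FEEDS):
    """Return the feeds whose title is not one of the city names."""
    return [(title, url) for title, url in feeds if title not in cities]


if __name__ == '__main__':
    for title, url in drop_city_feeds(ALL_FEEDS):
        print(title, url)
```

Commenting out the lines in the recipe is simpler; the filter is only handy if you want to keep the full list in one place and switch the cities on and off.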

The articles on the new website also carry more images than on the old one, which adds slightly to the download time.
Old 05-16-2020, 04:27 AM   #3
PatStapleton
Here's a modified version that reduces the articles to the last 2 days like in the original (instead of 7 days), and also increases the simultaneous downloads to 10 (from default of 5). I've also commented out the local capital city feeds which can be re-enabled as desired.

This should all hopefully make this quicker to download for the average user.

Code:
#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

__license__ = 'GPL v3'
__copyright__ = '2020, Pat Stapleton <pat.stapleton at gmail.com>'
'''
Recipe for ABC News Australia (online)
'''
from calibre.web.feeds.news import BasicNewsRecipe


class ABCNews(BasicNewsRecipe):
    title          = 'ABC News'
    language       = 'en_AU'
    __author__     = 'Pat Stapleton'
    description = 'From the Australian Broadcasting Corporation. The ABC is owned and funded by the Australian Government, but is editorially independent.'
    masthead_url = 'https://www.abc.net.au/cm/lb/8212706/data/news-logo-2017---desktop-print-data.png'
    cover_url = 'https://www.abc.net.au/news/linkableblob/8413676/data/abc-news-og-data.jpg'
    cover_margins = (0,20,'#000000')
    scale_news_images_to_device = True
    oldest_article = 2  # days
    simultaneous_downloads = 10
    max_articles_per_feed = 100
    publication_type = 'newspaper'

#    auto_cleanup   = True # enable this as a backup option if recipe stops working

#    use_embedded_content = False # if set to true will assume that all the article content is within the feed (i.e. won't try to fetch more data)

    no_stylesheets = True
    remove_javascript = True

    keep_only_tags = [dict(id='content')]  # the article content is contained in <main id="content" /> tag

    # ************************************
    # Regular expressions for remove_tags:
    # ************************************
    # remove aside tag - used for overlapping boxes within article
    # aside_reg_exp = '^.*aside.*$'

    # ************************************
    # Clear out all the unwanted html tags:
    # ************************************
    remove_tags = [
#        dict(name='aside', attrs={'name': re.compile(aside_reg_exp, re.IGNORECASE)})
        {
            'name': ['meta', 'link', 'noscript', 'aside']
        },
        {
            'attrs': {
                'data-component': ['Ticker', 'PublishedDate', 'Timestamp', 'Link', 'ShareLink', 'ShareUtility',
                'RelatedStories', 'ArticleTopStories', 'ArticleTopStoriesCard', 'ArticleJustInStories',
                'RelatedTopics', 'Player', 'ArticleSidebar', 'TopStoriesSidebar', 'UtilityBar']
            }
        }
    ]

    # ************************************
    # Tidy up the output to look neat for reading
    # ************************************
    remove_attributes = ['width', 'height', 'style']
    extra_css = '.byline{font-size:smaller;margin-bottom:10px;}.inline-caption{display:block;font-size:smaller;text-decoration: none;}'

    # ************************************
    # Fix images (dynamically generated by ABC news)
    # ************************************
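    def preprocess_html(self, soup):
        # copy the image URL from data-src into src so the image gets downloaded
        for img in soup.findAll('img', attrs={'data-src': True}):
            # data-src may hold several space-separated tokens; keep the last one that looks like a URL
            for x in img['data-src'].split():
                if '/' in x:
                    img['src'] = x
        return soup
    compress_news_images = True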

    feeds          = [
        ('Top Stories', 'https://www.abc.net.au/news/feed/45910/rss.xml'),
        ('Politics', 'https://www.abc.net.au/news/feed/1534/rss.xml'),
        ('World', 'https://www.abc.net.au/news/feed/4405318/rss.xml'),
        ('Business', 'https://www.abc.net.au/news/feed/51892/rss.xml'),
        ('Analysis', 'https://www.abc.net.au/news/feed/7571268/rss.xml'),
        ('Sport', 'https://www.abc.net.au/news/feed/2942460/rss.xml'),
        ('Science', 'https://www.abc.net.au/news/feed/8132426/rss.xml'),
        ('Health', 'https://www.abc.net.au/news/feed/9167762/rss.xml'),
        ('Arts and Entertainment', 'https://www.abc.net.au/news/feed/472/rss.xml'),
        ('Fact Check', 'https://www.abc.net.au/news/feed/5306468/rss.xml'),
#        ('Adelaide', 'https://www.abc.net.au/news/feed/8057540/rss.xml'), #enable by removing # at start of line
#        ('Brisbane', 'https://www.abc.net.au/news/feed/8053540/rss.xml'), #enable by removing # at start of line
#        ('Canberra', 'https://www.abc.net.au/news/feed/8057234/rss.xml'), #enable by removing # at start of line
#        ('Darwin', 'https://www.abc.net.au/news/feed/8057648/rss.xml'), #enable by removing # at start of line
#        ('Hobart', 'https://www.abc.net.au/news/feed/8054562/rss.xml'), #enable by removing # at start of line
#        ('Melbourne', 'https://www.abc.net.au/news/feed/8057136/rss.xml'), #enable by removing # at start of line
#        ('Perth', 'https://www.abc.net.au/news/feed/8057096/rss.xml'), #enable by removing # at start of line
#        ('Sydney', 'https://www.abc.net.au/news/feed/8055316/rss.xml'), #enable by removing # at start of line
    ]
All times are GMT -4.