#1
Member
Posts: 22
Karma: 10
Join Date: Nov 2011
Location: Australia
Device: Kindle 4
Updated recipe for ABC News Australia
Hi,
A fellow named Vikas emailed me about the previous recipe, which had apparently stopped working. I've rewritten it so that it works again with the latest version of the website.
Code:

    #!/usr/bin/env python2
    # vim:fileencoding=utf-8
    from __future__ import unicode_literals, division, absolute_import, print_function

    __license__ = 'GPL v3'
    __copyright__ = '2020, Pat Stapleton <pat.stapleton at gmail.com>'
    '''
    Recipe for ABC News Australia (online)
    '''
    from calibre.web.feeds.news import BasicNewsRecipe


    class ABCNews(BasicNewsRecipe):
        title = 'ABC News'
        language = 'en_AU'
        __author__ = 'Pat Stapleton'
        description = 'From the Australian Broadcasting Corporation. The ABC is owned and funded by the Australian Government, but is editorially independent.'
        masthead_url = 'https://www.abc.net.au/cm/lb/8212706/data/news-logo-2017---desktop-print-data.png'
        cover_url = 'https://www.abc.net.au/news/linkableblob/8413676/data/abc-news-og-data.jpg'
        cover_margins = (0, 20, '#000000')
        scale_news_images_to_device = True
        oldest_article = 7  # days
        max_articles_per_feed = 100
        publication_type = 'newspaper'

        # auto_cleanup = True  # enable this as a backup option if recipe stops working

        # use_embedded_content = False  # if set to true will assume that all the article content is within the feed (i.e. won't try to fetch more data)

        no_stylesheets = True
        remove_javascript = True

        keep_only_tags = [dict(id='content')]  # the article content is contained in <main id="content" /> tag

        # ************************************
        # Regular expressions for remove_tags:
        # ************************************
        # remove aside tag - used for overlapping boxes within article
        # aside_reg_exp = '^.*aside.*$'

        # ************************************
        # Clear out all the unwanted html tags:
        # ************************************
        remove_tags = [
            # dict(name='aside', attrs={'name': re.compile(aside_reg_exp, re.IGNORECASE)})
            {
                'name': ['meta', 'link', 'noscript', 'aside']
            },
            {
                'attrs': {
                    'data-component': [
                        'Ticker', 'PublishedDate', 'Timestamp', 'Link',
                        'ShareLink', 'ShareUtility', 'RelatedStories',
                        'ArticleTopStories', 'ArticleTopStoriesCard',
                        'ArticleJustInStories', 'RelatedTopics', 'Player',
                        'ArticleSidebar', 'TopStoriesSidebar', 'UtilityBar'
                    ]
                }
            }
        ]

        # ************************************
        # Tidy up the output to look neat for reading
        # ************************************
        remove_attributes = ['width', 'height', 'style']
        extra_css = '.byline{font-size:smaller;margin-bottom:10px;}.inline-caption{display:block;font-size:smaller;text-decoration: none;}'

        # ************************************
        # Fix images (dynamically generated by ABC news)
        # ************************************
        def preprocess_html(self, soup):
            for img in soup.findAll('img', attrs={'data-src': True}):
                for x in img['data-src'].split():
                    if '/' in x:
                        img['src'] = x
            return soup

        compress_news_images = True

        feeds = [
            ('Top Stories', 'https://www.abc.net.au/news/feed/45910/rss.xml'),
            ('Politics', 'https://www.abc.net.au/news/feed/1534/rss.xml'),
            ('World', 'https://www.abc.net.au/news/feed/4405318/rss.xml'),
            ('Business', 'https://www.abc.net.au/news/feed/51892/rss.xml'),
            ('Analysis', 'https://www.abc.net.au/news/feed/7571268/rss.xml'),
            ('Sport', 'https://www.abc.net.au/news/feed/2942460/rss.xml'),
            ('Science', 'https://www.abc.net.au/news/feed/8132426/rss.xml'),
            ('Health', 'https://www.abc.net.au/news/feed/9167762/rss.xml'),
            ('Arts and Entertainment', 'https://www.abc.net.au/news/feed/472/rss.xml'),
            ('Fact Check', 'https://www.abc.net.au/news/feed/5306468/rss.xml'),
            ('Adelaide', 'https://www.abc.net.au/news/feed/8057540/rss.xml'),
            ('Brisbane', 'https://www.abc.net.au/news/feed/8053540/rss.xml'),
            ('Canberra', 'https://www.abc.net.au/news/feed/8057234/rss.xml'),
            ('Darwin', 'https://www.abc.net.au/news/feed/8057648/rss.xml'),
            ('Hobart', 'https://www.abc.net.au/news/feed/8054562/rss.xml'),
            ('Melbourne', 'https://www.abc.net.au/news/feed/8057136/rss.xml'),
            ('Perth', 'https://www.abc.net.au/news/feed/8057096/rss.xml'),
            ('Sydney', 'https://www.abc.net.au/news/feed/8055316/rss.xml'),
        ]
#2
Some feedback from Vikas was that it takes a while to download all the articles.
The oldest_article setting is now 7 days (previously it was 2), so it might be worth lowering that if you don't like how long it takes. You could also comment out the local city feeds (e.g. Sydney, Melbourne, Brisbane) if you don't want them. The articles also contain more images than on the old website, which adds slightly to the download time.
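If commenting lines in and out gets tedious, one alternative is to filter the feed list by title. The snippet below is only a rough sketch of that idea and is not part of the recipe: `drop_feeds`, `ALL_FEEDS` and `LOCAL_CITY_FEEDS` are names made up for illustration, and only a few of the feeds are listed to keep it short.

```python
# Hypothetical helper (not part of calibre or the recipe above):
# build a shorter feed list by dropping unwanted titles.

# A shortened copy of the recipe's feed list, for illustration only.
ALL_FEEDS = [
    ('Top Stories', 'https://www.abc.net.au/news/feed/45910/rss.xml'),
    ('Politics', 'https://www.abc.net.au/news/feed/1534/rss.xml'),
    ('Melbourne', 'https://www.abc.net.au/news/feed/8057136/rss.xml'),
    ('Sydney', 'https://www.abc.net.au/news/feed/8055316/rss.xml'),
]

# Titles of the local capital city feeds used in the recipe.
LOCAL_CITY_FEEDS = {
    'Adelaide', 'Brisbane', 'Canberra', 'Darwin',
    'Hobart', 'Melbourne', 'Perth', 'Sydney',
}


def drop_feeds(feeds, unwanted_titles):
    """Return the (title, url) pairs whose title is not in unwanted_titles."""
    return [(title, url) for title, url in feeds if title not in unwanted_titles]


feeds = drop_feeds(ALL_FEEDS, LOCAL_CITY_FEEDS)
for title, url in feeds:
    print(title, url)
```

Inside the recipe class you would assign the filtered list to `feeds` and lower `oldest_article` as described above. Fewer feeds and a shorter time window both mean fewer articles to fetch.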
#3
Here's a modified version that limits the articles to the last 2 days, as in the original recipe (instead of 7 days), and increases the simultaneous downloads to 10 (from the default of 5). I've also commented out the local capital city feeds, which can be re-enabled as desired.
This should hopefully make the download quicker for the average user.
Code:

    #!/usr/bin/env python
    # vim:fileencoding=utf-8
    from __future__ import unicode_literals, division, absolute_import, print_function

    __license__ = 'GPL v3'
    __copyright__ = '2020, Pat Stapleton <pat.stapleton at gmail.com>'
    '''
    Recipe for ABC News Australia (online)
    '''
    from calibre.web.feeds.news import BasicNewsRecipe


    class ABCNews(BasicNewsRecipe):
        title = 'ABC News'
        language = 'en_AU'
        __author__ = 'Pat Stapleton'
        description = 'From the Australian Broadcasting Corporation. The ABC is owned and funded by the Australian Government, but is editorially independent.'
        masthead_url = 'https://www.abc.net.au/cm/lb/8212706/data/news-logo-2017---desktop-print-data.png'
        cover_url = 'https://www.abc.net.au/news/linkableblob/8413676/data/abc-news-og-data.jpg'
        cover_margins = (0, 20, '#000000')
        scale_news_images_to_device = True
        oldest_article = 2  # days
        simultaneous_downloads = 10
        max_articles_per_feed = 100
        publication_type = 'newspaper'

        # auto_cleanup = True  # enable this as a backup option if recipe stops working

        # use_embedded_content = False  # if set to true will assume that all the article content is within the feed (i.e. won't try to fetch more data)

        no_stylesheets = True
        remove_javascript = True

        keep_only_tags = [dict(id='content')]  # the article content is contained in <main id="content" /> tag

        # ************************************
        # Regular expressions for remove_tags:
        # ************************************
        # remove aside tag - used for overlapping boxes within article
        # aside_reg_exp = '^.*aside.*$'

        # ************************************
        # Clear out all the unwanted html tags:
        # ************************************
        remove_tags = [
            # dict(name='aside', attrs={'name': re.compile(aside_reg_exp, re.IGNORECASE)})
            {
                'name': ['meta', 'link', 'noscript', 'aside']
            },
            {
                'attrs': {
                    'data-component': [
                        'Ticker', 'PublishedDate', 'Timestamp', 'Link',
                        'ShareLink', 'ShareUtility', 'RelatedStories',
                        'ArticleTopStories', 'ArticleTopStoriesCard',
                        'ArticleJustInStories', 'RelatedTopics', 'Player',
                        'ArticleSidebar', 'TopStoriesSidebar', 'UtilityBar'
                    ]
                }
            }
        ]

        # ************************************
        # Tidy up the output to look neat for reading
        # ************************************
        remove_attributes = ['width', 'height', 'style']
        extra_css = '.byline{font-size:smaller;margin-bottom:10px;}.inline-caption{display:block;font-size:smaller;text-decoration: none;}'

        # ************************************
        # Fix images (dynamically generated by ABC news)
        # ************************************
        def preprocess_html(self, soup):
            for img in soup.findAll('img', attrs={'data-src': True}):
                for x in img['data-src'].split():
                    if '/' in x:
                        img['src'] = x
            return soup

        compress_news_images = True

        feeds = [
            ('Top Stories', 'https://www.abc.net.au/news/feed/45910/rss.xml'),
            ('Politics', 'https://www.abc.net.au/news/feed/1534/rss.xml'),
            ('World', 'https://www.abc.net.au/news/feed/4405318/rss.xml'),
            ('Business', 'https://www.abc.net.au/news/feed/51892/rss.xml'),
            ('Analysis', 'https://www.abc.net.au/news/feed/7571268/rss.xml'),
            ('Sport', 'https://www.abc.net.au/news/feed/2942460/rss.xml'),
            ('Science', 'https://www.abc.net.au/news/feed/8132426/rss.xml'),
            ('Health', 'https://www.abc.net.au/news/feed/9167762/rss.xml'),
            ('Arts and Entertainment', 'https://www.abc.net.au/news/feed/472/rss.xml'),
            ('Fact Check', 'https://www.abc.net.au/news/feed/5306468/rss.xml'),
            # ('Adelaide', 'https://www.abc.net.au/news/feed/8057540/rss.xml'),  # enable by removing # at start of line
            # ('Brisbane', 'https://www.abc.net.au/news/feed/8053540/rss.xml'),  # enable by removing # at start of line
            # ('Canberra', 'https://www.abc.net.au/news/feed/8057234/rss.xml'),  # enable by removing # at start of line
            # ('Darwin', 'https://www.abc.net.au/news/feed/8057648/rss.xml'),  # enable by removing # at start of line
            # ('Hobart', 'https://www.abc.net.au/news/feed/8054562/rss.xml'),  # enable by removing # at start of line
            # ('Melbourne', 'https://www.abc.net.au/news/feed/8057136/rss.xml'),  # enable by removing # at start of line
            # ('Perth', 'https://www.abc.net.au/news/feed/8057096/rss.xml'),  # enable by removing # at start of line
            # ('Sydney', 'https://www.abc.net.au/news/feed/8055316/rss.xml'),  # enable by removing # at start of line
        ]
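
For anyone wondering what the image fix in preprocess_html does, here is the same selection logic pulled out into a standalone function that runs without calibre. Treat it as a sketch: `pick_image_url` is a name invented for this illustration, and the sample `data-src` value is made up, so the real attribute on the ABC site may look different.

```python
# Standalone illustration of the token selection used in preprocess_html.
# pick_image_url is a hypothetical name; it is not part of the recipe or calibre.

def pick_image_url(data_src):
    """Return the last whitespace-separated token containing '/', or None.

    This mirrors the recipe's loop, where every matching token overwrites
    img['src'], so the last match is the one that sticks.
    """
    chosen = None
    for token in data_src.split():
        if '/' in token:
            chosen = token
    return chosen


# Made-up sample value: two URLs, each followed by a width descriptor.
sample = 'https://example.com/small.jpg 340w https://example.com/large.jpg 700w'
print(pick_image_url(sample))
```

Tokens without a slash (such as the width descriptors in the sample) are skipped, and if nothing matches the function returns None, which corresponds to the recipe leaving `src` untouched.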