This is a tested and so-far working recipe for the new Providence Journal web site that followed the sale of the newspaper.
It was my first exploration into recipe writing. It turned out to be much simpler than I expected, thanks to the very rich and well-designed structure of Calibre, but also more complicated because of strangeness in the web site. Unless the print version of an article is explicitly requested, the returned page ends up with a blank body about 80% of the time. I suspect this is an issue with the heuristic parsing in Calibre, but I don't know enough to even begin diagnosing it. In any case, the quickest and easiest solution seems to be to override the "print_version()" method, although I did have to read through the source code of "news.py" to figure out how to do that.
Before switching to the print version of articles, I tried setting "delay = 1" (which also implies "simultaneous_downloads = 1"), but that did not help. Now that the recipe requests print versions, I've dropped that setting and the recipe runs at full speed.
I hereby release this recipe into the public domain and would be thrilled to see it included as the new built-in for this web site.
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class ProvidenceJournal(BasicNewsRecipe):
    title = u'Providence Journal'
    language = 'en'
    __author__ = 'mikebw'
    oldest_article = 10  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    auto_cleanup = True
    use_embedded_content = False
    ignore_duplicate_articles = {'url'}
    publication_type = 'newspaper'
    masthead_url = 'http://www.providencejournal.com/Global/images/head/nameplate/providence-journal_logo.png'

    # ProJo web site often returns blank articles unless print version is explicitly requested
    def print_version(self, url):
        return url + '&template=printart'

    # RSS sources documented at http://www.providencejournal.com/section/feed?refresh=false
    feeds = [
        ('News',
         'http://www.providencejournal.com/news?template=rss&mime=xml'),
        ('Politics',
         'http://www.providencejournal.com/politics?template=rss&mime=xml'),
        ('Sports',
         'http://www.providencejournal.com/sports?template=rss&mime=xml'),
        ('Business',
         'http://www.providencejournal.com/business?template=rss&mime=xml'),
        ('Opinion',
         'http://www.providencejournal.com/opinion?template=rss&mime=xml'),
        ('Entertainment',
         'http://www.providencejournal.com/entertainment?template=rss&mime=xml'),
        ('Lifestyle',
         'http://www.providencejournal.com/lifestyle?template=rss&mime=xml'),
        ('Food',
         'http://www.providencejournal.com/food?template=rss&mime=xml'),
        ('Cars',
         'http://www.providencejournal.com/cars?template=rss&mime=xml'),
        ('Weather',
         'http://www.providencejournal.com/weather?template=rss&mime=xml'),
    ]
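One caveat about "print_version()": appending '&template=printart' assumes every article URL from the feeds already carries a query string. That has held in my testing, but if the site ever hands out a URL without one, the result would be malformed. Below is a minimal sketch of a more defensive variant, not something the recipe needs today. The helper name "to_print_url" and the example article URLs are made up for illustration, and the sketch uses Python 3's urllib.parse, so it would need adapting (the "urlparse" and "urllib" modules) for a Calibre build running on Python 2.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_print_url(url, template='printart'):
    """Return url with its 'template' query parameter set to `template`.

    Hypothetical helper: works whether or not the URL already has a
    query string, and replaces any existing 'template' parameter.
    """
    parts = urlsplit(url)
    # Keep every existing parameter except a previous 'template' value.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'template']
    query.append(('template', template))
    # urlunsplit inserts the '?' separator itself when the query is non-empty.
    return urlunsplit(parts._replace(query=urlencode(query)))


# Example URLs are placeholders, not real ProJo article addresses.
print(to_print_url('http://www.providencejournal.com/article/1?rssfeed=true'))
print(to_print_url('http://www.providencejournal.com/article/1'))
```

Inside the recipe, "print_version()" could then simply return to_print_url(url), assuming the helper is defined in the same recipe file.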
Quote:
Originally Posted by mikebw
The web site for The Providence Journal -- http://www.providencejournal.com -- has been changed extensively over the last few weeks but finally seems to have stabilized. The changes broke the recipe included with the distribution, and it is still broken as of Calibre 2.22.
Is anyone else using this recipe? I have no experience with recipe writing myself, but if no one else is motivated to fix it then I can take a shot at it.