04-03-2015, 12:47 PM   #2
mikebw
FIXED: Providence Journal recipe compatible with new web site

This is a tested and, so far, working recipe for the Providence Journal web site as redesigned since the sale of the newspaper.

This was my first exploration into recipe writing. It turned out to be much simpler than I expected, thanks to the rich and well-designed structure of Calibre, but also more complicated because of strangeness in the web site. Unless the print version of an article is explicitly requested, the returned page ends up with a blank body about 80% of the time. I suspect this is an issue with the heuristic parsing in Calibre, but I don't know enough to even begin diagnosing it. In any case, the quickest and easiest solution seems to be to override the "print_version()" method, although I did have to read through the source code for "news.py" to figure out how to do that.
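
For anyone who wants to check the URL rewrite outside Calibre, here is a minimal standalone sketch of what "print_version()" does in this recipe. The function name, the constant, and the example link are hypothetical, and the guard against appending the suffix twice is an extra that is not part of the recipe itself.

```python
# Same suffix the recipe's print_version() appends to each article URL.
PRINT_SUFFIX = '&template=printart'


def to_print_url(url):
    """Return the print-template form of an article URL.

    Mirrors the recipe's print_version(). The endswith() guard is an
    addition for illustration only, so the suffix is never added twice.
    """
    if url.endswith(PRINT_SUFFIX):
        return url
    return url + PRINT_SUFFIX


# Hypothetical article link, used only for illustration; real links
# come from the RSS feeds listed in the recipe.
example = 'http://www.providencejournal.com/article/EXAMPLE'
print(to_print_url(example))
```

Inside the recipe, Calibre calls "print_version()" once per article link from the feeds, so the plain one-line version below is all that is needed there.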

Before switching to the print version of articles, I tried setting "delay = 1" (which also implies "simultaneous_downloads = 1"), but that did not help. Now that the recipe requests print versions, I've dropped that setting and it runs at full speed.

I hereby release this recipe into the public domain and would be thrilled to see it included as the new built-in for this web site.


Code:
from calibre.web.feeds.news import BasicNewsRecipe

class ProvidenceJournal(BasicNewsRecipe):
    title          = u'Providence Journal'
    language       = 'en'
    __author__     = 'mikebw'
    oldest_article = 10  # days
    max_articles_per_feed = 100

    no_stylesheets = True
    auto_cleanup = True
    use_embedded_content = False
    ignore_duplicate_articles = {'url'}
    
    publication_type = 'newspaper'
    masthead_url = 'http://www.providencejournal.com/Global/images/head/nameplate/providence-journal_logo.png'

    # ProJo web site often returns blank articles unless print version is explicitly requested
    def print_version(self, url):
        return url + '&template=printart'
    
    # RSS sources documented at http://www.providencejournal.com/section/feed?refresh=false
    feeds          = [
        ('News',
         'http://www.providencejournal.com/news?template=rss&mime=xml'),
        ('Politics',
         'http://www.providencejournal.com/politics?template=rss&mime=xml'),
        ('Sports',
         'http://www.providencejournal.com/sports?template=rss&mime=xml'),
        ('Business',
         'http://www.providencejournal.com/business?template=rss&mime=xml'),
        ('Opinion',
         'http://www.providencejournal.com/opinion?template=rss&mime=xml'),
        ('Entertainment',
         'http://www.providencejournal.com/entertainment?template=rss&mime=xml'),
        ('Lifestyle',
         'http://www.providencejournal.com/lifestyle?template=rss&mime=xml'),
        ('Food',
         'http://www.providencejournal.com/food?template=rss&mime=xml'),
        ('Cars',
         'http://www.providencejournal.com/cars?template=rss&mime=xml'),
        ('Weather',
         'http://www.providencejournal.com/weather?template=rss&mime=xml'),
    ]


Quote:
Originally Posted by mikebw
The web site for The Providence Journal -- http://www.providencejournal.com -- has been changed extensively over the last few weeks but finally seems to have stabilized. The changes broke the recipe included with the distribution, and it is still broken as of Calibre 2.22.

Is anyone else using this recipe? I have no experience with recipe writing myself, but if no one else is motivated to fix it then I can take a shot at it.