Old 03-26-2015, 03:21 AM   #1
mikebw
Member
mikebw began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Nov 2014
Device: none
Providence Journal recipe broken by web site changes

The web site for The Providence Journal -- http://www.providencejournal.com -- has been changed extensively over the last few weeks but finally seems to have stabilized. The changes broke the recipe included with the distribution, and it is still broken as of Calibre 2.22.

Is anyone else using this recipe? I have no experience with recipe writing myself, but if no one else is motivated to fix it then I can take a shot at it.
Old 04-03-2015, 12:47 PM   #2
mikebw
FIXED: Providence Journal recipe compatible with new web site

This is a tested and so-far working recipe for the new Providence Journal web site since the sale of the newspaper.

This was my first attempt at writing a recipe. It turned out to be much simpler than I expected, thanks to the rich and well-designed structure of Calibre, but also more complicated because of strangeness in the web site. Unless the print version of an article is explicitly requested, about 80% of the time the returned page ends up with a blank body. I suspect this is an issue with the heuristic parsing in Calibre, but I don't know enough to even begin diagnosing that. In any case, the quickest and easiest solution seems to be to override the "print_version()" method, although I did have to read through the source code for "news.py" to figure out how to do that.

Before switching to the print version of articles, I tried setting "delay = 1" (which also implies "simultaneous_downloads = 1"), but that did not help. Since requesting print versions fixes the problem, I've dropped the delay and this recipe runs at full speed.

I hereby release this recipe into the public domain and would be thrilled to see it included as the new built-in for this web site.


Code:
from calibre.web.feeds.news import BasicNewsRecipe

class ProvidenceJournal(BasicNewsRecipe):
    title          = u'Providence Journal'
    language       = 'en'
    __author__     = 'mikebw'
    oldest_article = 10  # days
    max_articles_per_feed = 100

    no_stylesheets = True
    auto_cleanup = True
    use_embedded_content = False
    ignore_duplicate_articles = {'url'}
    
    publication_type = 'newspaper'
    masthead_url = 'http://www.providencejournal.com/Global/images/head/nameplate/providence-journal_logo.png'

    # The ProJo web site often returns blank articles unless the print
    # version is explicitly requested.
    def print_version(self, url):
        return url + '&template=printart'

    # RSS sources documented at http://www.providencejournal.com/section/feed?refresh=false
    feeds = [
        ('News',
         'http://www.providencejournal.com/news?template=rss&mime=xml'),
        ('Politics',
         'http://www.providencejournal.com/politics?template=rss&mime=xml'),
        ('Sports',
         'http://www.providencejournal.com/sports?template=rss&mime=xml'),
        ('Business',
         'http://www.providencejournal.com/business?template=rss&mime=xml'),
        ('Opinion',
         'http://www.providencejournal.com/opinion?template=rss&mime=xml'),
        ('Entertainment',
         'http://www.providencejournal.com/entertainment?template=rss&mime=xml'),
        ('Lifestyle',
         'http://www.providencejournal.com/lifestyle?template=rss&mime=xml'),
        ('Food',
         'http://www.providencejournal.com/food?template=rss&mime=xml'),
        ('Cars',
         'http://www.providencejournal.com/cars?template=rss&mime=xml'),
        ('Weather',
         'http://www.providencejournal.com/weather?template=rss&mime=xml'),
    ]
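
A note on the print_version() override in the recipe: it appends '&template=printart' directly, which presumes that every article URL coming out of the feeds already contains a query string. If that ever stops being true, a slightly more defensive variant might look like the sketch below. The helper name add_print_template and the example URLs in the usage note are made up for illustration; only the template=printart parameter comes from the recipe.

```python
def add_print_template(url):
    """Append template=printart, using '?' or '&' as the URL requires.

    Hypothetical helper, not part of calibre. It does not handle URLs
    that carry a '#fragment'.
    """
    separator = '&' if '?' in url else '?'
    return url + separator + 'template=printart'
```

Inside the recipe, print_version(self, url) would then simply return add_print_template(url). For a URL such as http://www.providencejournal.com/article?id=1 (an invented example) the result is identical to what the posted recipe produces.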


Old 04-03-2015, 10:33 PM   #3
kovidgoyal
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Done. Just FYI, if you suspect parsing problems you can implement preprocess_raw() in your recipe, which will give you the unparsed HTML.
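
For anyone who wants to try this: in the calibre recipe API documentation the hook is listed as preprocess_raw_html(raw_html, url), and it must return the HTML to be parsed. The following is only a rough sketch of how it could be used to check whether the server really sends a blank body or whether the text is lost later during cleanup. The helper visible_body_length, the class name, and the stand-in base class used outside calibre are all assumptions made for illustration.

```python
import re

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the helper below can be tried outside calibre.
    class BasicNewsRecipe(object):
        pass

BODY_RE = re.compile(r'<body[^>]*>(.*?)</body>', re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r'<[^>]+>')


def visible_body_length(raw_html):
    """Count non-whitespace characters left in <body> after stripping tags.

    Crude on purpose: inline script and style text is counted too.
    """
    match = BODY_RE.search(raw_html)
    if match is None:
        return 0
    text = TAG_RE.sub('', match.group(1))
    return len(''.join(text.split()))


class ProvidenceJournalDebug(BasicNewsRecipe):
    title = u'Providence Journal (debug)'

    def preprocess_raw_html(self, raw_html, url):
        # Log how much text arrived, then pass the page through unchanged.
        self.log.warn('%s: %d text characters in body'
                      % (url, visible_body_length(raw_html)))
        return raw_html
```

If the logged count is near zero for the non-print URLs, the blank pages come from the site itself; if it is large, the content is being dropped after download, for example by auto_cleanup.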
Old 04-05-2015, 11:30 PM   #4
mikebw
That's very useful advice, thank you. I'm very new at this, but the source code is very clean and the documentation is thorough.
