03-26-2015, 03:21 AM | #1 |
Member
Posts: 22
Karma: 10
Join Date: Nov 2014
Device: none
|
Providence Journal recipe broken by web site changes
The web site for The Providence Journal -- http://www.providencejournal.com -- has been changed extensively over the last few weeks but finally seems to have stabilized. The changes broke the recipe included with the distribution, and it is still broken as of Calibre 2.22.
Is anyone else using this recipe? I have no experience with recipe writing myself, but if no one else is motivated to fix it then I can take a shot at it. |
04-03-2015, 12:47 PM | #2 | |
Member
Posts: 22
Karma: 10
Join Date: Nov 2014
Device: none
|
FIXED: Providence Journal recipe compatible with new web site
This is a tested and so-far working recipe for the new Providence Journal web site since the sale of the newspaper.
It was my first exploration into recipe writing. It turned out to be much simpler than I expected because of the very rich and well-designed structure of Calibre, but also more complicated because of strangeness in the web site.
Unless the print version of an article is explicitly requested, about 80% of the time the returned page ends up with a blank body. I suspect this is an issue with the heuristic parsing in Calibre, but I don't know enough to even begin diagnosing that. In any case, the quickest and easiest solution seems to be to override the "print_version()" method, although I did have to read through the source code for "news.py" to figure out how to do that.
Before switching to the print version of articles, I tried setting "delay = 1" (which also implies "simultaneous_downloads = 1"), but that did not help. Now that the recipe requests print versions, I've dropped that idea and it runs at full speed.
I hereby release this recipe into the public domain and would be thrilled to see it included as the new built-in for this web site.
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class ProvidenceJournal(BasicNewsRecipe):
    title = u'Providence Journal'
    language = 'en'
    __author__ = 'mikebw'
    oldest_article = 10  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    auto_cleanup = True
    use_embedded_content = False
    ignore_duplicate_articles = {'url'}
    publication_type = 'newspaper'
    masthead_url = 'http://www.providencejournal.com/Global/images/head/nameplate/providence-journal_logo.png'

    # ProJo web site often returns blank articles unless print version is explicitly requested
    def print_version(self, url):
        return url + '&template=printart'

    # RSS sources documented at http://www.providencejournal.com/section/feed?refresh=false
    feeds = [
        ('News', 'http://www.providencejournal.com/news?template=rss&mime=xml'),
        ('Politics', 'http://www.providencejournal.com/politics?template=rss&mime=xml'),
        ('Sports', 'http://www.providencejournal.com/sports?template=rss&mime=xml'),
        ('Business', 'http://www.providencejournal.com/business?template=rss&mime=xml'),
        ('Opinion', 'http://www.providencejournal.com/opinion?template=rss&mime=xml'),
        ('Entertainment', 'http://www.providencejournal.com/entertainment?template=rss&mime=xml'),
        ('Lifestyle', 'http://www.providencejournal.com/lifestyle?template=rss&mime=xml'),
        ('Food', 'http://www.providencejournal.com/food?template=rss&mime=xml'),
        ('Cars', 'http://www.providencejournal.com/cars?template=rss&mime=xml'),
        ('Weather', 'http://www.providencejournal.com/weather?template=rss&mime=xml'),
    ]
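The print_version() override in the recipe appends '&template=printart', which relies on every article URL already carrying a query string. A slightly more defensive variant could look like the sketch below. The helper name printable_url is hypothetical, and the idea that some article URLs might arrive without a query string is an assumption, not something observed on the site.

```python
def printable_url(url, param='template=printart'):
    """Return url with the print-template parameter appended.

    Hypothetical helper: picks '?' or '&' depending on whether the URL
    already has a query string, and keeps any '#fragment' at the end.
    """
    base, hash_mark, fragment = url.partition('#')
    if param in base:
        # Parameter already present; leave the URL untouched.
        return url
    separator = '&' if '?' in base else '?'
    return base + separator + param + hash_mark + fragment
```

Inside the recipe class, print_version(self, url) would then simply return printable_url(url).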
|
04-03-2015, 10:33 PM | #3 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Done. Just FYI, if you suspect parsing problems you can implement preprocess_raw_html() in your recipe, which will give you the unparsed HTML.
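As a rough illustration of that suggestion, the sketch below checks how much visible text the raw page contains before Calibre parses it, which could help tell an empty download apart from a parsing problem. BasicNewsRecipe calls preprocess_raw_html(raw_html, url) with the unparsed page and expects the HTML back; the names body_text_length and RawHtmlCheckMixin are hypothetical, and the regex-based text extraction is only a crude approximation.

```python
import re


def body_text_length(raw_html):
    """Count non-whitespace characters of text inside <body> (crude estimate)."""
    match = re.search(r'<body\b[^>]*>(.*?)</body>', raw_html,
                      re.IGNORECASE | re.DOTALL)
    if match is None:
        return 0
    # Drop script/style blocks, then strip the remaining tags.
    body = re.sub(r'<(script|style)\b.*?</\1>', '', match.group(1),
                  flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<[^>]+>', '', body)
    return len(''.join(text.split()))


class RawHtmlCheckMixin(object):
    """Hypothetical mixin: list it before BasicNewsRecipe in a recipe's bases."""

    def preprocess_raw_html(self, raw_html, url):
        count = body_text_length(raw_html)
        # Calibre recipes provide self.log; outside Calibre it may be absent.
        log = getattr(self, 'log', None)
        if log is not None:
            log('raw body text for %s: %d characters' % (url, count))
        # Return the HTML unchanged so normal processing continues.
        return raw_html
```

If the logged count is near zero for the articles that come out blank, the server is returning an empty page; if it is large, the content is being lost later, during parsing or cleanup.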
|
04-05-2015, 11:30 PM | #4 |
Member
Posts: 22
Karma: 10
Join Date: Nov 2014
Device: none
|
That's very useful advice, thank you. I'm very new at this, but the source code is very clean and the documentation is thorough.
|