Fetch Hartford Courant based on Tribune recipe

Being · 12-26-2009, 12:37 PM

I would like to fetch the Hartford Courant based on the recipe for the Chicago Tribune. (Both have similar Web sites, and the Courant is owned by the Tribune Co.) Is the recipe for the Chicago Tribune viewable? Has anyone tried to create a recipe for the Courant? Thanks for any help.

kovidgoyal · 12-26-2009, 02:27 PM

Click the arrow next to the Fetch news button and choose "create custom recipe". Then click the "modify built-in recipe" button and modify the Chicago Tribune recipe.

Being · 12-26-2009, 05:05 PM

Thanks, kovidgoyal! Will do.

Being · 12-26-2009, 05:19 PM

It worked, kovidgoyal! Your program and site are terrific. Many, many thanks.

kovidgoyal · 12-26-2009, 07:40 PM

Consider sharing your recipe so others can use it as well.

Being · 12-26-2009, 08:27 PM

Yes, of course, kovidgoyal. It works, but I might have overlooked some needed changes to the original Chicago Tribune recipe. Anyway, here it is. It's to fetch The Hartford Courant:

Code:

from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe

class HartfordCourant(BasicNewsRecipe):

    title       = 'The Hartford Courant'
    __author__  = 'Kovid Goyal and Sujata Raman'
    description = 'Politics, local and business news from Hartford'
    language = 'en'

    use_embedded_content    = False
    no_stylesheets        = True
    remove_javascript = True

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                           ]
    remove_tags_after = [    {'class':['photo_article',]} ]

    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer"]},
                   {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent"]},
                   dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
                    .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .entry-asset.asset.hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
		'''
    feeds = [
             ('Breaking News', 'http://feeds.feedburner.com/courant-breaking-news/'),
             ('Connecticut News', 'http://feeds.feedburner.com/courant-connecticut-news/'),
             ('Hartford News', 'http://feeds.feedburner.com/courant-hartford/'),
             ('West Hartford News', 'http://feeds.feedburner.com/courant-west-hartford/'),
             ('Politics', 'http://feeds.feedburner.com/courant-politics/'),
             ('Opinion', 'http://feeds.feedburner.com/courant-opinion/'),
             ('Editorials', 'http://feeds.feedburner.com/courant-editorials/'),
             ('Letters', 'http://feeds.feedburner.com/courant-letters/'),
             ('Bob Englehart', 'http://feeds2.feedburner.com/BobEnglehartEnglehartsView'),
             ('Business', 'http://feeds.feedburner.com/courant-business/'),
             ('Consumer', 'http://feeds.feedburner.com/courant-consumer/'),
             ('Shopping', 'http://feeds.feedburner.com/courant-shopping/'),
             ('Entertainment', 'http://feeds.feedburner.com/courant-entertainment/'),
             ('Arts & Theater', 'http://feeds.feedburner.com/courant-stage/'),
             ('Music', 'http://feeds.feedburner.com/courant-music/'),
             ('TV', 'http://feeds.feedburner.com/courant-tv/'),
             ('Movies', 'http://feeds.feedburner.com/courant-movies/'),
             #('Metromix headlines', 'http://feeds.feedburner.com/metromix/topheadlines/'),
             #('Metromix events', 'http://feeds.feedburner.com/metromix/events/'),
             #('Metromix restaurants', 'http://feeds.feedburner.com/metromix/restaurants/'),
             ('Peter Marteka', 'http://feeds.feedburner.com/courant-marteka-column/'),
             ('Susan Campbell', 'http://feeds.feedburner.com/courant-campbell-column/'),
             ('Helen Ubinas', 'http://feeds.feedburner.com/courant-helen-ubinas-column/'),
             ('Jim Shea', 'http://feeds.feedburner.com/courant-jim-shea-column/'),
             ('Tom Condon', 'http://feeds.feedburner.com/courant-tom-condon-column/'),
             ('Colin McEnroe', 'http://feeds.feedburner.com/courant-colin-mcenroe-column/'),
             ]


    def get_article_url(self, article):
        # Prefer the original article link recorded by FeedBurner, then fall back to guid, then link.
        url = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        print(url)  # debug output; the function-call form works in Python 2 and 3
        return url


    def postprocess_html(self, soup, first_fetch):
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

        for tag in soup.findAll('form', dict(attrs={'name':["comments_form"]})):
            tag.extract()
        for tag in soup.findAll('font', dict(attrs={'id':["cr-other-headlines"]})):
            tag.extract()

        return soup

Being · 12-27-2009, 09:54 AM

This recipe for The Hartford Courant is a little more complete, with the addition of national news, sports, etc. Users can edit it to their liking, adding more columnists, etc., by going to www.courant.com, clicking on RSS at the bottom, and copying the correct URLs for the RSS feeds they want to add. For example, Politics is included with this line:

('Politics', 'http://feeds.feedburner.com/courant-politics/'),
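
To make that edit concrete, here is a rough sketch of building entries in the same (title, URL) form. The `courant_feed` helper and the `FEEDBURNER_BASE` constant are hypothetical names made up for illustration; they are not part of calibre. The sketch also assumes a feed lives directly under feeds.feedburner.com, which is not true of every feed (the Bob Englehart one uses feeds2 and a different name), so check each URL on the site's RSS page.

```python
# Hypothetical helper, not part of calibre: builds one (title, url) feed entry.
FEEDBURNER_BASE = 'http://feeds.feedburner.com/'


def courant_feed(title, slug):
    """Return a (title, url) tuple in the form used by the recipe's feeds list."""
    return (title, FEEDBURNER_BASE + slug.strip('/') + '/')


# Entries built this way can be pasted into the recipe's feeds list.
feeds = [
    courant_feed('Politics', 'courant-politics'),
    courant_feed('Sports', 'courant-sports'),
]

for title, url in feeds:
    print('%s -> %s' % (title, url))
```

Writing the tuples out by hand, as in the line above, works just as well; the helper only saves retyping the common prefix.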

Here's the complete recipe:

Code:

from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe

class HartfordCourant(BasicNewsRecipe):

    title       = 'The Hartford Courant'
    __author__  = 'Kovid Goyal and Sujata Raman'
    description = 'Politics, local and business news from Hartford'
    language = 'en'

    use_embedded_content    = False
    no_stylesheets        = True
    remove_javascript = True

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                           ]
    remove_tags_after = [    {'class':['photo_article',]} ]

    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer"]},
                   {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent"]},
                   dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
                    .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .entry-asset.asset.hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
		'''
    feeds = [
             ('Breaking News', 'http://feeds.feedburner.com/courant-breaking-news/'),
             ('Nation/World News', 'http://feeds.feedburner.com/courant-nation-world/'),
             ('Connecticut News', 'http://feeds.feedburner.com/courant-connecticut-news/'),
             ('Hartford News', 'http://feeds.feedburner.com/courant-hartford/'),
             ('West Hartford News', 'http://feeds.feedburner.com/courant-west-hartford/'),
             ('Bristol', 'http://feeds.feedburner.com/courant-bristol/'),
             ('Politics', 'http://feeds.feedburner.com/courant-politics/'),
             ('Opinion', 'http://feeds.feedburner.com/courant-opinion/'),
             ('Editorials', 'http://feeds.feedburner.com/courant-editorials/'),
             ('Letters', 'http://feeds.feedburner.com/courant-letters/'),
             ('Bob Englehart', 'http://feeds2.feedburner.com/BobEnglehartEnglehartsView'),
             ('Business', 'http://feeds.feedburner.com/courant-business/'),
             ('Sports', 'http://feeds.feedburner.com/courant-sports/'), 
             ('Features', 'http://feeds.feedburner.com/courant-features/'),
             ('Consumer', 'http://feeds.feedburner.com/courant-consumer/'),
             ('Shopping', 'http://feeds.feedburner.com/courant-shopping/'),
             ('Entertainment', 'http://feeds.feedburner.com/courant-entertainment/'),
             ('Arts & Theater', 'http://feeds.feedburner.com/courant-stage/'),
             ('Music', 'http://feeds.feedburner.com/courant-music/'),
             ('TV', 'http://feeds.feedburner.com/courant-tv/'),
             ('Movies', 'http://feeds.feedburner.com/courant-movies/'),
             #('Metromix headlines', 'http://feeds.feedburner.com/metromix/topheadlines/'),
             #('Metromix events', 'http://feeds.feedburner.com/metromix/events/'),
             #('Metromix restaurants', 'http://feeds.feedburner.com/metromix/restaurants/'),
             ('Outdoors', 'http://feeds.feedburner.com/courant-outdoors/'),
             ('Peter Marteka', 'http://feeds.feedburner.com/courant-marteka-column/'),
             ('Susan Campbell', 'http://feeds.feedburner.com/courant-campbell-column/'),
             ('Helen Ubinas', 'http://feeds.feedburner.com/courant-helen-ubinas-column/'),
             ('Jim Shea', 'http://feeds.feedburner.com/courant-jim-shea-column/'),
             ('Tom Condon', 'http://feeds.feedburner.com/courant-tom-condon-column/'),
             ('Colin McEnroe', 'http://feeds.feedburner.com/courant-colin-mcenroe-column/'),
             ]


    def get_article_url(self, article):
        # Prefer the original article link recorded by FeedBurner, then fall back to guid, then link.
        url = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        print(url)  # debug output; the function-call form works in Python 2 and 3
        return url


    def postprocess_html(self, soup, first_fetch):
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

        for tag in soup.findAll('form', dict(attrs={'name':["comments_form"]})):
            tag.extract()
        for tag in soup.findAll('font', dict(attrs={'id':["cr-other-headlines"]})):
            tag.extract()

        return soup
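
For anyone wondering what get_article_url is doing: it picks the first of three keys that the feed entry has, in the order feedburner_origlink, guid, link. The standalone sketch below mimics that lookup with plain dicts so it can be run outside calibre. The function name `pick_article_url` and the example.com URLs are made up for illustration, and real feed entries may not behave exactly like a dict.

```python
def pick_article_url(article):
    """Same fallback chain as the recipe: origlink first, then guid, then link."""
    return article.get('feedburner_origlink',
                       article.get('guid', article.get('link')))


# Made-up entries; real ones come from the parsed RSS feed.
with_origlink = {
    'feedburner_origlink': 'http://example.com/story-1',
    'guid': 'http://example.com/guid-1',
    'link': 'http://example.com/redirect-1',
}
guid_only = {
    'guid': 'http://example.com/guid-2',
    'link': 'http://example.com/redirect-2',
}
link_only = {'link': 'http://example.com/redirect-3'}

for entry in (with_origlink, guid_only, link_only, {}):
    print(pick_article_url(entry))
```

An entry with none of the three keys gives None, so a feed without usable links would need extra handling.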
