Old 12-26-2009, 12:37 PM   #1
Being
Junior Member
Being began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Dec 2009
Device: Kindle
Fetch Hartford Courant based on Tribune recipe

I would like to fetch the Hartford Courant based on the recipe for the Chicago Tribune. (Both have similar Web sites, and the Courant is owned by the Tribune Co.) Is the recipe for the Chicago Tribune viewable? Has anyone tried to create a recipe for the Courant? Thanks for any help.
Old 12-26-2009, 02:27 PM   #2
kovidgoyal
creator of calibre
Posts: 45,322
Karma: 27111242
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Click the arrow next to the Fetch news button and choose Create custom recipe. Then click the Modify built-in recipe button and modify the Chicago Tribune recipe.
Old 12-26-2009, 05:05 PM   #3
Being
Thanks, kovidgoyal! Will do.
Old 12-26-2009, 05:19 PM   #4
Being
It worked, kovidgoyal! Your program and site are terrific. Many, many thanks.
Old 12-26-2009, 07:40 PM   #5
kovidgoyal
Consider sharing your recipe so others can use it as well.
Old 12-26-2009, 08:27 PM   #6
Being
Yes, of course, kovidgoyal. It works, but I might have overlooked some needed changes to the original Chicago Tribune recipe. Anyway, here it is. It's to fetch The Hartford Courant:

Code:
from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe

class HartfordCourant(BasicNewsRecipe):

    title       = 'The Hartford Courant'
    __author__  = 'Kovid Goyal and Sujata Raman'
    description = 'Politics, local and business news from Hartford'
    language = 'en'

    use_embedded_content    = False
    no_stylesheets        = True
    remove_javascript = True

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                           ]
    remove_tags_after = [    {'class':['photo_article',]} ]

    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer"]},
                   {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent"]},
                   dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
                    .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .entry-asset.asset.hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
    feeds = [
             ('Breaking News', 'http://feeds.feedburner.com/courant-breaking-news/'),
             ('Connecticut News', 'http://feeds.feedburner.com/courant-connecticut-news/'),
             ('Hartford News', 'http://feeds.feedburner.com/courant-hartford/'),
             ('West Hartford News', 'http://feeds.feedburner.com/courant-west-hartford/'),
             ('Politics', 'http://feeds.feedburner.com/courant-politics/'),
             ('Opinion', 'http://feeds.feedburner.com/courant-opinion/'),
             ('Editorials', 'http://feeds.feedburner.com/courant-editorials/'),
             ('Letters', 'http://feeds.feedburner.com/courant-letters/'),
             ('Bob Englehart', 'http://feeds2.feedburner.com/BobEnglehartEnglehartsView'),
             ('Business', 'http://feeds.feedburner.com/courant-business/'),
             ('Consumer', 'http://feeds.feedburner.com/courant-consumer/'),
             ('Shopping', 'http://feeds.feedburner.com/courant-shopping/'),
             ('Arts & Theater', 'http://feeds.feedburner.com/courant-stage/'),
             ('Entertainment', 'http://feeds.feedburner.com/courant-entertainment/'),
             ('Music', 'http://feeds.feedburner.com/courant-music/'),
             ('TV', 'http://feeds.feedburner.com/courant-tv/'),
             ('Movies', 'http://feeds.feedburner.com/courant-movies/'),
             #('Metromix headlines', 'http://feeds.feedburner.com/metromix/topheadlines/'),
             #('Metromix events', 'http://feeds.feedburner.com/metromix/events/'),
             #('Metromix restaurants', 'http://feeds.feedburner.com/metromix/restaurants/'),
             ('Peter Marteka', 'http://feeds.feedburner.com/courant-marteka-column/'),
             ('Susan Campbell', 'http://feeds.feedburner.com/courant-campbell-column/'),
             ('Helen Ubinas', 'http://feeds.feedburner.com/courant-helen-ubinas-column/'),
             ('Jim Shea', 'http://feeds.feedburner.com/courant-jim-shea-column/'),
             ('Tom Condon', 'http://feeds.feedburner.com/courant-tom-condon-column/'),
             ('Colin McEnroe', 'http://feeds.feedburner.com/courant-colin-mcenroe-column/'),
             ]


    def get_article_url(self, article):
        # Prefer the original article link recorded by FeedBurner, then
        # fall back to the entry's guid and finally to its plain link.
        return article.get('feedburner_origlink', article.get('guid', article.get('link')))


    def postprocess_html(self, soup, first_fetch):
        # Turn table markup into plain divs.
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

        # Drop the comments form and the "other headlines" block.
        for tag in soup.findAll('form', attrs={'name':["comments_form"]}):
            tag.extract()
        for tag in soup.findAll('font', attrs={'id':["cr-other-headlines"]}):
            tag.extract()

        return soup

Last edited by kovidgoyal; 12-26-2009 at 08:55 PM.
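
For anyone adapting the recipe, the lookup order in get_article_url can be tried outside calibre. This is only a rough sketch: plain dicts stand in for the feed entries calibre passes to the method, the function name pick_article_url is hypothetical, and the example.com URLs are made up for illustration.

```python
def pick_article_url(article):
    """Return the first available URL, in the same order the recipe uses."""
    return article.get('feedburner_origlink',
                       article.get('guid', article.get('link')))


# Made-up feed entries, for illustration only.
full_entry = {
    'feedburner_origlink': 'http://example.com/story.html',
    'guid': 'http://example.com/guid/123',
    'link': 'http://example.com/redirect/123',
}
guid_only = {
    'guid': 'http://example.com/guid/123',
    'link': 'http://example.com/redirect/123',
}
link_only = {'link': 'http://example.com/redirect/123'}

for entry in (full_entry, guid_only, link_only):
    print(pick_article_url(entry))
```

If none of the three keys is present, the lookup returns None, so a feed whose entries lack all three would need a different key added to the chain.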
Old 12-27-2009, 09:54 AM   #7
Being
This recipe for The Hartford Courant is a little more complete: it adds national news, sports, and other sections. You can edit it to your liking, for example by adding more columnists. Go to www.courant.com, click RSS at the bottom of the page, and copy the URLs of the RSS feeds you want to add. For example, Politics is included with this line:

('Politics', 'http://feeds.feedburner.com/courant-politics/'),
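
As a rough sketch of that editing step: each entry in the feeds list is just a (title, url) pair, so adding a section means adding one more pair. The helper courant_feed and the FEEDBURNER_BASE constant below are hypothetical, not part of calibre, and they assume the feed follows the feeds.feedburner.com/<name>/ pattern that most of the Courant feeds in this recipe use. A feed hosted elsewhere (the Bob Englehart one, for instance) would have to be typed out in full.

```python
# Hypothetical helper, not part of calibre: builds the (title, url) pairs
# that a recipe's `feeds` list contains.
FEEDBURNER_BASE = 'http://feeds.feedburner.com/'


def courant_feed(title, name):
    """Return a (title, url) pair for a feed that follows the usual pattern."""
    return (title, FEEDBURNER_BASE + name.strip('/') + '/')


feeds = [
    courant_feed('Politics', 'courant-politics'),
    courant_feed('Sports', 'courant-sports'),
]

for title, url in feeds:
    print(title, url)
```

In the recipe itself there is no need for a helper; pasting the finished tuple into the feeds list, as in the Politics line above, is enough.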

Here's the complete recipe:

Code:
from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe

class HartfordCourant(BasicNewsRecipe):

    title       = 'The Hartford Courant'
    __author__  = 'Kovid Goyal and Sujata Raman'
    description = 'Politics, local and business news from Hartford'
    language = 'en'

    use_embedded_content    = False
    no_stylesheets        = True
    remove_javascript = True

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                           ]
    remove_tags_after = [    {'class':['photo_article',]} ]

    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer"]},
                   {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent"]},
                   dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
                    .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .entry-asset.asset.hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
    feeds = [
             ('Breaking News', 'http://feeds.feedburner.com/courant-breaking-news/'),
             ('Nation/World News', 'http://feeds.feedburner.com/courant-nation-world/'),
             ('Connecticut News', 'http://feeds.feedburner.com/courant-connecticut-news/'),
             ('Hartford News', 'http://feeds.feedburner.com/courant-hartford/'),
             ('West Hartford News', 'http://feeds.feedburner.com/courant-west-hartford/'),
             ('Bristol', 'http://feeds.feedburner.com/courant-bristol/'),
             ('Politics', 'http://feeds.feedburner.com/courant-politics/'),
             ('Opinion', 'http://feeds.feedburner.com/courant-opinion/'),
             ('Editorials', 'http://feeds.feedburner.com/courant-editorials/'),
             ('Letters', 'http://feeds.feedburner.com/courant-letters/'),
             ('Bob Englehart', 'http://feeds2.feedburner.com/BobEnglehartEnglehartsView'),
             ('Business', 'http://feeds.feedburner.com/courant-business/'),
             ('Sports', 'http://feeds.feedburner.com/courant-sports/'), 
             ('Features', 'http://feeds.feedburner.com/courant-features/'),
             ('Consumer', 'http://feeds.feedburner.com/courant-consumer/'),
             ('Shopping', 'http://feeds.feedburner.com/courant-shopping/'),
             ('Arts & Theater', 'http://feeds.feedburner.com/courant-stage/'),
             ('Entertainment', 'http://feeds.feedburner.com/courant-entertainment/'),
             ('Music', 'http://feeds.feedburner.com/courant-music/'),
             ('TV', 'http://feeds.feedburner.com/courant-tv/'),
             ('Movies', 'http://feeds.feedburner.com/courant-movies/'),
             #('Metromix headlines', 'http://feeds.feedburner.com/metromix/topheadlines/'),
             #('Metromix events', 'http://feeds.feedburner.com/metromix/events/'),
             #('Metromix restaurants', 'http://feeds.feedburner.com/metromix/restaurants/'),
             ('Outdoors', 'http://feeds.feedburner.com/courant-outdoors/'),
             ('Peter Marteka', 'http://feeds.feedburner.com/courant-marteka-column/'),
             ('Susan Campbell', 'http://feeds.feedburner.com/courant-campbell-column/'),
             ('Helen Ubinas', 'http://feeds.feedburner.com/courant-helen-ubinas-column/'),
             ('Jim Shea', 'http://feeds.feedburner.com/courant-jim-shea-column/'),
             ('Tom Condon', 'http://feeds.feedburner.com/courant-tom-condon-column/'),
             ('Colin McEnroe', 'http://feeds.feedburner.com/courant-colin-mcenroe-column/'),
             ]


    def get_article_url(self, article):
        # Prefer the original article link recorded by FeedBurner, then
        # fall back to the entry's guid and finally to its plain link.
        return article.get('feedburner_origlink', article.get('guid', article.get('link')))


    def postprocess_html(self, soup, first_fetch):
        # Turn table markup into plain divs.
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

        # Drop the comments form and the "other headlines" block.
        for tag in soup.findAll('form', attrs={'name':["comments_form"]}):
            tag.extract()
        for tag in soup.findAll('font', attrs={'id':["cr-other-headlines"]}):
            tag.extract()

        return soup

Last edited by kovidgoyal; 12-27-2009 at 11:57 AM.