Chicago Tribune Recipe not selecting full article

cornfieldcraig · 09-27-2011, 08:38 PM

I've been fiddling with the built-in Chicago Tribune recipe to add a few more RSS feeds. That's working fine; however, I've noticed that for longer articles the recipe sometimes misses substantial portions. The Chicago Tribune uses Feedburner to publish its RSS feeds. The recipe appears to download the article linked by Feedburner; however, the longer articles have links to multiple pages and also provide a Single Page link. Unfortunately, the Single Page link is not consistently present, nor can its presence be predicted. You must download the Feedburner page, analyze it for the Single Page link, then download that alternate page instead. This is beyond my meager understanding of the API to implement myself. Any help would be greatly appreciated.

Of course, I'd love it if the author, Kovid Goyal, can figure out a way to make this enhancement.

a.peter · 09-28-2011, 06:18 AM

Quote:

Originally Posted by cornfieldcraig

Of course, I'd love it if the author, Kovid Goyal, can figure out a way to make this enhancement.

No need for an enhancement; calibre can already do this.

Each recipe provides the variable match_regexps. Each URL that matches one of these regular expressions is followed when the variable recursions is set to a value of 1 or greater.

It is important that the links to be followed aren't removed by any of the remove_tags* settings.
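
As a rough illustration of how such a pattern selects links, here is a small sketch in plain Python. It is not calibre's own code: the helper name is_link_wanted and the example.com URLs are made up, and it assumes a pattern may match anywhere in the URL.

```python
import re

# Stand-ins for the two recipe settings discussed above.
match_regexps = [r'page=[0-9]+']
recursions = 1


def is_link_wanted(url, patterns=match_regexps):
    """Return True if any pattern matches somewhere in the URL (assumed behaviour)."""
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical links found on the first page of a multi-page article.
links = [
    'http://www.example.com/news/story.html?page=2',
    'http://www.example.com/news/story.html?page=3',
    'http://www.example.com/sports/other-story.html',
]

# Links are only followed at all when recursions is 1 or greater.
wanted = [url for url in links if is_link_wanted(url)] if recursions >= 1 else []
print(wanted)
```

Only the two pagination links are selected here; the unrelated story is left alone.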

An updated version of the recipe that will follow links is here:

Spoiler:

Code:

from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe
import re

class ChicagoTribune(BasicNewsRecipe):

    title       = 'Chicago Tribune'
    __author__  = 'Kovid Goyal and Sujata Raman, a.peter'
    description = 'Politics, local and business news from Chicago'
    language    = 'en'
    version     = 2

    use_embedded_content = False
    no_stylesheets       = True
    remove_javascript    = True
    recursions           = 1

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                           ]
    remove_tags_after = [{'class':['photo_article',]}]

    match_regexps = [r'page=[0-9]+']
    
    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer",'gallery-subcontent','subFooter']},
                   {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent",'nextgen-share-tools','outbrainTools', 'google-ad-story-bottom']},
                   dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
                    .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .entry-asset.asset.hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
    feeds = [
             ('Latest news', 'http://feeds.chicagotribune.com/chicagotribune/news/'),
             ('Local news', 'http://feeds.chicagotribune.com/chicagotribune/news/local/'),
             ('Nation/world', 'http://feeds.chicagotribune.com/chicagotribune/news/nationworld/'),
             ('Hot topics', 'http://feeds.chicagotribune.com/chicagotribune/hottopics/'),
             ('Most E-mailed stories', 'http://feeds.chicagotribune.com/chicagotribune/email/'),
             ('Opinion', 'http://feeds.chicagotribune.com/chicagotribune/opinion/'),
             ('Off Topic', 'http://feeds.chicagotribune.com/chicagotribune/offtopic/'),
             #('Politics', 'http://feeds.chicagotribune.com/chicagotribune/politics/'),
             #('Special Reports', 'http://feeds.chicagotribune.com/chicagotribune/special/'),
             #('Religion News', 'http://feeds.chicagotribune.com/chicagotribune/religion/'),
             ('Business news', 'http://feeds.chicagotribune.com/chicagotribune/business/'),
             ('Jobs and Careers', 'http://feeds.chicagotribune.com/chicagotribune/career/'),
             ('Local scene', 'http://feeds.chicagotribune.com/chicagohomes/localscene/'),
             ('Phil Rosenthal', 'http://feeds.chicagotribune.com/chicagotribune/rosenthal/'),
             #('Tech Buzz', 'http://feeds.chicagotribune.com/chicagotribune/techbuzz/'),
             ('Your Money', 'http://feeds.chicagotribune.com/chicagotribune/yourmoney/'),
             ('Jon Hilkevitch - Getting around', 'http://feeds.chicagotribune.com/chicagotribune/gettingaround/'),
             ('Jon Yates - What\'s your problem?', 'http://feeds.chicagotribune.com/chicagotribune/problem/'),
             ('Garrison Keillor', 'http://feeds.chicagotribune.com/chicagotribune/keillor/'),
             ('Marks Jarvis - On Money', 'http://feeds.chicagotribune.com/chicagotribune/marksjarvisonmoney/'),
             ('Sports', 'http://feeds.chicagotribune.com/chicagotribune/sports/'),
             ('Arts and Architecture', 'http://feeds.chicagotribune.com/chicagotribune/arts/'),
             ('Books', 'http://feeds.chicagotribune.com/chicagotribune/books/'),
             #('Magazine', 'http://feeds.chicagotribune.com/chicagotribune/magazine/'),
             ('Movies', 'http://feeds.chicagotribune.com/chicagotribune/movies/'),
             ('Music', 'http://feeds.chicagotribune.com/chicagotribune/music/'),
             ('TV', 'http://feeds.chicagotribune.com/chicagotribune/tv/'),
             ('Hypertext', 'http://feeds.chicagotribune.com/chicagotribune/hypertext/'),
             ('iPhone Blog', 'http://feeds.feedburner.com/redeye/iphoneblog'),
             ('Julie\'s Health Club', 'http://feeds.chicagotribune.com/chicagotribune_julieshealthclub/'),
             ]


    def get_article_url(self, article):
        # Prefer the original link recorded by Feedburner, then the guid,
        # and fall back to the plain link.
        url = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        print(url)  # debug output: shows which URL will be downloaded
        return url
    
    def postprocess_html(self, soup, first_fetch):
        # Remove the navigation bar. It was kept until now to be able to follow
        # the links to further pages. But now we don't need them anymore.
        for nav in soup.findAll(attrs={'class':['toppaginate','article-nav clearfix']}):
            nav.extract()
       
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

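        # Pass attrs as a keyword argument; wrapping it in dict(attrs=...)
        # would make findAll look for an attribute literally named "attrs".
        for tag in soup.findAll('form', attrs={'name':["comments_form"]}):
            tag.extract()
        for tag in soup.findAll('font', attrs={'id':["cr-other-headlines"]}):
            tag.extract()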

        return soup

cornfieldcraig · 09-28-2011, 11:44 PM

Thanks much for the quick response. Works like a charm. For kicks, I used this bit of code instead and it seemed to yield virtually identical results:

match_regexps = [r'full\.column']

a.peter · 09-29-2011, 02:31 AM

Quote:

Originally Posted by cornfieldcraig

Thanks much for the quick response. Works like a charm. For kicks, I used this bit of code instead and it seemed to yield virtually identical results

In fact, the results are not quite the same. Your version appends the full version of the article to its first page, so the beginning appears twice in the ebook.

An example from today's issue is the article here.

If you want to prevent an article from being broken into several chapters, you will have to implement the get_article_url method. You will have to read the page into a Soup, check whether it has a "single page" link (e.g. with your regex) and return the link to the complete page.
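
The link-picking step described above might look roughly like the sketch below. It is only an illustration, not the recipe's actual code: the helper name find_single_page_url, the sample HTML and the example.com URLs are hypothetical, and it assumes the "single page" link can be recognised by full.column in its href, as in the regex earlier in the thread. It uses only the Python 3 standard library so it can be tried outside calibre.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the "Single Page" link's href contains "full.column".
SINGLE_PAGE_RE = re.compile(r'full\.column')


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def find_single_page_url(html, article_url):
    """Return the absolute single-page URL if the page links to one,
    otherwise return article_url unchanged."""
    collector = LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if SINGLE_PAGE_RE.search(href):
            return urljoin(article_url, href)
    return article_url


# Hypothetical first page of a long article with pagination links.
long_page = ('<a href="/news/long-story.story?page=2">2</a> '
             '<a href="/news/long-story,full.column">Single page</a>')
short_page = '<a href="/news/other.story">Related story</a>'
first_url = 'http://www.example.com/news/long-story.story'

print(find_single_page_url(long_page, first_url))
print(find_single_page_url(short_page, first_url))
```

Inside a recipe, get_article_url would first work out the article URL as before, download that page (the recipe's index_to_soup helper could replace the hand-written parser), and return whatever a helper like this selects. With that in place, match_regexps and recursions should no longer be needed for these articles, since the complete page is fetched directly.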
