09-27-2011, 08:38 PM   #1
cornfieldcraig
Chicago Tribune Recipe not selecting full article

I've been fiddling with the built-in Chicago Tribune recipe to add a few more RSS feeds. That's working fine, but I've noticed that for longer articles the recipe sometimes misses substantial portions. The Chicago Tribune uses Feedburner to publish its RSS feeds, and the recipe appears to download the article page that Feedburner links to. Longer articles, however, are split across multiple pages and also provide a Single Page link. Unfortunately, the Single Page link is not consistently present, and its presence can't be predicted. You have to download the page linked by Feedburner, check it for the Single Page link, and then download that alternate page instead. Implementing this is beyond my meager understanding of the API. Any help would be greatly appreciated.

Of course, I'd love it if the author, Kovid Goyal, can figure out a way to make this enhancement.
09-28-2011, 06:18 AM   #2
a.peter
Quote:
Originally Posted by cornfieldcraig
Of course, I'd love it if the author, Kovid Goyal, can figure out a way to make this enhancement.
No enhancement is needed; calibre already supports this.

Each recipe provides the variable match_regexps. Every URL that matches one of these regular expressions is followed when the variable recursions is set to 1 or greater.

It is important that the links to be followed aren't removed by any of the remove_tags* settings.
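
As a rough illustration of the matching step (not calibre's actual implementation), the sketch below applies a match_regexps-style pattern list to a few links. The URLs and the helper name links_to_follow are made up for the example, and it assumes a pattern may match anywhere in the URL.

```python
import re

# Same pattern as in the recipe: follow links that carry a page number.
match_regexps = [r'page=[0-9]+']


def links_to_follow(links, patterns):
    """Return the links that match at least one pattern, searched anywhere in the URL."""
    compiled = [re.compile(p) for p in patterns]
    return [url for url in links if any(c.search(url) for c in compiled)]


# Hypothetical links found on the first page of an article.
links = [
    'http://www.example.com/news/story.html?page=2',
    'http://www.example.com/news/story.html?page=3',
    'http://www.example.com/sports/other-story.html',
]

followed = links_to_follow(links, match_regexps)
```

Only the two paginated links end up in followed; the unrelated story is ignored. With recursions left at 0, no links would be followed regardless of the patterns.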

An updated version of the recipe that will follow links is here:

Spoiler:
Code:
from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe
import re

class ChicagoTribune(BasicNewsRecipe):

    title       = 'Chicago Tribune'
    __author__  = 'Kovid Goyal and Sujata Raman, a.peter'
    description = 'Politics, local and business news from Chicago'
    language    = 'en'
    version     = 2

    use_embedded_content = False
    no_stylesheets       = True
    remove_javascript    = True
    recursions           = 1

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                           ]
    remove_tags_after = [{'class':['photo_article',]}]

    match_regexps = [r'page=[0-9]+']
    
    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer",'gallery-subcontent','subFooter']},
                   {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent",'nextgen-share-tools','outbrainTools', 'google-ad-story-bottom']},
                   dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
                    .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .entry-asset.asset.hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
    feeds = [
             ('Latest news', 'http://feeds.chicagotribune.com/chicagotribune/news/'),
             ('Local news', 'http://feeds.chicagotribune.com/chicagotribune/news/local/'),
             ('Nation/world', 'http://feeds.chicagotribune.com/chicagotribune/news/nationworld/'),
             ('Hot topics', 'http://feeds.chicagotribune.com/chicagotribune/hottopics/'),
             ('Most E-mailed stories', 'http://feeds.chicagotribune.com/chicagotribune/email/'),
             ('Opinion', 'http://feeds.chicagotribune.com/chicagotribune/opinion/'),
             ('Off Topic', 'http://feeds.chicagotribune.com/chicagotribune/offtopic/'),
             #('Politics', 'http://feeds.chicagotribune.com/chicagotribune/politics/'),
             #('Special Reports', 'http://feeds.chicagotribune.com/chicagotribune/special/'),
             #('Religion News', 'http://feeds.chicagotribune.com/chicagotribune/religion/'),
             ('Business news', 'http://feeds.chicagotribune.com/chicagotribune/business/'),
             ('Jobs and Careers', 'http://feeds.chicagotribune.com/chicagotribune/career/'),
             ('Local scene', 'http://feeds.chicagotribune.com/chicagohomes/localscene/'),
             ('Phil Rosenthal', 'http://feeds.chicagotribune.com/chicagotribune/rosenthal/'),
             #('Tech Buzz', 'http://feeds.chicagotribune.com/chicagotribune/techbuzz/'),
             ('Your Money', 'http://feeds.chicagotribune.com/chicagotribune/yourmoney/'),
             ('Jon Hilkevitch - Getting around', 'http://feeds.chicagotribune.com/chicagotribune/gettingaround/'),
             ('Jon Yates - What\'s your problem?', 'http://feeds.chicagotribune.com/chicagotribune/problem/'),
             ('Garrison Keillor', 'http://feeds.chicagotribune.com/chicagotribune/keillor/'),
             ('Marks Jarvis - On Money', 'http://feeds.chicagotribune.com/chicagotribune/marksjarvisonmoney/'),
             ('Sports', 'http://feeds.chicagotribune.com/chicagotribune/sports/'),
             ('Arts and Architecture', 'http://feeds.chicagotribune.com/chicagotribune/arts/'),
             ('Books', 'http://feeds.chicagotribune.com/chicagotribune/books/'),
             #('Magazine', 'http://feeds.chicagotribune.com/chicagotribune/magazine/'),
             ('Movies', 'http://feeds.chicagotribune.com/chicagotribune/movies/'),
             ('Music', 'http://feeds.chicagotribune.com/chicagotribune/music/'),
             ('TV', 'http://feeds.chicagotribune.com/chicagotribune/tv/'),
             ('Hypertext', 'http://feeds.chicagotribune.com/chicagotribune/hypertext/'),
             ('iPhone Blog', 'http://feeds.feedburner.com/redeye/iphoneblog'),
             ('Julie\'s Health Club', 'http://feeds.chicagotribune.com/chicagotribune_julieshealthclub/'),
             ]


    def get_article_url(self, article):
        # Prefer the original link wrapped by Feedburner, then fall back to guid and link.
        url = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        print(url)  # debug output; the call form also works on Python 3
        return url
    
    def postprocess_html(self, soup, first_fetch):
        # Remove the navigation bar. It was kept until now to be able to follow
        # the links to further pages. But now we don't need them anymore.
        for nav in soup.findAll(attrs={'class':['toppaginate','article-nav clearfix']}):
            nav.extract()
       
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

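        # Pass the attribute filter via attrs= directly; wrapping it in dict(attrs=...)
        # as the second argument would look for an attribute literally named "attrs".
        for tag in soup.findAll('form', attrs={'name':["comments_form"]}):
            tag.extract()
        for tag in soup.findAll('font', attrs={'id':["cr-other-headlines"]}):
            tag.extract()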

        return soup
09-28-2011, 11:44 PM   #3
cornfieldcraig
Thanks much for the quick response. Works like a charm. For kicks, I used this bit of code instead and it seemed to yield virtually identical results:

match_regexps = [r'full\.column']
09-29-2011, 02:31 AM   #4
a.peter
Quote:
Originally Posted by cornfieldcraig
Thanks much for the quick response. Works like a charm. For kicks, I used this bit of code instead and it seemed to yield virtually identical results
In fact, the results are not really the same. Your version appends the full version of the article to its first page, so the beginning of the article appears twice in the e-book.

An example from today's issue is the article here.
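
The difference between the two patterns can be pictured with a toy model. The three-part article below is invented, and the lists only mimic what ends up in the e-book; this is not how calibre stores the pages.

```python
# Hypothetical three-page article; each entry stands for the text of one page.
pages = ['part 1', 'part 2', 'part 3']
first_page = pages[0]
full_article = ' '.join(pages)  # what a "single page" version would contain

# match_regexps = [r'page=[0-9]+']: the first page plus each further page.
paged_result = [first_page] + pages[1:]

# match_regexps = [r'full\.column']: the first page plus the complete article,
# so the opening section is included a second time.
single_result = [first_page, full_article]

paged_copies = ' '.join(paged_result).count('part 1')
single_copies = ' '.join(single_result).count('part 1')
```

Here paged_copies counts the opening section once, while single_copies counts it twice, which is the duplication described above.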

If you want to prevent an article from being broken into several chapters, you will have to implement the get_article_url method. You will have to read the page into a soup, check whether it has a "single page" link (e.g. with your regex), and return the link to the complete page.
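
A minimal sketch of that check, written as standalone Python so it can be tried outside calibre. It uses the standard library's HTMLParser rather than a soup, and the helper names (pick_article_url, _LinkCollector), the sample markup, and the URLs are all hypothetical. The full\.column pattern is the one tried earlier in this thread and may not cover every article.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed marker of the "Single Page" link, taken from the regex tried in this thread.
SINGLE_PAGE_RE = re.compile(r'full\.column')


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def pick_article_url(page_url, html):
    """Return the single-page URL if the page links to one, else the original URL."""
    collector = _LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if SINGLE_PAGE_RE.search(href):
            # Resolve relative links against the page they were found on.
            return urljoin(page_url, href)
    return page_url


# Hypothetical markup: a paginated article and a short one without pagination.
long_html = ('<a href="/news/story.html?page=2">2</a> '
             '<a href="/news/story,0,123.full.column">Single page</a>')
short_html = '<p>Short article, no pagination.</p>'

long_url = pick_article_url('http://www.example.com/news/story.html', long_html)
short_url = pick_article_url('http://www.example.com/news/story.html', short_html)
```

In a recipe, get_article_url could resolve the feed link as before, fetch that page (BasicNewsRecipe offers index_to_soup for this), apply the same test to the anchors in the soup, and return the chosen URL. If that works for the site, recursions and match_regexps should no longer be needed for pagination.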