View Single Post
Old 03-01-2012, 12:42 AM   #7
cornfieldcraig
Member
cornfieldcraig began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Sep 2011
Location: Chicago, Illinois, USA
Device: Nook Simple Touch
That's really strange. Just to eliminate as many variables as possible, I took the following steps to implement your suggested change:
  • Deleted my old customized Chicago Tribune recipe, which included a fix you provided a few weeks ago.
  • Upgraded Calibre to 8.41
  • Created a new custom recipe from the built-in Chicago Tribune recipe
  • Replaced the exising get_article_url function with your new code
  • Eliminated most of the feeds, just to speed up testing
The result is that it downloads a list of articles, complete with thumbnail descriptions in the table of contents, but the articles themselves contain only the link to the article and no text. I've attached the resulting ePub file. Thanks again for your assistance.

Here's the recipe I am using:

Code:
from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe

class ChicagoTribune(BasicNewsRecipe):

    title       = 'Chicago Tribune'
    __author__  = 'Kovid Goyal and Sujata Raman, a.peter'
    description = 'Politics, local and business news from Chicago'
    language    = 'en'
    version     = 2

    use_embedded_content = False
    no_stylesheets       = True
    remove_javascript    = True
    recursions           = 1

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                           ]
    remove_tags_after = [{'class':['photo_article',]}]

    match_regexps = [r'page=[0-9]+']

    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer",'gallery-subcontent','subFooter']},
                   {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent",'nextgen-share-tools','outbrainTools', 'google-ad-story-bottom']},
                   dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
                    .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .entry-asset asset hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
    feeds = [
             ('Latest news', 'http://feeds.chicagotribune.com/chicagotribune/news/'),
             ('Julie\'s Health Club', 'http://feeds.chicagotribune.com/chicagotribune_julieshealthclub/'),
             ]


#    def get_article_url(self, article):
#       url = article.get('feedburner_origlink', article.get('guid', article.get('link')))
#      if url.endswith('?track=rss'):
#         url = url.partition('?')[0]
#    return url

    def get_article_url(self, article):
        ans = None
        try:
            s = article.summary
            ans = urllib.unquote(
                re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
        except:
            pass
        if ans is None:
            ans = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        if ans is not None:
            return ans.replace('?track=rss', '')

    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)

    def postprocess_html(self, soup, first_fetch):
        # Remove the navigation bar. It was kept until now to be able to follow
        # the links to further pages. But now we don't need them anymore.
        for nav in soup.findAll(attrs={'class':['toppaginate','article-nav clearfix']}):
            nav.extract()

        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

        for tag in soup.findAll('form', dict(attrs={'name':["comments_form"]})):
            tag.extract()
        for tag in soup.findAll('font', dict(attrs={'id':["cr-other-headlines"]})):
            tag.extract()

        return soup
Attached Files
File Type: epub Chicago Tribune [Wed, 29 Feb 2012] - calibre.epub (117.8 KB, 149 views)
cornfieldcraig is offline   Reply With Quote