Go Back   MobileRead Forums > E-Book Software > Calibre > Recipes

Old 02-28-2012, 09:12 AM   #1
cornfieldcraig
Member
Posts: 12
Karma: 10
Join Date: Sep 2011
Location: Chicago, Illinois, USA
Device: Nook Simple Touch
More Chicago Tribune antics - recipe broken again



Well, it looks like the Chicago Tribune is trying harder to prevent folks from downloading their RSS content in batch mode. They've implemented a new countdown timer, which pops up a black box that counts down from 18 seconds to zero and then displays the article. There's no text in the box, just the numbers, but if you close the box by clicking the pseudo-X in the upper right-hand corner, it goes to the article immediately. The result is that no articles actually download, only the article titles.

Here's a sample URL: http://feedproxy.google.com/~r/chica...Bg/story01.htm

Here's the bit of code that seems to be causing the problem:


<script type="text/javascript">
    $(document).ready(function(){
        doCountdown(18000/1000);
        setTimeout( 'location.href = \'http://www.chicagotribune.com/sports/hockey/blackhawks/ct-spt-0228-blackhawks-trade-chicago--20120228,0,3141821.story?track=rss\'',18000);
    });

    function doCountdown(countdownTime) {
        countdownRemaining = countdownTime - 1;
        if(countdownRemaining > 0) {
            $("#timeCountdown").text(countdownRemaining);
            setTimeout("doCountdown(countdownRemaining);", 1000);
        }
    };
</script>
Old 02-28-2012, 09:31 AM   #2
kovidgoyal
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Easily fixed. Just change get_article_url to
Code:
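    # Needs "import re" and "import urllib" at the top of the recipe.
    # urllib.unquote is the Python 2 name; Python 3 uses urllib.parse.unquote.
    def get_article_url(self, article):
        ans = None
        try:
            # Look in the feed summary for a bookmark.cfm href and take the
            # URL-encoded value of its link= parameter.
            s = article.summary
            ans = urllib.unquote(
                re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
        except Exception:
            # No summary or no matching href: leave ans as None.
            pass
        if ans is not None:
            return ans.replace('?track=rss', '')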
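
For readers following along, here is a minimal sketch of what the regular expression in this fix pulls out of a feed summary. It is written for Python 3, so it uses urllib.parse.unquote rather than the Python 2 urllib.unquote in the post. The helper name extract_bookmark_link and the sample summary string are made up for illustration; the real feed markup may differ.

```python
import re
from urllib.parse import unquote  # Python 3 home of Python 2's urllib.unquote

# Same pattern as in the post above.
BOOKMARK_LINK = re.compile(r'href=".+?bookmark.cfm.+?link=(.+?)"')


def extract_bookmark_link(summary):
    """Return the decoded link= target of a bookmark.cfm href, or None."""
    match = BOOKMARK_LINK.search(summary)
    if match is None:
        return None
    # Decode the percent-escapes, then drop the tracking suffix.
    return unquote(match.group(1)).replace('?track=rss', '')


# Hypothetical summary fragment, for illustration only.
sample_summary = (
    '<a href="http://example.com/share/bookmark.cfm?title=Story'
    '&link=http%3A%2F%2Fwww.example.com%2Fnews%2Fstory.html%3Ftrack%3Drss">'
    'Share</a>'
)

print(extract_bookmark_link(sample_summary))
print(extract_bookmark_link('<p>summary without a share link</p>'))
```

The second call prints None, which is the case the recipe code treats as "no URL found".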
Old 02-28-2012, 08:23 PM   #3
cornfieldcraig
Thanks Kovid.

get_article_url had already been redefined in that recipe to:

    def get_article_url(self, article):
        print article.get('feedburner_origlink', article.get('guid', article.get('link')))
        return article.get('feedburner_origlink', article.get('guid', article.get('link')))

Simply replacing it with the new code doesn't seem to work, unfortunately. I suspect that the two solutions need to be merged somehow.
Old 02-28-2012, 10:21 PM   #4
kovidgoyal
Code:
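    # Needs "import re" and "import urllib" at the top of the recipe.
    # urllib.unquote is the Python 2 name; Python 3 uses urllib.parse.unquote.
    def get_article_url(self, article):
        ans = None
        try:
            # First choice: the URL-encoded link= parameter of the
            # bookmark.cfm href in the feed summary.
            s = article.summary
            ans = urllib.unquote(
                re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
        except Exception:
            pass
        if ans is None:
            # Fallback: the lookup the recipe used before this change.
            ans = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        if ans is not None:
            return ans.replace('?track=rss', '')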
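
The order in which this merged version picks a URL can be sketched on its own. This is only an illustration under assumptions: plain dicts stand in for calibre's feed entries, and pick_article_url and its bookmark_url argument are hypothetical names, not part of the recipe.

```python
def pick_article_url(article, bookmark_url=None):
    """Mirror the fallback order of the merged get_article_url above.

    bookmark_url stands for the result of the summary regex (None if the
    regex found nothing); article is any dict-like feed entry.
    """
    url = bookmark_url
    if url is None:
        # Nested get() calls: feedburner_origlink wins, then guid, then link.
        url = article.get('feedburner_origlink',
                          article.get('guid', article.get('link')))
    if url is not None:
        return url.replace('?track=rss', '')
    return None


# Hypothetical feed entries, for illustration only.
print(pick_article_url({'link': 'http://example.com/a.story?track=rss'}))
print(pick_article_url({'guid': 'http://example.com/g', 'link': 'http://example.com/l'}))
print(pick_article_url({}, bookmark_url='http://example.com/b?track=rss'))
print(pick_article_url({}))
```

The last call prints None: with nothing in the summary and none of the three keys present, there is no URL to return.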
Old 02-29-2012, 08:55 AM   #5
cornfieldcraig
Thanks again Kovid, but still no go. With this last change, we're back to at least retrieving the titles, but still no actual content. It doesn't seem to be getting past the countdown timer.

Here's one of the links it attempted to retrieve today: http://chicagotribune.feedsportal.co...ss/story01.htm
Old 02-29-2012, 09:27 AM   #6
kovidgoyal
I just tried it with that and got a full download.
Old 03-01-2012, 12:42 AM   #7
cornfieldcraig
That's really strange. Just to eliminate as many variables as possible, I took the following steps to implement your suggested change:
  • Deleted my old customized Chicago Tribune recipe, which included a fix you provided a few weeks ago.
  • Upgraded calibre to 0.8.41
  • Created a new custom recipe from the built-in Chicago Tribune recipe
  • Replaced the existing get_article_url function with your new code
  • Eliminated most of the feeds, just to speed up testing
The result is that it downloads a list of articles, complete with thumbnail descriptions in the table of contents, but the articles themselves contain only the link to the article and no text. I've attached the resulting ePub file. Thanks again for your assistance.

Here's the recipe I am using:

Code:
from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

import re      # used by get_article_url below
import urllib  # Python 2 module; provides urllib.unquote

from calibre.web.feeds.news import BasicNewsRecipe

class ChicagoTribune(BasicNewsRecipe):

    title       = 'Chicago Tribune'
    __author__  = 'Kovid Goyal and Sujata Raman, a.peter'
    description = 'Politics, local and business news from Chicago'
    language    = 'en'
    version     = 2

    use_embedded_content = False
    no_stylesheets       = True
    remove_javascript    = True
    recursions           = 1

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                           ]
    remove_tags_after = [{'class':['photo_article',]}]

    match_regexps = [r'page=[0-9]+']

    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer",'gallery-subcontent','subFooter']},
                   {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent",'nextgen-share-tools','outbrainTools', 'google-ad-story-bottom']},
                   dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
                    .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .entry-asset asset hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
    feeds = [
             ('Latest news', 'http://feeds.chicagotribune.com/chicagotribune/news/'),
             ('Julie\'s Health Club', 'http://feeds.chicagotribune.com/chicagotribune_julieshealthclub/'),
             ]


#    def get_article_url(self, article):
#        url = article.get('feedburner_origlink', article.get('guid', article.get('link')))
#        if url.endswith('?track=rss'):
#            url = url.partition('?')[0]
#        return url

    def get_article_url(self, article):
        ans = None
        try:
            s = article.summary
            ans = urllib.unquote(
                re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
        except:
            pass
        if ans is None:
            ans = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        if ans is not None:
            return ans.replace('?track=rss', '')

    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)

    def postprocess_html(self, soup, first_fetch):
        # Remove the navigation bar. It was kept until now to be able to follow
        # the links to further pages. But now we don't need them anymore.
        for nav in soup.findAll(attrs={'class':['toppaginate','article-nav clearfix']}):
            nav.extract()

        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

        for tag in soup.findAll('form', dict(attrs={'name':["comments_form"]})):
            tag.extract()
        for tag in soup.findAll('font', dict(attrs={'id':["cr-other-headlines"]})):
            tag.extract()

        return soup
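
One thing that may be worth ruling out with a custom recipe like this: the try/except in get_article_url swallows every error, including a NameError if re or urllib has not been imported at the top of the recipe. Whether that is what happened here is not confirmed anywhere in this thread, so treat it as an assumption. The sketch below uses hypothetical helper names and a made-up summary string, and shows how a silent version and a reporting version behave when urllib.unquote cannot be resolved. On Python 3 the call fails even with urllib imported, because unquote lives in urllib.parse.

```python
import re


def quiet_extract(summary):
    """Same shape as the recipe code: every error is swallowed."""
    try:
        # urllib is deliberately not imported in this sketch.
        return urllib.unquote(re.search(r'link=(.+?)"', summary).group(1))
    except Exception:
        return None


def explain_extract(summary):
    """Debugging variant: report the exception instead of hiding it."""
    try:
        return urllib.unquote(re.search(r'link=(.+?)"', summary).group(1))
    except Exception as err:
        return '%s: %s' % (type(err).__name__, err)


# Hypothetical summary fragment, for illustration only.
sample = ('<a href="http://example.com/bookmark.cfm?x=1'
          '&link=http%3A%2F%2Fexample.com%2Fstory">')

print(quiet_extract(sample))    # None, although the pattern matches
print(explain_extract(sample))  # names the exception that was hidden
```

Temporarily printing or logging the exception in the recipe, as in explain_extract, is a quick way to tell "the summary has no matching link" apart from "the code never got as far as the regex".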
Attached Files
File Type: epub Chicago Tribune [Wed, 29 Feb 2012] - calibre.epub (117.8 KB, 149 views)
Old 03-01-2012, 12:44 AM   #8
kovidgoyal
Use the builtin recipe. I've already updated it.
Old 03-01-2012, 09:38 PM   #9
cornfieldcraig
I can't explain it, but the built-in recipe is acting the same. It downloads article titles and descriptions, but nothing else. I get the same result on two different computers, one running XP and the other Vista.
Old 03-04-2012, 01:28 AM   #10
kovidgoyal
Hmm well I cannot replicate it, so I cannot fix the recipe for it.
Old 03-05-2012, 08:50 PM   #11
cornfieldcraig
Thanks anyway, Kovid. The only reason I can think of that you're not seeing the problem is that it has something to do with a cookie, that is, if calibre's browser supports them. Once the first countdown expires in the browser, subsequent pages do not have the countdown timer.
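
The cookie idea could be checked outside calibre. Here is a rough sketch, assuming the cookie (if there is one) is set by the server in its HTTP response rather than by the page's JavaScript: fetch the same article URL twice through one cookie jar and see whether the countdown script is still present in the second response. The helper names are hypothetical, and nothing here has been run against the Tribune's servers.

```python
import http.cookiejar
import urllib.request


def has_countdown(html):
    """True if the page source still contains the countdown script call."""
    return 'doCountdown(' in html


def fetch_twice(url):
    """Fetch url twice with one shared cookie jar.

    Returns a list of two booleans (countdown present on each fetch) and
    the names of any cookies the server set along the way.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    results = []
    for _ in range(2):
        with opener.open(url, timeout=30) as response:
            html = response.read().decode('utf-8', errors='replace')
        results.append(has_countdown(html))
    return results, [cookie.name for cookie in jar]


# Offline check of the detector on two made-up page fragments.
print(has_countdown('<script>doCountdown(18000/1000);</script>'))
print(has_countdown('<div class="story">Article text</div>'))
```

If fetch_twice on one of the feed links came back with [True, False] and a non-empty cookie list, that would support the cookie explanation; [True, True] would point elsewhere.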

I think if there were a way to go to the URL associated with location.href (if present in the page source), it would work. The current skip_ad_pages function, which you added a few months ago to skip a very similar countdown timer, looks like it would be close to fixing the problem. What has changed is that the JavaScript previously displayed the text "click here to continue to article", whereas now it displays just the numbers (18 to 0) in the countdown, so the text it displays is no longer unique enough to search on.

Current skip_ad_pages function:

Code:
    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)
New JavaScript countdown code:

Code:
<script type="text/javascript">
    $(document).ready(function(){
        doCountdown(18000/1000);
        setTimeout( 'location.href = \'http://www.chicagotribune.com/news/chi-shootings-in-washington-park-englewood-leave-1-dead-1-wounded-20120302,0,4578641.story?track=rss\'',18000);
    });

    function doCountdown(countdownTime) {
        countdownRemaining = countdownTime - 1;
        if(countdownRemaining > 0) {
            $("#timeCountdown").text(countdownRemaining);
            setTimeout("doCountdown(countdownRemaining);", 1000);
        }
    };
</script>
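
The idea of following the location.href target could look something like the sketch below. It is untested against calibre and makes several assumptions: that skip_ad_pages sees the page before its scripts are stripped, that str(soup) still contains the script text, and that the redirect always has the setTimeout shape shown above. find_redirect_url and CountdownSkipMixin are hypothetical names; skip_ad_pages and index_to_soup(url, raw=True) are taken from the existing recipe.

```python
import re

# Matches the target of: setTimeout('location.href = \'URL\'', 18000);
# The backslashes before the quotes are optional in the pattern.
REDIRECT = re.compile(r"location\.href\s*=\s*\\?'(?P<url>[^'\\]+)\\?'")


def find_redirect_url(html):
    """Return the URL assigned to location.href in the source, or None."""
    match = REDIRECT.search(html)
    return match.group('url') if match else None


class CountdownSkipMixin(object):
    """Hypothetical holder for the method; in a real recipe it would sit on
    the BasicNewsRecipe subclass, which supplies index_to_soup."""

    def skip_ad_pages(self, soup):
        url = find_redirect_url(str(soup))
        if url:
            # Same call the current skip_ad_pages uses to fetch the article.
            return self.index_to_soup(url, raw=True)


# Countdown snippet copied from the post above.
sample_page = r"""
<script type="text/javascript">
$(document).ready(function(){
    doCountdown(18000/1000);
    setTimeout( 'location.href = \'http://www.chicagotribune.com/news/chi-shootings-in-washington-park-englewood-leave-1-dead-1-wounded-20120302,0,4578641.story?track=rss\'',18000);
});
</script>
"""

print(find_redirect_url(sample_page))
print(find_redirect_url('<div class="story">Article text</div>'))
```

On a page without the countdown script, find_redirect_url returns None and skip_ad_pages returns nothing, so ordinary article pages would be left alone.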
All times are GMT -4.

