More Chicago Tribune antics - recipe broken again

cornfieldcraig · 02-28-2012, 09:12 AM

Well, it looks like the Chicago Tribune is trying harder to prevent folks from downloading their RSS content in batch mode. They've implemented a new countdown timer that pops up a black box, counts down from 18 seconds to zero, and then displays the article. There's no text in the box, just the numbers, but if you close the box by clicking the pseudo-X in the upper right-hand corner, it goes to the article immediately. The result is that no articles actually download -- just the article titles.

Here's a sample URL: http://feedproxy.google.com/~r/chica...Bg/story01.htm

Here's the bit of code that seems to be causing the problem:

<script type="text/javascript">
$(document).ready(function(){
    doCountdown(18000/1000);
    setTimeout( 'location.href = \'http://www.chicagotribune.com/sports/hockey/blackhawks/ct-spt-0228-blackhawks-trade-chicago--20120228,0,3141821.story?track=rss\'',18000);
});

function doCountdown(countdownTime) {
    countdownRemaining = countdownTime - 1;
    if(countdownRemaining > 0) {
        $("#timeCountdown").text(countdownRemaining);
        setTimeout("doCountdown(countdownRemaining);", 1000);
    }
};
</script>

kovidgoyal · 02-28-2012, 09:31 AM

Easily fixed. Just change get_article_url to

Code:

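    # Needs "import re, urllib" at the top of the recipe; without it the
    # NameError is swallowed below and this silently returns None.
    def get_article_url(self, article):
        ans = None
        try:
            # The real article URL is URL-encoded inside a bookmark link in the summary
            s = article.summary
            ans = urllib.unquote(
                re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
        except Exception:
            pass
        if ans is not None:
            return ans.replace('?track=rss', '')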
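
To see what that regular expression is meant to pull out, here is a small standalone sketch. The summary HTML is made up (the real feed markup isn't shown in this thread), and extract_bookmark_link is a hypothetical helper name, not part of calibre. The import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
import re

try:
    from urllib import unquote          # Python 2
except ImportError:
    from urllib.parse import unquote    # Python 3

# Hypothetical summary: assumes the feed embeds a "bookmark.cfm" share link
# whose "link" parameter carries the URL-encoded article address.
SAMPLE_SUMMARY = (
    '<p>Blackhawks make a trade.</p>'
    '<a href="http://www.example.com/bookmark.cfm?title=x&link='
    'http%3A%2F%2Fwww.example.com%2Fsports%2Fstory.html%3Ftrack%3Drss">Share</a>'
)


def extract_bookmark_link(summary):
    """Return the decoded article URL from the summary, or None if absent."""
    match = re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', summary)
    if match is None:
        return None
    # Decode the percent-escapes, then drop the tracking suffix.
    return unquote(match.group(1)).replace('?track=rss', '')


print(extract_bookmark_link(SAMPLE_SUMMARY))
```

If the summary has no such link, the helper returns None, which is the case the recipe code has to handle separately.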

cornfieldcraig · 02-28-2012, 08:23 PM

Thanks Kovid.

get_article_url already had been redefined in that recipe to:

def get_article_url(self, article):
    print article.get('feedburner_origlink', article.get('guid', article.get('link')))
    return article.get('feedburner_origlink', article.get('guid', article.get('link')))

Simply replacing it with the new code doesn't seem to work, unfortunately. I suspect that the two solutions need to be merged somehow.

kovidgoyal · 02-28-2012, 10:21 PM

Code:

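    # Needs "import re, urllib" at the top of the recipe; without it the
    # NameError is swallowed below and only the feed link fallback is used.
    def get_article_url(self, article):
        ans = None
        try:
            # First choice: the URL-encoded article link inside the summary
            s = article.summary
            ans = urllib.unquote(
                re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
        except Exception:
            pass
        if ans is None:
            # Fall back to whatever link the feed entry itself provides
            ans = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        if ans is not None:
            return ans.replace('?track=rss', '')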
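
The fallback branch relies on nested dict.get defaults, which can be hard to read at a glance. The sketch below shows the order in which the fields are tried. Plain dicts stand in for the feed entries (an assumption; the real objects come from the feed parser), the URLs are invented, and pick_feed_url is a hypothetical name.

```python
def pick_feed_url(article):
    """Return the first available URL field, minus the tracking suffix."""
    # Same precedence as the recipe's fallback: feedburner_origlink, guid, link.
    url = article.get('feedburner_origlink',
                      article.get('guid', article.get('link')))
    if url is None:
        return None
    return url.replace('?track=rss', '')


# Hypothetical entries, for illustration only.
with_origlink = {
    'feedburner_origlink': 'http://www.example.com/a.story?track=rss',
    'guid': 'http://www.example.com/guid',
    'link': 'http://feeds.example.com/~r/a',
}
link_only = {'link': 'http://feeds.example.com/~r/b'}

print(pick_feed_url(with_origlink))
print(pick_feed_url(link_only))
print(pick_feed_url({}))
```

Note that when only the feed's own link is present, the result is still the proxy address, so the countdown page would be fetched rather than the article.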

cornfieldcraig · 02-29-2012, 08:55 AM

Thanks again Kovid, but still no go. With this last change, we're back to at least retrieving the titles, but still no actual content. It doesn't seem to be getting past the countdown timer.

Here's one of the links it attempted to retrieve today: http://chicagotribune.feedsportal.co...ss/story01.htm

kovidgoyal · 02-29-2012, 09:27 AM

I just tried it with that and got a full download.

cornfieldcraig · 03-01-2012, 12:42 AM

That's really strange. Just to eliminate as many variables as possible, I took the following steps to implement your suggested change:

Deleted my old customized Chicago Tribune recipe, which included a fix you provided a few weeks ago.
Upgraded Calibre to 0.8.41
Created a new custom recipe from the built-in Chicago Tribune recipe
Replaced the existing get_article_url function with your new code
Eliminated most of the feeds, just to speed up testing

The result is that it downloads a list of articles, complete with thumbnail descriptions in the table of contents, but the articles themselves contain only the link to the article and no text. I've attached the resulting ePub file. Thanks again for your assistance.

Here's the recipe I am using:

Code:

from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

# re and urllib are both used by get_article_url below
import re
import urllib

from calibre.web.feeds.news import BasicNewsRecipe

class ChicagoTribune(BasicNewsRecipe):

    title       = 'Chicago Tribune'
    __author__  = 'Kovid Goyal and Sujata Raman, a.peter'
    description = 'Politics, local and business news from Chicago'
    language    = 'en'
    version     = 2

    use_embedded_content = False
    no_stylesheets       = True
    remove_javascript    = True
    recursions           = 1

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                           ]
    remove_tags_after = [{'class':['photo_article',]}]

    match_regexps = [r'page=[0-9]+']

    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer",'gallery-subcontent','subFooter']},
                   {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent",'nextgen-share-tools','outbrainTools', 'google-ad-story-bottom']},
                   dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
                    .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .entry-asset.asset.hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
    feeds = [
             ('Latest news', 'http://feeds.chicagotribune.com/chicagotribune/news/'),
             ('Julie\'s Health Club', 'http://feeds.chicagotribune.com/chicagotribune_julieshealthclub/'),
             ]


#    def get_article_url(self, article):
#        url = article.get('feedburner_origlink', article.get('guid', article.get('link')))
#        if url.endswith('?track=rss'):
#            url = url.partition('?')[0]
#        return url

    def get_article_url(self, article):
        ans = None
        try:
            s = article.summary
            ans = urllib.unquote(
                re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
        except:
            pass
        if ans is None:
            ans = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        if ans is not None:
            return ans.replace('?track=rss', '')

    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)

    def postprocess_html(self, soup, first_fetch):
        # Remove the navigation bar. It was kept until now to be able to follow
        # the links to further pages. But now we don't need them anymore.
        for nav in soup.findAll(attrs={'class':['toppaginate','article-nav clearfix']}):
            nav.extract()

        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

        for tag in soup.findAll('form', dict(attrs={'name':["comments_form"]})):
            tag.extract()
        for tag in soup.findAll('font', dict(attrs={'id':["cr-other-headlines"]})):
            tag.extract()

        return soup

kovidgoyal · 03-01-2012, 12:44 AM

Use the builtin recipe. I've already updated it.

cornfieldcraig · 03-01-2012, 09:38 PM

I can't explain it, but the built-in recipe is acting the same. It downloads article titles and descriptions, but nothing else. Same thing on two different computers, one running XP and the other Vista.

kovidgoyal · 03-04-2012, 01:28 AM

Hmm well I cannot replicate it, so I cannot fix the recipe for it.

cornfieldcraig · 03-05-2012, 08:50 PM

Thanks anyway, Kovid. The only reason I can think of that you're not seeing the problem is that it has something to do with a cookie, that is, if calibre's browser supports them. Once the first countdown expires in the browser, subsequent pages do not have the countdown timer.

I think if there were a way to go to the URL associated with location.href (if present in the page source), it would work. The current skip_ad_pages function, which you added a few months ago to skip a very similar countdown timer, looks like it would be close to fixing the problem. What has changed is that before, the javascript displayed the text "click here to continue to article", whereas now it displays just the numbers (18 to 0) in the countdown, so the text it displays is no longer unique enough to search on.

Current skip_ad_pages function

Code:

    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)

New Javascript Countdown Code

Code:

    <script type="text/javascript">
    $(document).ready(function(){
        doCountdown(18000/1000);
        setTimeout( 'location.href = \'http://www.chicagotribune.com/news/chi-shootings-in-washington-park-englewood-leave-1-dead-1-wounded-20120302,0,4578641.story?track=rss\'',18000);
    });

    function doCountdown(countdownTime) {
        countdownRemaining = countdownTime - 1;
        if(countdownRemaining > 0) {
            $("#timeCountdown").text(countdownRemaining);
            setTimeout("doCountdown(countdownRemaining);", 1000);
        }
    };
    </script>
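
A rough sketch of the location.href idea described above, written as a standalone function so it can be tried outside calibre. The name find_countdown_redirect and the regular expression are my own assumptions about how the redirect could be matched. The sample page is trimmed from the countdown script quoted in this post, and nothing here has been tested against the live site.

```python
import re

# Matches: location.href = \'http://...\'  (the backslashes are optional,
# in case the quotes are not escaped in some variant of the page).
REDIRECT_RE = re.compile(r"location\.href\s*=\s*\\?'(https?://[^'\\]+)\\?'")

# Trimmed copy of the countdown script quoted in this thread.
SAMPLE_PAGE = r"""
<script type="text/javascript">
$(document).ready(function(){
    doCountdown(18000/1000);
    setTimeout( 'location.href = \'http://www.chicagotribune.com/news/chi-shootings-in-washington-park-englewood-leave-1-dead-1-wounded-20120302,0,4578641.story?track=rss\'',18000);
});
</script>
"""


def find_countdown_redirect(html):
    """Return the URL the countdown script redirects to, or None."""
    match = REDIRECT_RE.search(html)
    return match.group(1) if match else None


print(find_countdown_redirect(SAMPLE_PAGE))
```

In the recipe, skip_ad_pages could presumably run this over the page source and, when a URL comes back, return self.index_to_soup(url, raw=True) the same way the current function does for the "click here" link.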
