02-28-2012, 09:12 AM | #1 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2011
Location: Chicago, Illinois, USA
Device: Nook Simple Touch
|
More Chicago Tribune antics - recipe broken again
Well, it looks like the Chicago Tribune is trying harder to prevent folks from downloading their RSS content in batch mode. They've implemented a new countdown timer: a black box pops up, counts down from 18 seconds to zero, and then displays the article. There's no text in the box, just the numbers, but if you close the box by clicking the pseudo-X in the upper right-hand corner, it goes to the article immediately. The result is that no articles actually download -- just the article titles.
Here's a sample URL: http://feedproxy.google.com/~r/chica...Bg/story01.htm
Here's the bit of code that seems to be causing the problem:
<script type="text/javascript">
$(document).ready(function(){
    doCountdown(18000/1000);
    setTimeout( 'location.href = \'http://www.chicagotribune.com/sports/hockey/blackhawks/ct-spt-0228-blackhawks-trade-chicago--20120228,0,3141821.story?track=rss\'',18000);
});
function doCountdown(countdownTime) {
    countdownRemaining = countdownTime - 1;
    if(countdownRemaining > 0) {
        $("#timeCountdown").text(countdownRemaining);
        setTimeout("doCountdown(countdownRemaining);", 1000);
    }
};
</script> |
02-28-2012, 09:31 AM | #2 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Easily fixed. Just change get_article_url to
Code:
# uses re and urllib, so the recipe needs "import re, urllib" at the top
def get_article_url(self, article):
    ans = None
    try:
        s = article.summary
        ans = urllib.unquote(
            re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
    except:
        pass
    if ans is not None:
        return ans.replace('?track=rss', '') |
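To see what the regular expression in the snippet above pulls out of a feed summary, here is a small standalone sketch. The summary string and the helper name `url_from_summary` are made up for illustration (the real Tribune feed markup is not shown in this thread), and the import shim is only there so the same code runs on both Python 2 and Python 3.

```python
import re

try:
    from urllib import unquote          # Python 2, as in the recipe code above
except ImportError:
    from urllib.parse import unquote    # Python 3

# Hypothetical feed summary; the actual markup may differ.
SAMPLE_SUMMARY = (
    '<p>Story teaser.</p>'
    '<a href="http://example.com/bookmark.cfm?title=x&amp;'
    'link=http%3A%2F%2Fwww.chicagotribune.com%2Fnews%2Fsample.story%3Ftrack%3Drss">share</a>'
)


def url_from_summary(summary):
    """Return the decoded link= target of a bookmark.cfm href, or None."""
    match = re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', summary)
    if match is None:
        return None
    # The captured value is percent-encoded, so decode it before
    # stripping the tracking suffix.
    return unquote(match.group(1)).replace('?track=rss', '')


print(url_from_summary(SAMPLE_SUMMARY))
```

Unlike the recipe method, this version checks for a missing match explicitly instead of relying on a bare `except`, which makes it easier to notice when the summary simply contains no bookmark link.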
02-28-2012, 08:23 PM | #3 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2011
Location: Chicago, Illinois, USA
Device: Nook Simple Touch
|
Thanks Kovid.
get_article_url had already been redefined in that recipe to:
def get_article_url(self, article):
    print article.get('feedburner_origlink', article.get('guid', article.get('link')))
    return article.get('feedburner_origlink', article.get('guid', article.get('link')))
Simply replacing it with the new code doesn't seem to work, unfortunately. I suspect that the two solutions need to be merged somehow. |
02-28-2012, 10:21 PM | #4 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
# uses re and urllib, so the recipe needs "import re, urllib" at the top
def get_article_url(self, article):
    ans = None
    try:
        s = article.summary
        ans = urllib.unquote(
            re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
    except:
        pass
    if ans is None:
        ans = article.get('feedburner_origlink', article.get('guid', article.get('link')))
    if ans is not None:
        return ans.replace('?track=rss', '') |
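The fallback part of the merged method can be tried in isolation. The sketch below assumes a plain dict is a fair stand-in for a feed entry (real entries come from the feed parser and may behave differently), and `fallback_url` is a hypothetical name used only here.

```python
def fallback_url(article):
    """Mirror the nested article.get() chain: origlink, then guid, then link."""
    url = article.get('feedburner_origlink',
                      article.get('guid', article.get('link')))
    if url is not None:
        # Drop the tracking suffix, as the recipe method does.
        return url.replace('?track=rss', '')
    return None


# Plain dicts standing in for feed entries (an assumption for this sketch).
print(fallback_url({'link': 'http://example.com/a.story?track=rss'}))
print(fallback_url({'guid': 'http://example.com/g', 'link': 'http://example.com/l'}))
print(fallback_url({}))
```

The order matters: `feedburner_origlink` wins when present, `guid` comes next, and `link` is only used when neither of the others exists.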
02-29-2012, 08:55 AM | #5 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2011
Location: Chicago, Illinois, USA
Device: Nook Simple Touch
|
Thanks again Kovid, but still no go. With this last change, we're back to at least retrieving the titles, but still no actual content. It doesn't seem to be getting past the countdown timer.
Here's one of the links it attempted to retrieve today: http://chicagotribune.feedsportal.co...ss/story01.htm |
02-29-2012, 09:27 AM | #6 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I just tried it with that and got a full download.
|
03-01-2012, 12:42 AM | #7 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2011
Location: Chicago, Illinois, USA
Device: Nook Simple Touch
|
That's really strange. Just to eliminate as many variables as possible, I took the following steps to implement your suggested change:
Here's the recipe I am using:
Code:
from __future__ import with_statement

__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

# Added: get_article_url below uses re and urllib, so import them explicitly
import re
import urllib

from calibre.web.feeds.news import BasicNewsRecipe


class ChicagoTribune(BasicNewsRecipe):

    title = 'Chicago Tribune'
    __author__ = 'Kovid Goyal and Sujata Raman, a.peter'
    description = 'Politics, local and business news from Chicago'
    language = 'en'
    version = 2

    use_embedded_content = False
    no_stylesheets = True
    remove_javascript = True
    recursions = 1

    keep_only_tags = [
        dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
        dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
    ]
    remove_tags_after = [{'class':['photo_article',]}]
    match_regexps = [r'page=[0-9]+']

    remove_tags = [
        {'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer",'gallery-subcontent','subFooter']},
        {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent",'nextgen-share-tools','outbrainTools', 'google-ad-story-bottom']},
        dict(name='font',attrs={'id':["cr-other-headlines"]}),
    ]

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        .byline {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
        .date {font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        .copyright {font-family:Arial,Helvetica,sans-serif;font-size:xx-small;text-align:center}
        .story{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        .entry-asset asset hentry{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        .pagebody{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        .maincontentcontainer{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        .story-body{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''

    feeds = [
        ('Latest news', 'http://feeds.chicagotribune.com/chicagotribune/news/'),
        ('Julie\'s Health Club', 'http://feeds.chicagotribune.com/chicagotribune_julieshealthclub/'),
    ]

    # def get_article_url(self, article):
    #     url = article.get('feedburner_origlink', article.get('guid', article.get('link')))
    #     if url.endswith('?track=rss'):
    #         url = url.partition('?')[0]
    #     return url

    def get_article_url(self, article):
        ans = None
        try:
            s = article.summary
            ans = urllib.unquote(
                re.search(r'href=".+?bookmark.cfm.+?link=(.+?)"', s).group(1))
        except:
            pass
        if ans is None:
            ans = article.get('feedburner_origlink', article.get('guid', article.get('link')))
        if ans is not None:
            return ans.replace('?track=rss', '')

    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)

    def postprocess_html(self, soup, first_fetch):
        # Remove the navigation bar. It was kept until now to be able to follow
        # the links to further pages. But now we don't need them anymore.
        for nav in soup.findAll(attrs={'class':['toppaginate','article-nav clearfix']}):
            nav.extract()

        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

        for tag in soup.findAll('form', dict(attrs={'name':["comments_form"]})):
            tag.extract()
        for tag in soup.findAll('font', dict(attrs={'id':["cr-other-headlines"]})):
            tag.extract()

        return soup |
03-01-2012, 12:44 AM | #8 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use the builtin recipe. I've already updated it.
|
03-01-2012, 09:38 PM | #9 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2011
Location: Chicago, Illinois, USA
Device: Nook Simple Touch
|
I can't explain it, but the built-in recipe is acting the same. It downloads article titles and descriptions, but nothing else. Same thing on two different computers, one running XP and the other Vista.
|
03-04-2012, 01:28 AM | #10 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Hmm well I cannot replicate it, so I cannot fix the recipe for it.
|
03-05-2012, 08:50 PM | #11 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2011
Location: Chicago, Illinois, USA
Device: Nook Simple Touch
|
Thanks anyway, Kovid. The only reason I can think of that you're not seeing the problem is that it has something to do with a cookie, that is, if calibre's browser supports them. Once the first countdown expires in the browser, subsequent pages do not have the countdown timer.
I think if there were a way to go to the URL associated with location.href (if present in the page source), it would work. The current skip_ad_pages function, which you added a few months ago to skip a very similar countdown timer, looks like it would be close to fixing the problem. What has changed is that before, the javascript displayed the text "click here to continue to article", whereas now it displays just the numbers (18 to 0) in the countdown, so the text it displays is no longer unique enough to search on.
Current skip_ad_pages function:
Code:
def skip_ad_pages(self, soup):
    text = soup.find(text='click here to continue to article')
    if text:
        a = text.parent
        url = a.get('href')
        if url:
            return self.index_to_soup(url, raw=True)
Code:
<script type="text/javascript">
$(document).ready(function(){
    doCountdown(18000/1000);
    setTimeout( 'location.href = \'http://www.chicagotribune.com/news/chi-shootings-in-washington-park-englewood-leave-1-dead-1-wounded-20120302,0,4578641.story?track=rss\'',18000);
});
function doCountdown(countdownTime) {
    countdownRemaining = countdownTime - 1;
    if(countdownRemaining > 0) {
        $("#timeCountdown").text(countdownRemaining);
        setTimeout("doCountdown(countdownRemaining);", 1000);
    }
};
</script> |
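The idea of following the location.href target could look roughly like the sketch below. It only covers the extraction step, on a shortened sample of the countdown script with a placeholder URL; `REDIRECT_RE` and `find_redirect_url` are hypothetical names, and nothing here has been tried against the live site.

```python
import re

# Matches the target of: setTimeout('location.href = \'URL\'', 18000)
# The backslashes before the quotes are optional so that an unescaped
# variant of the same assignment would also match.
REDIRECT_RE = re.compile(r"location\.href\s*=\s*\\?'(.+?)\\?'")

# Shortened sample of the countdown script, with a placeholder URL.
SAMPLE_PAGE = (
    r"<script>$(document).ready(function(){ doCountdown(18000/1000); "
    r"setTimeout( 'location.href = \'http://www.chicagotribune.com/news/"
    r"sample.story?track=rss\'',18000); });</script>"
)


def find_redirect_url(page_source):
    """Return the URL assigned to location.href in the page, or None."""
    match = REDIRECT_RE.search(page_source)
    return match.group(1) if match else None


print(find_redirect_url(SAMPLE_PAGE))
```

Inside the recipe, skip_ad_pages could presumably run this on the page source and, when a URL comes back, hand it to self.index_to_soup(url, raw=True) the way the existing function does. That wiring is an assumption and untested.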
|