Discard non-existent redirected article?

snarkophilus · 07-21-2012, 10:17 AM

Hi folks,

With my recipe here, I occasionally get the case where the RSS feed points to an invalid article (I guess this is some sort of race condition issue). When this happens, the request for the article redirects to an index page. This wouldn't be a problem, except that the index page has a heap of content and my Sony PRS-T1 spends a minute or two trying to render it.

Ideally, I'd like to discard the page if I can detect a redirect to an index URL. Here's part of a debug log with the cookie hidden:

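Spoiler:

    Fetching http://www.autosport.com/news/report.php/id/101289
    Downloaded article: Kovalainen upbeat after aero test from http://www.autosport.com/news/report.php/id/101288
    17% Article downloaded: Kovalainen upbeat after aero test
    send: 'GET /news/report.php/id/101289 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.autosport.com\r\nCookie: xxx\r\nConnection: close\r\nAccept: /\r\nUser-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.13) Gecko/20101210 Gentoo Firefox/3.6.13\r\n\r\n'
    reply: 'HTTP/1.1 302 Found\r\n'
    header: Date: Sat, 21 Jul 2012 13:55:09 GMT
    header: Server: Apache
    header: Expires: Thu, 19 Nov 1981 08:52:00 GMT
    header: Last-Modified: Sat, 21 Jul 2012 13:55:09GMT
    header: Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    header: Pragma: no-cache
    header: Location: http://www.autosport.com/news/
    header: Vary: Accept-Encoding,User-Agent
    header: Content-Length: 0
    header: Connection: close
    header: Content-Type: text/html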

I'd like to detect the bold bit, i.e. the Location: header of the 302 reply. I've spent a bit of time digging around. The closest I can find is feed.articles.remove() called from parse_feeds(), but that seems to run before the articles are downloaded, so before I can detect the redirect.

Is what I want to do possible?

Cheers,
Simon.
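
To see how often the feed hits this, the debug log itself can be scanned. The following is a minimal sketch, not part of calibre: it assumes the log uses the send:/reply:/header: line layout of Python's HTTP debug output, and the names find_index_redirects and INDEX_URL are made up for illustration.

```python
import re

# Index page the bad article links redirect to (taken from the thread).
INDEX_URL = "http://www.autosport.com/news/"


def find_index_redirects(log_text, index_url=INDEX_URL):
    """Return the request paths whose reply redirected to index_url."""
    hits = []
    path = None         # path of the most recent request seen in the log
    redirected = False  # True when the reply to that request was a redirect
    for line in log_text.splitlines():
        line = line.strip()
        m = re.match(r"send: '(?:GET|HEAD) (\S+) HTTP/", line)
        if m:
            path, redirected = m.group(1), False
            continue
        if line.startswith("reply: "):
            # Only 301/302/303/307 status lines count as redirects here.
            redirected = re.search(r"HTTP/1\.[01] 30[1237]", line) is not None
            continue
        if redirected and line.startswith("header: Location:"):
            # Split on the first two colons only, so the URL stays intact.
            target = line.split(":", 2)[2].strip()
            if path is not None and target.rstrip("/") == index_url.rstrip("/"):
                hits.append(path)
    return hits
```

Feeding it the text of a saved debug log gives the list of article paths that bounced to the index page. It only diagnoses the problem after the fact and does not change what the recipe downloads.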

kovidgoyal · 07-21-2012, 10:36 AM

Implement preprocess_raw_html or preprocess_html and return None when you detect the index page.

snarkophilus · 07-22-2012, 03:58 AM

Thanks for the hint!

I've got the following, which appears to work in my test. The index page doesn't have an og:url property, so checking for it seems a pretty safe way to tell whether we're looking at a valid article. I'm returning an "invalid article" page instead of None so that I can see when the problem actually happens.

Code:

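    from calibre.ebooks.BeautifulSoup import BeautifulSoup  # at the top of the recipe

    def preprocess_html(self, soup):
        # Real articles carry an og:url meta tag; the index page does not.
        url = soup.find('meta', attrs={'property':'og:url'})
        if url is None:
            # Swap the heavy index page for a tiny placeholder page.
            return BeautifulSoup('<html><head><title>Invalid article</title></head><body>Invalid article</body></html>')
        return soup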

Two more questions:

1) In preprocess_html, can you access the original article structure (i.e. url, title, etc.)? I can't see anything like this in any of the recipes I looked at (there are a lot!).
2) In preprocess_html, can I access the actual fetched URL - the header: Location: http://www.autosport.com/news/ in my log file? That seems safer than looking for a given bit of metadata in the html.

Cheers,
Simon.

kovidgoyal · 07-22-2012, 04:03 AM

1) No
2) No, use preprocess_raw_html
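
A minimal sketch of the raw-HTML variant, using the preprocess_raw_html name from the first reply. It assumes calibre calls the hook as preprocess_raw_html(self, raw_html, url) with the page source as a string. The thread does not confirm whether url reflects the redirect target, so the sketch ignores it and applies the same og:url test to the raw markup. IndexPageFilter, PLACEHOLDER and OG_URL_RE are made-up names; in a real recipe the method would sit on the BasicNewsRecipe subclass.

```python
import re

# Tiny page returned in place of the heavy index page (same text as the
# preprocess_html version earlier in the thread).
PLACEHOLDER = ('<html><head><title>Invalid article</title></head>'
               '<body>Invalid article</body></html>')

# Matches <meta ... property="og:url" ...> in raw markup, either quote style.
OG_URL_RE = re.compile(r'<meta\b[^>]*\bproperty\s*=\s*["\']og:url["\']',
                       re.IGNORECASE)


class IndexPageFilter(object):
    """Stand-in for a recipe class, so the sketch runs without calibre."""

    def preprocess_raw_html(self, raw_html, url):
        # No og:url meta tag: treat the page as the index and replace it.
        if OG_URL_RE.search(raw_html) is None:
            return PLACEHOLDER
        # Otherwise hand the article source back unchanged.
        return raw_html
```

Because the check runs on the unparsed source, the large index page is thrown away before any parsing work is done on it.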

snarkophilus · 07-27-2012, 02:47 AM

I got a proper "invalid article" today in my daily news, and it was a simple page saying that instead of the slow-to-load home page. Success!

I still couldn't figure out how to get preprocess_raw_html to work, but I've achieved my initial goal, so I'm happy. Thanks for your help!

Cheers,
Simon.

