MobileRead Forums > E-Book Software > Calibre > Recipes

Old 07-21-2012, 10:17 AM   #1
snarkophilus
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Discard non-existent redirected article?

Hi folks,

With my recipe here, I occasionally get the case where the RSS feed points to an invalid article (I guess this is some sort of race condition issue). When this happens, the request for the article redirects to an index page. This wouldn't be a problem, except that the index page has a heap of content and my Sony PRS-T1 spends a minute or two trying to render it.

Ideally, I'd like to discard the page if I can detect a redirect to an index URL. Here's part of a debug log with the cookie hidden:

Spoiler:

Fetching http://www.autosport.com/news/report.php/id/101289
Downloaded article: Kovalainen upbeat after aero test from http://www.autosport.com/news/report.php/id/101288
17% Article downloaded: Kovalainen upbeat after aero test
send: 'GET /news/report.php/id/101289 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.autosport.com\r\nCookie: xxx\r\nConnection: close\r\nAccept: */*\r\nUser-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.13) Gecko/20101210 Gentoo Firefox/3.6.13\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date: Sat, 21 Jul 2012 13:55:09 GMT
header: Server: Apache
header: Expires: Thu, 19 Nov 1981 08:52:00 GMT
header: Last-Modified: Sat, 21 Jul 2012 13:55:09GMT
header: Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Pragma: no-cache
header: Location: http://www.autosport.com/news/
header: Vary: Accept-Encoding,User-Agent
header: Content-Length: 0
header: Connection: close
header: Content-Type: text/html


I'd like to detect the redirect itself, i.e. the 302 reply and its Location: http://www.autosport.com/news/ header. I've spent a bit of time digging around. The closest I can find is feed.articles.remove() called from parse_feeds(), but that seems to run before the articles are downloaded, so before I can detect the redirect.

Is what I want to do possible?

Cheers,
Simon.
Old 07-21-2012, 10:36 AM   #2
kovidgoyal
creator of calibre
Posts: 45,253
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Implement preprocess_raw_html or preprocess_html and return None when you detect the index page.
Old 07-22-2012, 03:58 AM   #3
snarkophilus
Thanks for the hint!

I've got the following, which appears to work in my test. The index page doesn't have an og:url property, so checking for it seems a pretty safe way to tell whether we're looking at a valid article. I'm returning an "invalid article" page so that I can see when the problem actually happens.

Code:
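    # Needs, at the top of the recipe file:
    #   from calibre.ebooks.BeautifulSoup import BeautifulSoup
    def preprocess_html(self, soup):
        # Real articles carry an og:url meta tag; the index page does not.
        url = soup.find('meta', attrs={'property':'og:url'})
        if url is None:
            # Swap in a tiny placeholder page so the failure stays visible.
            return BeautifulSoup('<html><head><title>Invalid article</title></head><body>Invalid article</body></html>')
        return soup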
Two more questions:

1) In preprocess_html, can you access the original article structure (i.e. url, title, etc.)? I can't see anything like this in any of the recipes I looked at (there are a lot!).
2) In preprocess_html, can I access the actual fetched URL - the header: Location: http://www.autosport.com/news/ in my log file? That seems safer than looking for a given bit of metadata in the html.

Cheers,
Simon.
Old 07-22-2012, 04:03 AM   #4
kovidgoyal
1) No
2) No, use preprocess_raw_html
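
A minimal sketch of that suggestion, assuming the hook meant here is calibre's preprocess_raw_html(self, raw_html, url), which receives the page source as text before it is parsed. It is written as a plain class so that it runs without calibre; in a real recipe the method would sit on the BasicNewsRecipe subclass. The class name, the regular expression and the placeholder page are illustrative only. The sketch does not assume that url reveals the redirect target, so it still tests for the og:url tag.

```python
import re
from html import escape

# Matches <meta ... property="og:url" ...> with either quote style.
OG_URL = re.compile(r'''<meta\b[^>]*\bproperty\s*=\s*["']og:url["']''', re.IGNORECASE)

PLACEHOLDER = ('<html><head><title>Invalid article</title></head>'
               '<body>Invalid article: {url}</body></html>')


class RedirectAwareSketch:
    # Hypothetical stand-in for a recipe class; not part of calibre.

    def preprocess_raw_html(self, raw_html, url):
        # A page with an og:url tag is treated as a real article.
        if OG_URL.search(raw_html):
            return raw_html
        # No og:url: assume the request was bounced to the index page
        # and substitute a small page naming the URL that was requested.
        return PLACEHOLDER.format(url=escape(url))
```

Unlike preprocess_html, this hook is handed a URL, so the placeholder can say which article went missing. Returning None instead of a placeholder, as suggested earlier in the thread, is the alternative; how calibre treats that return value is not verified here.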
Old 07-27-2012, 02:47 AM   #5
snarkophilus
I got a proper "invalid article" today in my daily news, and it was a simple page saying that instead of the slow-to-load home page. Success!

I still couldn't figure out how to get preprocess_raw_html to work, but I've achieved my initial goal, so I'm happy. Thanks for your help!

Cheers,
Simon.
All times are GMT -4.