#1 |
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
|
Hi folks,
With my recipe here, I occasionally hit a case where the RSS feed points to an invalid article (I guess this is some sort of race condition). When this happens, the request for the article redirects to an index page. This wouldn't be a problem, except that the index page has a heap of content and my Sony PRS-T1 spends a minute or two trying to render it. Ideally, I'd like to discard the page if I can detect a redirect to an index URL. Here's part of a debug log with the cookie hidden: Spoiler:
I'd like to detect the bold bit. I've spent a bit of time digging around. The closest I can find is feed.articles.remove() when called from parse_feeds(), but this seems to run before the articles are downloaded, so before I can detect the redirect. Is what I want to do possible?
Cheers, Simon.
#2 |
creator of calibre
Posts: 45,253
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Implement preprocess_raw_html or preprocess_html and return None when you detect the index page.
|
#3 |
Wannabe Connoisseur
|
Thanks for the hint!
I've got the following, which appears to work in my test. The index page doesn't have an og:url property, so checking for it seems a pretty safe way to tell whether we're looking at a valid article. I'm returning an "Invalid article" page so that I can see when the problem actually happens. Code:

    # At the top of the recipe file:
    from calibre.ebooks.BeautifulSoup import BeautifulSoup

    # Inside the recipe class:
    def preprocess_html(self, soup):
        # Real articles carry an og:url meta tag; the index page does not.
        url = soup.find('meta', attrs={'property': 'og:url'})
        if url is None:
            return BeautifulSoup('<html><head><title>Invalid article</title></head><body>Invalid article</body></html>')
        return soup

1) In preprocess_html, can you access the original article structure (i.e. url, title, etc.)? I can't see anything like this in any of the recipes I looked at (there are a lot!).
2) In preprocess_html, can I access the actual fetched URL, i.e. the header "Location: http://www.autosport.com/news/" in my log file? That seems safer than looking for a given bit of metadata in the HTML.

Cheers, Simon.
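The og:url heuristic can be tried outside calibre on saved copies of pages before relying on it in a recipe. The following is a minimal sketch using only the Python standard library. The names `OgUrlFinder` and `find_og_url` are made up for illustration and are not part of calibre, and a meta tag without a `content` attribute is treated the same as a missing tag.

```python
from html.parser import HTMLParser


class OgUrlFinder(HTMLParser):
    """Records the content of the first <meta property="og:url"> tag seen."""

    def __init__(self):
        super().__init__()
        self.og_url = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag == 'meta' and self.og_url is None:
            attr_map = dict(attrs)
            if attr_map.get('property') == 'og:url':
                self.og_url = attr_map.get('content')


def find_og_url(html_text):
    """Return the og:url value, or None if the page does not declare one."""
    finder = OgUrlFinder()
    finder.feed(html_text)
    return finder.og_url
```

Feeding it a saved article page and a saved copy of the index page should give a URL for the first and `None` for the second, if the heuristic holds for the site.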
#4 |
creator of calibre
|
1) No
2) No, use preprocess_raw_html
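For anyone trying the raw-HTML route: the hook is named `preprocess_raw_html` and, as far as the calibre recipe API documents it, takes `(self, raw_html, url)`. Here is a rough sketch, written as a mixin so that it can be run without calibre installed. `SkipIndexPages` and `OG_URL_META` are hypothetical names. The sketch assumes a calibre version whose `BasicNewsRecipe` provides `abort_article()`, so check this against your version. It also does not rely on `url` being the post-redirect address, since that is not confirmed here.

```python
import re

# Matches a <meta ...> tag carrying property="og:url" (either quote style).
OG_URL_META = re.compile(
    r'<meta\b[^>]*\bproperty\s*=\s*["\']og:url["\']', re.IGNORECASE)


class SkipIndexPages:
    """Hypothetical mixin; list it before BasicNewsRecipe in the recipe's bases."""

    def preprocess_raw_html(self, raw_html, url):
        # Assumption: raw_html is normally text; decode defensively if bytes arrive.
        if isinstance(raw_html, bytes):
            raw_html = raw_html.decode('utf-8', 'replace')
        if OG_URL_META.search(raw_html) is None:
            # abort_article() is expected to raise, so nothing after it runs.
            self.abort_article('No og:url found, skipping ' + url)
        return raw_html
```

In a recipe this would be used as something like `class MyRecipe(SkipIndexPages, BasicNewsRecipe)`. If `abort_article()` is not available, returning a small placeholder page, as in the `preprocess_html` version earlier in the thread, is a fallback.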
#5 |
Wannabe Connoisseur
|
I got a proper "invalid article" today in my daily news, and it was a simple page saying that instead of the slow-to-load home page. Success!
I still couldn't figure out how to get preprocess_html_raw to work, but I've achieved my initial goal, so I'm happy. Thanks for your help!
Cheers, Simon.