MobileRead Forums - View Single Post - Discard non-existent redirected article?

snarkophilus · 07-22-2012, 04:58 AM

Thanks for the hint!

I've got the following, which appears to work in my test. The index page doesn't have an og:url property, so checking for it seems a pretty safe way to tell whether we're looking at a valid article. For now I'm returning an "Invalid article" page so that I can see when the problem actually happens.

Code:

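    from calibre.ebooks.BeautifulSoup import BeautifulSoup

    def preprocess_html(self, soup):
        # Real articles carry an og:url meta tag; the index page does not.
        url = soup.find('meta', attrs={'property':'og:url'})
        if url is None:
            # Visible placeholder so redirected articles show up in the output.
            return BeautifulSoup('<html><head><title>Invalid article</title></head><body>Invalid article</body></html>')
        return soup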
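
For anyone who wants to try the og:url test outside of calibre, here is a minimal standalone sketch using only the Python standard library. The names OgUrlFinder, find_og_url and looks_like_article are made up for this illustration and are not part of calibre or of my recipe; the sample pages in the usage note are invented too.

```python
from html.parser import HTMLParser


class OgUrlFinder(HTMLParser):
    """Hypothetical helper: remembers the first <meta property="og:url"> tag."""

    def __init__(self):
        super().__init__()
        self.og_url = None  # stays None when no og:url meta tag is present

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag == 'meta' and self.og_url is None:
            attrs = dict(attrs)
            if attrs.get('property') == 'og:url':
                # A missing or empty content attribute is stored as ''.
                self.og_url = attrs.get('content') or ''


def find_og_url(html):
    """Return the og:url content of the page, or None if the tag is absent."""
    finder = OgUrlFinder()
    finder.feed(html)
    return finder.og_url


def looks_like_article(html):
    """Same idea as the recipe check: a page without og:url is not an article."""
    return find_og_url(html) is not None
```

Feeding it a page that contains a meta tag with property og:url returns that tag's content, while a page without the tag gives None, which is the case the recipe replaces with the "Invalid article" placeholder.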

Two more questions:

1) In preprocess_html, can you access the original article structure (i.e., url, title, etc.)? I can't see anything like this in any of the recipes I looked at (there are a lot!).
2) In preprocess_html, can I access the actual fetched URL - the header: Location: http://www.autosport.com/news/ in my log file? That seems safer than looking for a given bit of metadata in the html.

Cheers,
Simon.
