Old 07-22-2012, 03:58 AM   #3
snarkophilus
Thanks for the hint!

I've got the following, which appears to work in my test. The index page doesn't have an og:url property, so checking for it seems a pretty safe way to tell whether we're looking at a valid article. I'm returning an "Invalid article" page so that I can see when the problem actually happens.

Code:
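    # At the top of the recipe: from calibre.ebooks.BeautifulSoup import BeautifulSoup
    def preprocess_html(self, soup):
        # Real articles carry an og:url meta tag; the index page does not.
        url = soup.find('meta', attrs={'property':'og:url'})
        if url is None:
            # Return a visible placeholder so the bad fetch shows up in the output.
            return BeautifulSoup('<html><head><title>Invalid article</title></head><body>Invalid article</body></html>')
        return soup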
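
For trying the og:url check on saved pages outside calibre, here is a rough standalone sketch using only the Python 3 standard library. The names OgUrlFinder and find_og_url are made up for illustration and are not part of calibre, and the real recipe still uses soup.find as above.

```python
from html.parser import HTMLParser


class OgUrlFinder(HTMLParser):
    """Remembers the content of the first <meta property="og:url"> tag seen."""

    def __init__(self):
        super().__init__()
        self.og_url = None  # stays None when the page has no og:url tag

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag == 'meta' and self.og_url is None:
            attr_map = dict(attrs)
            if attr_map.get('property') == 'og:url':
                # A tag without a content attribute yields '' rather than None,
                # so "is None" still means "no og:url tag at all".
                self.og_url = attr_map.get('content') or ''


def find_og_url(html):
    """Return the og:url value from an HTML string, or None if the tag is absent."""
    finder = OgUrlFinder()
    finder.feed(html)
    return finder.og_url
```

Calling find_og_url on the text of a saved index page should give None, while a saved article page should give back whatever its og:url tag contains.
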
Two more questions:

1) In preprocess_html, can you access the original article details (i.e., url, title, etc.)? I can't see anything like this in any of the recipes I looked at (and there are a lot!).
2) In preprocess_html, can I access the actual fetched URL, i.e. the "Location: http://www.autosport.com/news/" header that shows up in my log file? That seems safer than looking for a given bit of metadata in the HTML.

Cheers,
Simon.