Thanks for the hint!
I've got the following, which appears to work in my test. The index page doesn't have an og:url property, so checking for it seems a pretty safe way to tell whether we're looking at a valid article. I'm returning an "Invalid article" page so that I can see when the problem actually happens.
Code:
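    # Needs, at the top of the recipe:
    #   from calibre.ebooks.BeautifulSoup import BeautifulSoup
    def preprocess_html(self, soup):
        # Real articles carry an og:url meta tag; the index page does not.
        url = soup.find('meta', attrs={'property': 'og:url'})
        if url is None:
            # Return a visible placeholder so the failure shows up in the output.
            return BeautifulSoup('<html><head><title>Invalid article</title></head><body>Invalid article</body></html>')
        return soup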
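For anyone who wants to try the og:url check outside calibre, here is a minimal standalone sketch of the same idea. It uses Python's standard html.parser instead of BeautifulSoup, so it is a stand-in for the recipe code rather than a copy of it. The names OgUrlFinder and find_og_url and the example.com URL are made up for illustration.

```python
from html.parser import HTMLParser


class OgUrlFinder(HTMLParser):
    """Records the content of the first <meta property="og:url"> tag seen."""

    def __init__(self):
        super().__init__()
        self.og_url = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "meta" and self.og_url is None:
            attr_map = dict(attrs)
            if attr_map.get("property") == "og:url":
                self.og_url = attr_map.get("content")


def find_og_url(html):
    """Return the og:url value of the page, or None if there is no such tag."""
    finder = OgUrlFinder()
    finder.feed(html)
    finder.close()
    return finder.og_url


# Hypothetical pages: one article with og:url, one index page without it.
article = ('<html><head><meta property="og:url" content="http://example.com/a">'
           '</head><body>text</body></html>')
index = '<html><head><title>News</title></head><body>list</body></html>'

for name, page in (("article", article), ("index", index)):
    verdict = "valid article" if find_og_url(page) is not None else "invalid article"
    print(name, "->", verdict)
```

This only checks that the tag is present, which is the same test the recipe method makes; it says nothing about which URL was actually fetched.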
Two more questions:
1) In preprocess_html, can you access the original article structure (i.e., url, title, etc.)? I can't see anything like this in any of the recipes I looked at (there are a lot!).
2) In preprocess_html, can I access the actual fetched URL, i.e. the header "Location: http://www.autosport.com/news/" that shows up in my log file? That seems safer than looking for a given bit of metadata in the HTML.
Cheers,
Simon.