Hi Chaps.
Can someone confirm that
preprocess_regexps = [
(re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
OR
preprocess_regexps = [
(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
should completely remove a downloaded page's <head> section?
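As a quick sanity check outside calibre, here is a minimal sketch that runs both patterns over a small made-up page. The sample string is hypothetical (it is not the real Sun page source, only the meta line is copied from it). With a single <head> block, I would expect the greedy and non-greedy versions to give the same result.

```python
import re

# Hypothetical sample page; only the <meta> line is taken from the real article.
sample = (
    '<html><head>\n'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>\n'
    '<title>Example</title>\n'
    '</head><body><p>Article text</p></body></html>'
)

# The two patterns from the recipe: greedy (.*) and non-greedy (.*?).
greedy = re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL)
lazy = re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL)

# Apply each pattern with the same replacement function the recipe uses.
greedy_result = greedy.sub(lambda match: '<head></head>', sample)
lazy_result = lazy.sub(lambda match: '<head></head>', sample)

print(greedy_result)
print(lazy_result)
```

The two only differ if a page contains more than one </head>: the greedy form then swallows everything up to the last one, while the non-greedy form stops at the first.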
The reason I ask is this. I download the UK tabloid "The Sun", which on the whole works okay, except that every so often an article has corrupted text above the navbar (I've attached an image showing it).
I've checked the page source for that text and it comes from this line:
<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>
But this line is inside the <head></head> section, so surely it should have been removed?
Also, why does it appear above the navbar and not in the article? (Is this a bug?)
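One thing that might be worth ruling out (this is an assumption, not something confirmed against the page source): if the opening tag carries attributes, for example <head profile="...">, the literal <head> in both patterns will not match and nothing gets removed. A minimal sketch with a made-up sample and a more tolerant, hypothetical pattern:

```python
import re

# Made-up sample; the profile attribute is an assumption for illustration only.
sample = (
    '<html><head profile="http://example.com/profile">\n'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>\n'
    '</head><body><p>Article text</p></body></html>'
)

flags = re.IGNORECASE | re.DOTALL

# Pattern as used in the recipe: requires a bare <head> with no attributes.
strict = re.compile(r'<head>.*?</head>', flags)

# Hypothetical variant: \b stops it matching <header>, [^>]* allows attributes.
tolerant = re.compile(r'<head\b[^>]*>.*?</head>', flags)

strict_result = strict.sub('<head></head>', sample)
tolerant_result = tolerant.sub('<head></head>', sample)

print(strict_result == sample)         # did the strict pattern leave the page unchanged?
print('og:title' in tolerant_result)   # is the meta line still present afterwards?
```

If the strict pattern leaves the text unchanged on the real page, that would at least explain why the meta line survives.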
The particular page in the example is
http://www.thesun.co.uk/sol/homepage...n-hookers.html
Any help would really be appreciated.