10-29-2011, 06:08 AM   #1
scissors
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
"preprocess_regexps = [(re.compile..." bugged?

Hi Chaps.

Can someone confirm that

preprocess_regexps = [
(re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

OR

preprocess_regexps = [
(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

should completely remove a downloaded page's <head> section?
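For what it's worth, here is a minimal standalone check of both patterns outside calibre. The sample page is made up for illustration (it is not the real Sun markup), and `strip_head` is just a hypothetical helper name; the patterns and the replacement lambda are the ones from the recipe.

```python
import re

# Made-up sample page for illustration only; not the real Sun markup.
SAMPLE = (
    '<html><HEAD><title>t</title>\n'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>\n'
    '</HEAD><body><p>Article text</p></body></html>'
)

# The two patterns from the recipe: greedy (.*) and non-greedy (.*?).
greedy = re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL)
lazy = re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL)


def strip_head(pattern, html):
    """Hypothetical helper: apply one (pattern, replacement) pair the way the recipe intends."""
    return pattern.sub(lambda match: '<head></head>', html)


greedy_result = strip_head(greedy, SAMPLE)
lazy_result = strip_head(lazy, SAMPLE)

print(greedy_result)
print(lazy_result)
```

On a page with a single plain <head>...</head> pair like this sample, both patterns give the same result and the meta line is gone, so the regexes themselves look fine in isolation.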

The reason I ask is this: I download the UK tabloid "The Sun", which on the whole works okay, except that every so often an article has stray, corrupted text above the navbar (I've attached an image showing it).

I've checked where that text comes from, and it is in this line of the page source:

<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>

BUT, this line is contained within the <head></head> section, so surely it should have been removed?
Also, why does it appear above the navbar and not in the article? (Is this a bug?)
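One thing that might be worth ruling out (purely an assumption on my part, I have not confirmed it against the real page) is that the opening tag on the affected pages is not a bare <head> but carries attributes, in which case neither pattern above would match at all. A sketch of a more tolerant pattern, tested against another made-up sample:

```python
import re

# Pattern from the recipe: only matches a bare <head> opening tag.
strict = re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL)

# Assumed alternative: also accepts attributes on the opening tag.
# \b stops it from matching <header ...>.
tolerant = re.compile(r'<head\b[^>]*>.*?</head\s*>', re.IGNORECASE | re.DOTALL)

# Made-up sample with an attribute on <head>; not the real Sun markup.
SAMPLE_WITH_ATTR = (
    '<html><head profile="x">'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>'
    '</head><body><p>Article text</p></body></html>'
)


def replace(match):
    return '<head></head>'


strict_result = strict.sub(replace, SAMPLE_WITH_ATTR)      # no match, page unchanged
tolerant_result = tolerant.sub(replace, SAMPLE_WITH_ATTR)  # head section emptied

print('og:title' in strict_result)
print('og:title' in tolerant_result)
```

If the tolerant pattern makes no difference in the recipe either, then the head tag is probably not the issue and the text is getting in some other way.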

The particular page in the example is
http://www.thesun.co.uk/sol/homepage...n-hookers.html

Any help would really be appreciated.
Attached image: calibre.jpg (40.0 KB)