MobileRead Forums - View Single Post - "preprocess_regexps = [(re.compile..." bugged?

scissors · 10-29-2011, 06:08 AM

Hi Chaps.

Can someone confirm that

preprocess_regexps = [
(re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

OR

preprocess_regexps = [
(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

should totally remove a downloaded page's <head> section?
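A quick way to check both patterns outside calibre is to run them over a small test string. This is only a sketch: SAMPLE is a made-up page rather than the real Sun markup, and strip_head is a hypothetical helper that applies the same pattern and lambda as the recipe entries above.

```python
import re

# Hypothetical sample page (an assumption, not the real Sun markup).
SAMPLE = (
    '<html><head>\n'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>\n'
    '<title>t</title>\n'
    '</head><body><p>Article</p></body></html>'
)

# The two patterns from the recipe: greedy (.*) and non-greedy (.*?).
greedy = re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL)
lazy = re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL)


def strip_head(pattern, html):
    """Replace whatever the pattern matches with an empty head element."""
    return pattern.sub(lambda match: '<head></head>', html)


greedy_result = strip_head(greedy, SAMPLE)
lazy_result = strip_head(lazy, SAMPLE)
```

On a page with a single plain <head> ... </head> pair, both versions give the same result and the meta line disappears. They would only differ if the text "</head>" appeared more than once in the page.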

The reason I ask is this. I download the UK tabloid "The Sun", which on the whole works okay. However, every so often an article has corrupted text above the navbar (I've attached an image showing it).

I've checked the source of that text and it's in the line

<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>

BUT, this line is contained within the <head></head> section, so surely it should have been removed?
Also, why does it appear above the navbar and not in the article? (Is this a bug?)
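One thing that might explain it (purely an assumption, since the real page source isn't reproduced here) is that the literal pattern '<head>' only matches an opening tag with no attributes. If the page used something like <head profile="...">, neither regex would match and the head would be left untouched. The sketch below uses a made-up string WITH_ATTRS and a hypothetical, more tolerant pattern to show the difference.

```python
import re

# Hypothetical page whose opening head tag carries an attribute.
WITH_ATTRS = (
    '<html><head profile="x">\n'
    '<meta name="og:title" content="A: <br/>B"/>\n'
    '</head><body></body></html>'
)

# Pattern from the recipe: requires the exact text "<head>".
strict = re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL)

# More tolerant variant (an assumption, not a calibre-supplied pattern):
# allows attributes on the opening tag and whitespace in the closing tag.
tolerant = re.compile(r'<head\b[^>]*>.*?</head\s*>', re.IGNORECASE | re.DOTALL)

strict_out = strict.sub('<head></head>', WITH_ATTRS)      # no match, unchanged
tolerant_out = tolerant.sub('<head></head>', WITH_ATTRS)  # head emptied
```

The \b after "head" keeps the tolerant pattern from matching a <header> tag by mistake.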

The particular page in the example is
http://www.thesun.co.uk/sol/homepage...n-hookers.html

Help would really be appreciated.
