"preprocess_regexps = [(re.compile..." bugged?

scissors · 10-29-2011, 06:08 AM

Hi Chaps.

Can someone confirm that

preprocess_regexps = [
(re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

OR

preprocess_regexps = [
(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

should totally remove a downloaded page's <head> section?

The reason I ask is this. I download the UK tabloid "The Sun", which on the whole works okay. Except that every so often an article has corruption above the navbar (I've attached an image showing it).

I've checked the source of that text and it's in the line

<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>

BUT, this line is contained within the <head></head> section, so surely it should have been removed?
Also, why does it appear above the navbar and not in the article? (Is this a bug?)

The particular page in the example is
http://www.thesun.co.uk/sol/homepage...n-hookers.html

Help would really be appreciated

scissors · 10-29-2011, 08:34 AM

I managed to remove the "junk" by a simple

remove_tags=[dict(name='head'),

I'd really like to know why the reg exp command failed though...

scissors · 10-29-2011, 11:15 AM

Correction. With remove_tags on <head>, calibre added a class at the end of each article page, which meant some CSS was displayed at the end of the article as though it were article text. This happened only on my prs300, not in calibre's viewer and not in Firefox.

So I took the <head> entry out of remove_tags again. The garbage re-appears in the top navbar when viewed in calibre, but the article now displays okay on the prs300.

It truly is a black art...

The current version of the code is the one that displays correctly on the Sony. The bit of text appearing in the top navbar is only a minor niggle, but if Kovid, Starson et al. could offer a possible reason, I'd be grateful. I've been on it all day and can't figure out why.

Starson17 · 11-01-2011, 02:46 PM

Quote:

Originally Posted by scissors

Hi Chaps.

Can someone confirm that

preprocess_regexps = [
(re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

OR

preprocess_regexps = [
(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

should totally remove a downloaded page's <head> section?

Not necessarily. The head tag might have attributes. I'd have checked to be sure, but this should work:

Code:

preprocess_regexps = [(re.compile(r'<head.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
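
As a rough illustration of the point about attributes, here is a minimal standalone sketch using only Python's re module. The sample page and its profile attribute are made up for the example and are not taken from the Sun site; the two patterns are the ones quoted in this thread.

```python
import re

# Hypothetical sample page. The attribute on <head> is an assumption,
# added only to show the difference between the two patterns.
html = (
    '<html><head profile="http://example.com/profile">'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>'
    '</head><body><p>Article text</p></body></html>'
)

flags = re.IGNORECASE | re.DOTALL
strict = re.compile(r'<head>.*?</head>', flags)  # needs a bare <head>
loose = re.compile(r'<head.*</head>', flags)     # tolerates attributes


def replace(match):
    # Same replacement as the lambda used in the recipe
    return '<head></head>'


strict_result = strict.sub(replace, html)
loose_result = loose.sub(replace, html)

print('strict pattern changed the page:', strict_result != html)
print('loose pattern result:', loose_result)
```

One caveat with the loose pattern: '<head' also matches the start of any tag whose name merely begins with "head", so a tighter variant such as r'<head\b[^>]*>.*?</head>' may be safer on pages that could contain such tags.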

scissors · 11-01-2011, 03:33 PM

Hi Starson.

Thanks for the reply. I thought that was the case - but as the attached image shows, it doesn't always work.

Any idea why it ends up in the navbar and not in the article?

Serpentine · 11-01-2011, 03:55 PM

Are you sure that Google Reader is not breaking that section of text out of the header as it has a formatting element in it? I have no idea how the reader aggregation works - but I have a feeling it might be doing something funny there.

It seems that the <br/> is problematic, the regex run on the sun site itself works just fine.

Starson17 · 11-02-2011, 10:53 AM

Quote:

Originally Posted by scissors

Hi Starson.

Thanks for the reply. I thought that was the case - but as the attached image shows, it doesn't always work.

I was responding to your question about whether it should "totally remove a downloaded pages <head> section".

Quote:

Any idea why it ends up in the navbar and not in the article?

I agree with Serpentine - it's probably the use of the formatting element in what's supposed to be quoted text in the meta tag. Something is getting confused as to where the tags start/stop - probably BeautifulSoup. I would expect preprocess_regexps to be able to handle it, but I can't be sure.

I'd definitely print the soup before and after preprocess_regexps to see what's coming in and whether it's getting processed correctly.
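
One way to do that kind of before/after check outside calibre is sketched below. This is not calibre's own code: apply_regexps is a hypothetical helper that only imitates running each (pattern, function) pair over the raw HTML in order, and the sample page is a made-up stand-in for the downloaded article.

```python
import re


def apply_regexps(raw_html, rules):
    """Apply (compiled_pattern, replacement_function) pairs in order.

    Hypothetical helper for testing a preprocess_regexps list by hand.
    """
    for pattern, func in rules:
        raw_html = pattern.sub(func, raw_html)
    return raw_html


preprocess_regexps = [
    (re.compile(r'<head.*</head>', re.IGNORECASE | re.DOTALL),
     lambda match: '<head></head>'),
]

# Made-up stand-in for the downloaded page, including the meta line
# with the <br/> inside the quoted attribute value.
before = (
    '<html><head>\n'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>\n'
    '</head><body><div id="nav">navbar</div></body></html>'
)
after = apply_regexps(before, preprocess_regexps)

print('BEFORE:', before)
print('AFTER: ', after)
print('og:title still present:', 'og:title' in after)
```

If the regex clears the meta line here but the text still turns up in the recipe output, that would suggest the HTML the recipe actually receives differs from the page source being inspected.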

scissors · 11-02-2011, 02:54 PM

Quote:

Originally Posted by Serpentine

Are you sure that Google Reader is not breaking that section of text out of the header as it has a formatting element in it? I have no idea how the reader aggregation works - but I have a feeling it might be doing something funny there.

It seems that the <br/> is problematic, the regex run on the sun site itself works just fine.

Yeah, I figured it's the <br/> myself. The reason it's via Google is that I tried a standard feed recipe, and for some reason those often fail.

@starson when you say print the soup - I assume the soup is the input html and the created intermediate xhtml. I kinda get the idea of BeautifulSoup, but the whole syntax is beyond me. I study your examples, but even the way variables are declared is hard for a dummy like myself to get to grips with.

(plus the mrs moans about the amount of time I'm spending on it)

Starson17 · 11-02-2011, 03:56 PM

Quote:

Originally Posted by scissors

@starson when you say print the soup - I assume the soup is the input html and the created intermediate xhtml.

Yes, it's the input html. "Soup" is recipe shorthand for html being processed in a recipe by BeautifulSoup. The html processing methods (pre and post) are passed a parameter named "soup" which is the html they receive to be processed. You'll see it in most recipes. It's also used behind the scenes in all the tag handling methods.

http://www.crummy.com/software/Beaut...mentation.html


