Old 10-29-2011, 06:08 AM   #1
scissors
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
"preprocess_regexps = [(re.compile..." bugged?

Hi Chaps.

Can someone confirm that

preprocess_regexps = [
(re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

OR

preprocess_regexps = [
(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

should totally remove a downloaded page's <head> section?
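For reference, here is a minimal sketch of what the two patterns do when tried outside calibre. It assumes each (pattern, function) pair is applied in order with the pattern's sub() method, which may not be exactly how calibre does it. SAMPLE and apply_regexps are made-up names, and SAMPLE is a stand-in page rather than the real Sun markup.

```python
import re

# Hypothetical stand-in for a downloaded page; not the real Sun markup.
SAMPLE = (
    '<html><HEAD><title>t</title>'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>'
    '</HEAD><body><p>Article text</p></body></html>'
)

greedy = re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL)
lazy = re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL)


def apply_regexps(html, regexps):
    """Apply each (pattern, function) pair in order (assumed behaviour)."""
    for pattern, func in regexps:
        html = pattern.sub(func, html)
    return html


greedy_out = apply_regexps(SAMPLE, [(greedy, lambda match: '<head></head>')])
lazy_out = apply_regexps(SAMPLE, [(lazy, lambda match: '<head></head>')])

print(greedy_out)
print(greedy_out == lazy_out)
```

With a single, attribute-free <head> section, both patterns give the same result and the meta line is gone. The greedy and lazy versions would only differ if the text contained more than one </head>.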

The reason I ask is this. I download the UK tabloid "The Sun", which on the whole works okay, except that every so often an article has corrupted text above the navbar (I've attached an image showing it).

I've checked the source of that text and it's in the line

<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>

BUT, this line is contained within the <head></head> section, so surely it should have been removed?
Also, why does it appear above the navbar and not in the article? (Is this a bug?)

The particular page in the example is
http://www.thesun.co.uk/sol/homepage...n-hookers.html

Help would really be appreciated.
Attached image: calibre.jpg
Old 10-29-2011, 08:34 AM   #2
scissors
I managed to remove the "junk" with a simple

remove_tags = [dict(name='head')]

I'd really like to know why the regexp approach failed, though...
Old 10-29-2011, 11:15 AM   #3
scissors
Correction: with remove_tags set to strip <head>, calibre added a class at the end of each article web page, which meant some CSS code was displayed at the end of the article as though it were article text. This happened only on my prs300, not in calibre's viewer and not in Firefox.

So I took the <head> entry out of remove_tags. The garbage reappears in the top navbar when viewed in calibre, but the article now displays okay on the prs300.

It truly is a black art...

The current version of the code is the one that displays correctly on the Sony. The bit of text appearing in the top navbar is only a minor niggle, but perhaps Kovid, Starson et al. could offer a possible reason. I've been on it all day and can't figure out why.
Old 11-01-2011, 02:46 PM   #4
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by scissors View Post
Hi Chaps.

Can someone confirm that

preprocess_regexps = [
(re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

OR

preprocess_regexps = [
(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

should totally remove a downloaded page's <head> section?
Not necessarily. The head tag might have attributes. I'd have checked to be sure, but this should work:
Code:
preprocess_regexps = [(re.compile(r'<head.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
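A quick way to see the attribute point is to run the patterns against a head tag that carries an attribute. This is only a sketch: WITH_ATTRS is an invented test string, strip_head is a made-up helper, and the "tight" pattern is an assumed alternative that nobody in this thread has tested against the real site.

```python
import re

as_posted = re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL)
loose = re.compile(r'<head.*</head>', re.IGNORECASE | re.DOTALL)
# Assumed alternative: \b keeps the match on the tag name 'head' itself,
# [^>]* allows attributes, and .*? stops at the first </head>.
tight = re.compile(r'<head\b[^>]*>.*?</head>', re.IGNORECASE | re.DOTALL)

# Invented sample: a head tag with an attribute.
WITH_ATTRS = '<html><head profile="x"><title>t</title></head><body>text</body></html>'


def strip_head(pattern, html):
    """Replace whatever the pattern matches with an empty head section."""
    return pattern.sub(lambda match: '<head></head>', html)


unchanged = strip_head(as_posted, WITH_ATTRS) == WITH_ATTRS
loose_out = strip_head(loose, WITH_ATTRS)
tight_out = strip_head(tight, WITH_ATTRS)

print(unchanged)
print(loose_out)
print(tight_out)
```

The literal '<head>' pattern never matches here, so the page passes through untouched, while the other two empty the head section.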
Old 11-01-2011, 03:33 PM   #5
scissors
Hi Starson.

Thanks for the reply. I thought that was the case - but as the attached image shows, it doesn't always work.

Any idea why it ends up in the navbar and not in the article?
Old 11-01-2011, 03:55 PM   #6
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Are you sure that Google Reader is not breaking that section of text out of the header as it has a formatting element in it? I have no idea how the reader aggregation works - but I have a feeling it might be doing something funny there.

It seems that the <br/> is problematic, the regex run on the sun site itself works just fine.
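One plausible way a <br/> inside a quoted attribute could leak out is sketched below. It assumes some step in the chain ends a tag at the first '>' it sees and ignores the quotes. Whether Google Reader or BeautifulSoup actually behaves like this is not confirmed, and NAIVE_TAG is a made-up pattern for illustration only.

```python
import re

META = '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>'

# Naive tag pattern: a tag runs from '<' to the first '>', quotes ignored.
NAIVE_TAG = re.compile(r'<[^>]*>')

tags = NAIVE_TAG.findall(META)      # what the naive pattern treats as tags
leftover = NAIVE_TAG.sub('', META)  # what would be left behind as plain text

print(tags)
print(leftover)
```

Under that assumption the meta tag is cut short at the <br/>, and the rest of the attribute value is left over as loose text, which would look much like the stray text in the screenshot.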
Old 11-02-2011, 10:53 AM   #7
Starson17
Quote:
Originally Posted by scissors View Post
Hi Starson.

Thanks for the reply. I thought that was the case - but as the attached image shows, it doesn't always work.
I was responding to your question about whether it should "totally remove a downloaded pages <head> section".
Quote:
Any idea why it ends up in the navbar and not in the article?
I agree with Serpentine - it's probably the use of the formatting element in what's supposed to be quoted text in the meta tag. Something is getting confused as to where the tags start/stop - probably BeautifulSoup. I would expect preprocess_regexps to be able to handle it, but I can't be sure.

I'd definitely print the soup before and after preprocess_regexps to see what's coming in and whether it's getting processed correctly.
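A rough sketch of that kind of before/after check, done outside calibre on a saved copy of the html, might look like this. debug_preprocess is a made-up helper and the page string is invented. Where the raw html is captured inside a real recipe is left open here.

```python
import re


def debug_preprocess(html, regexps):
    """Apply (pattern, function) pairs in order and report what each one did."""
    counts = []
    print('before: %d chars, og:title present: %s' % (len(html), 'og:title' in html))
    for pattern, func in regexps:
        html, count = pattern.subn(func, html)
        counts.append(count)
        print('%s matched %d time(s)' % (pattern.pattern, count))
    print('after: %d chars, og:title present: %s' % (len(html), 'og:title' in html))
    return html, counts


# Invented page with an attribute on <head>, so the literal '<head>' pattern misses.
page = ('<html><head lang="en"><meta name="og:title" '
        'content="Vincent Tabak: <br/>Hooked on hookers"/></head>'
        '<body><p>Article</p></body></html>')

regexps = [(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL),
            lambda match: '<head></head>')]

cleaned, counts = debug_preprocess(page, regexps)
```

A match count of zero shows straight away that the pattern never fired on that page, which separates "the regex did not match" from "something later put the text back".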
Old 11-02-2011, 02:54 PM   #8
scissors
Quote:
Originally Posted by Serpentine View Post
Are you sure that Google Reader is not breaking that section of text out of the header as it has a formatting element in it? I have no idea how the reader aggregation works - but I have a feeling it might be doing something funny there.

It seems that the <br/> is problematic, the regex run on the sun site itself works just fine.
Yeah, I figured it's the <br/> myself. The reason it goes via Google is that I tried a standard feed recipe and, for some reason, those often fail.

@starson when you say print the soup - I assume the soup is the input html and the created intermediate xhtml. I kinda get the idea of BeautifulSoup, but the whole syntax etc. is beyond me. I study your examples, but even the way variables are declared is hard for a dummy like myself to get to grips with.

(plus the Mrs moans about the amount of time I'm spending on it)
Old 11-02-2011, 03:56 PM   #9
Starson17
Quote:
Originally Posted by scissors View Post
@starson when you say print the soup - I assume the soup is the input html and the created intermediate xhtml.
Yes, it's the input html. "Soup" is recipe shorthand for html being processed in a recipe by BeautifulSoup. The html processing methods (pre and post) are passed a parameter named "soup" which is the html they receive to be processed. You'll see it in most recipes. It's also used behind the scenes in all the tag handling methods.

http://www.crummy.com/software/Beaut...mentation.html