#1 |
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
|
"preprocess_regexps = [(re.compile..." bugged?
Hi Chaps.
Can someone confirm that
Code:
preprocess_regexps = [(re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
OR
Code:
preprocess_regexps = [(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
should totally remove a downloaded page's <head> section?
The reason I ask is this. I download the UK tabloid "The Sun", which on the whole works okay. However, every so many articles have corruption above the navbar (I've attached an image showing it). I've checked the source of that text and it's in the line
<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>
BUT this line is contained within the <head></head> section, so surely it should have been removed? Also, why does it appear above the navbar and not in the article? (Is this a bug?)
The particular page in the example is http://www.thesun.co.uk/sol/homepage...n-hookers.html
Help would really be appreciated.
#2 |
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
|
I managed to remove the "junk" with a simple
remove_tags = [dict(name='head')]
I'd really like to know why the regexp command failed though...
#3 |
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
|
Correction. With remove_tags applied to <head>, calibre added a class at the end of each article web page, which meant some CSS was being displayed at the end of the article as though it were article text. This happened only on my prs300, not in calibre's viewer and not in Firefox.
So I removed the remove_tags entry for <head>, and the garbage reappears in the top navbar when viewed in calibre, but the article now displays okay on the prs300. It truly is a black art... The current version of the code is the one that displays correctly on the Sony. The bit of text appearing in the top navbar is only a minor niggle, but perhaps Kovid, Starson et al. could offer a possible reason. I've been on it all day and can't figure out why.
#4 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
preprocess_regexps = [(re.compile(r'<head.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
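For anyone who wants to check a pattern like this outside calibre, here is a minimal standalone sketch. The sample HTML is made up for illustration (only the meta line comes from this thread), and `strip_head` is a hypothetical helper, not part of calibre. It assumes each `(pattern, replacement)` pair is simply applied with `pattern.sub()`. Note that `<head` without the closing `>` also matches an opening tag that carries attributes, but it would match `<header` too.

```python
import re

# Made-up sample page; only the meta line is taken from the thread.
SAMPLE_HTML = (
    '<html><HEAD>\n'
    '<title>Example</title>\n'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>\n'
    '</HEAD>\n'
    '<body><p>Article text</p></body></html>'
)

# Same shape as the recipe attribute: a list of (compiled pattern, callable).
# The non-greedy .*? stops at the first </head> rather than the last one.
preprocess_regexps = [
    (re.compile(r'<head.*?</head>', re.IGNORECASE | re.DOTALL),
     lambda match: '<head></head>'),
]


def strip_head(html):
    """Hypothetical helper: apply each pair in order with pattern.sub()."""
    for pattern, replacement in preprocess_regexps:
        html = pattern.sub(replacement, html)
    return html


cleaned = strip_head(SAMPLE_HTML)
print(cleaned)
```

On this sample the meta line disappears along with the rest of the head, so if the text still shows up in a real download, the HTML reaching the regex probably differs from what the browser shows.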
#5 |
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
|
Hi Starson.
Thanks for the reply. I thought that was the case, but as the attached image shows, it doesn't always work. Any idea why it ends up in the navbar and not in the article?
#6 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
Are you sure that Google Reader is not breaking that section of text out of the header because it has a formatting element in it? I have no idea how the reader aggregation works, but I have a feeling it might be doing something funny there.
It seems that the <br/> is the problem; the regex run on the Sun site itself works just fine.
#7 | ||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Quote:
I'd definitely print the soup before and after preprocess_regexps to see what's coming in and whether it's getting processed correctly.
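One rough way to do that kind of before/after check without touching the recipe is to run the patterns over a saved copy of the page. The sketch below is standalone Python, not calibre code: `apply_with_trace` is a hypothetical helper, the sample page is made up around the meta line from this thread, and it assumes the pairs are applied in order with `pattern.sub()`.

```python
import re


def apply_with_trace(html, regexps, marker):
    """Apply each (pattern, replacement) pair in order.

    Returns the final HTML plus a list of (stage, marker_present) tuples,
    so you can see at which step the marker text disappears (or doesn't).
    """
    trace = [('before', marker in html)]
    for pattern, replacement in regexps:
        html = pattern.sub(replacement, html)
        trace.append((pattern.pattern, marker in html))
    return html, trace


# Made-up sample; in practice you would read the saved page source instead.
page = (
    '<html><head>\n'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>\n'
    '</head><body><p>Article text</p></body></html>'
)

regexps = [
    (re.compile(r'<head.*</head>', re.IGNORECASE | re.DOTALL),
     lambda match: '<head></head>'),
]

result, trace = apply_with_trace(page, regexps, marker='og:title')
for stage, present in trace:
    print(stage, '-> marker present:', present)
```

If the marker is already missing from the "before" stage, or survives every stage, that narrows down whether the regex or the incoming HTML is at fault.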
#8 | |
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
|
Quote:
@Starson, when you say print the soup, I assume the soup is the input HTML and the intermediate XHTML that gets created. I kind of get the idea of BeautifulSoup, but the whole syntax is beyond me. I study your examples, but even the way variables are declared is hard for a dummy like myself to get to grips with. (Plus the Mrs moans about the amount of time I'm spending on it.)
#9 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
http://www.crummy.com/software/Beaut...mentation.html