| 
			
			 | 
		#1 | 
| 
			
			
			
			 Addict 
			
			Posts: 241 
				Karma: 1001369 
				Join Date: Sep 2010 
				
				
				
				Device: prs300, kindle keyboard 3g 
				
				
				 | 
	
	
	
		
		
			
			 
				
				"preprocess_regexps = [(re.compile..." bugged?
			 
			
			
Hi chaps.

Can someone confirm that either

 Code:
	preprocess_regexps = [(re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

or

 Code:
	preprocess_regexps = [(re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]

should completely remove a downloaded page's <head> section?

The reason I ask is this. I download the UK tabloid "The Sun", which on the whole works okay. Except that every so many articles have corruption above the navbar (I've attached an image showing it). I've checked the source of that text and it comes from the line

 Code:
	<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>

BUT this line is contained within the <head></head> section, so surely it should have been removed? Also, why does it appear above the navbar and not in the article? (Is this a bug?)

The particular page in the example is http://www.thesun.co.uk/sol/homepage...n-hookers.html

Help would really be appreciated.  | 
			
			 | 
		#2 | 
| 
			
			
			
			 Addict 
			
			Posts: 241 
				Karma: 1001369 
				Join Date: Sep 2010 
				
				
				
				Device: prs300, kindle keyboard 3g 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
I managed to remove the "junk" with a simple

 Code:
	remove_tags = [dict(name='head')]

I'd really like to know why the regexp command failed, though...  | 
			
			 | 
		#3 | 
| 
			
			
			
			 Addict 
			
			Posts: 241 
				Karma: 1001369 
				Join Date: Sep 2010 
				
				
				
				Device: prs300, kindle keyboard 3g 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
Correction: with the <head> remove tag in place, calibre added a class at the end of each article page, which meant some CSS code was displayed at the end of the article as though it were article text. This happened only on my prs300, not in calibre's viewer and not in Firefox.

So I removed the remove_tags entry for <head>, and the garbage reappears in the top navbar when viewed in calibre, but the articles now display okay on the prs300. It truly is a black art... The current version of the code is the one that displays correctly on the Sony. The bit of text appearing in the top navbar is only a minor niggle, but if Kovid, Starson et al. could offer a possible reason, I'd appreciate it. I've been on it all day and can't figure out why.  | 
			
			 | 
		#4 | |
| 
			
			
			
			 Wizard 
			
			Posts: 4,004 
				Karma: 177841 
				Join Date: Dec 2009 
				
				
				
				Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 Code: 
	preprocess_regexps = [(re.compile(r'<head.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]  | 
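For anyone comparing the three patterns posted so far, here is a minimal sketch that runs them outside calibre with plain `re`. The two sample pages are made up for illustration; in particular the `profile` attribute on `<head>` is an assumption, and nothing in this thread confirms that The Sun's pages actually carry attributes on that tag. It only shows that a pattern starting with `<head>` cannot match an opening tag that has attributes, while `<head.*` can.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# The three patterns posted in this thread.
PATTERNS = {
    'greedy': re.compile(r'<head>.*</head>', FLAGS),
    'non_greedy': re.compile(r'<head>.*?</head>', FLAGS),
    'open_tag': re.compile(r'<head.*</head>', FLAGS),
}


def strip_head(raw, pattern):
    """Replace whatever the pattern matches with an empty head element."""
    return pattern.sub(lambda match: '<head></head>', raw)


def head_removed(raw):
    """For each pattern, report whether the og:title line is gone afterwards."""
    return {name: 'og:title' not in strip_head(raw, pattern)
            for name, pattern in PATTERNS.items()}


# Hypothetical sample page containing the meta line quoted in post #1.
PLAIN_PAGE = (
    '<html><head>\n'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>\n'
    '</head><body><p>Article text</p></body></html>'
)
# Same page, but with an (assumed) attribute on the opening head tag.
ATTR_PAGE = PLAIN_PAGE.replace('<head>', '<head profile="example">')

plain_result = head_removed(PLAIN_PAGE)
attr_result = head_removed(ATTR_PAGE)
print(plain_result)
print(attr_result)
```

On the plain page all three patterns strip the meta line, `<br/>` included. On the page with an attribute only the `<head.*` form does. Whether that is what happens with the real pages would still need checking against the actual downloaded source.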
			
			 | 
		#5 | 
| 
			
			
			
			 Addict 
			
			Posts: 241 
				Karma: 1001369 
				Join Date: Sep 2010 
				
				
				
				Device: prs300, kindle keyboard 3g 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
Hi Starson.

Thanks for the reply. I thought that was the case, but as the attached image shows, it doesn't always work. Any idea why it ends up in the navbar and not in the article?  | 
			
			 | 
		#6 | 
| 
			
			
			
			 Evangelist 
			
			Posts: 416 
				Karma: 1045911 
				Join Date: Sep 2011 
				Location: Cape Town, South Africa 
				
				
				Device: Kindle 3 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
Are you sure that Google Reader is not breaking that section of text out of the header because it has a formatting element in it? I have no idea how the Reader aggregation works, but I have a feeling it might be doing something funny there.

It seems that the <br/> is problematic; the regex run against The Sun's site itself works just fine.  | 
			
			 | 
		#7 | ||
| 
			
			
			
			 Wizard 
			
			Posts: 4,004 
				Karma: 177841 
				Join Date: Dec 2009 
				
				
				
				Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 Quote: 
	
 I'd definitely print the soup before and after preprocess_regexps to see what's coming in and whether it's getting processed correctly.  | 
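A rough sketch of the "before and after" check, done outside calibre so it can be run on a saved copy of the page. The helper names `apply_regexps` and `report` are hypothetical, and the loop is only an assumption about how a list of (pattern, function) pairs gets applied in order; the sample HTML is invented around the meta line from post #1.

```python
import re

# Same shape as a recipe's preprocess_regexps list: (compiled pattern, function).
preprocess_regexps = [
    (re.compile(r'<head.*</head>', re.IGNORECASE | re.DOTALL),
     lambda match: '<head></head>'),
]


def apply_regexps(raw, rules):
    """Apply each (pattern, function) pair in order (assumed behaviour)."""
    for pattern, func in rules:
        raw = pattern.sub(func, raw)
    return raw


def report(label, raw, needle):
    """Print whether needle occurs in raw; return its offset, or -1 if absent."""
    pos = raw.find(needle)
    state = 'absent' if pos == -1 else 'present at offset %d' % pos
    print('%s: %d chars, %r is %s' % (label, len(raw), needle, state))
    return pos


# Hypothetical stand-in for the downloaded page source.
raw_before = (
    '<html><head>\n'
    '<meta name="og:title" content="Vincent Tabak: <br/>Hooked on hookers"/>\n'
    '</head><body><p>Article text</p></body></html>'
)
raw_after = apply_regexps(raw_before, preprocess_regexps)

before_pos = report('before', raw_before, 'og:title')
after_pos = report('after', raw_after, 'og:title')
```

Inside a recipe, a similar print could go in the `preprocess_html(self, soup)` hook. As far as I know that hook runs after the regexps have been applied, so it would show only the "after" state; treat that ordering as an assumption to verify.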
			
			 | 
		#8 | |
| 
			
			
			
			 Addict 
			
			Posts: 241 
				Karma: 1001369 
				Join Date: Sep 2010 
				
				
				
				Device: prs300, kindle keyboard 3g 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 @Starson, when you say "print the soup", I assume the soup is the input HTML and the intermediate XHTML that gets created. I kinda get the idea of BeautifulSoup, but the whole syntax is beyond me. I study your examples, but even the way variables are declared is hard for a dummy like me to get to grips with. (Plus the Mrs moans about the amount of time I'm spending on it.)  | 
			
			 | 
		#9 | |
| 
			
			
			
			 Wizard 
			
			Posts: 4,004 
				Karma: 177841 
				Join Date: Dec 2009 
				
				
				
				Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 http://www.crummy.com/software/Beaut...mentation.html  | 