| 
			
			 | 
		#1 | |
| 
			
			
			
			 Enthusiast 
			
			Posts: 25 
				Karma: 1896 
				Join Date: Aug 2011 
				
				
				
				Device: Kindle 3 
				
				
				 | 
	
	
	
		
		
			
			 
				
				Bad DOCTYPE declaration causes BS to crash
			 
			
			
			After some investigation, I discovered that this DOCTYPE declaration is causing my recipe to fail: 
		
	
		
		
		
		
		
		
		
		
		
		
	
	Code: 
	<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

	So far, I've tried to solve the matter with this:

	Code: 
	preprocess_regexps = [
	    (re.compile(r'<!DOCTYPE html .*strict.dtd">', re.DOTALL|re.IGNORECASE),
	     lambda match: '<!DOCTYPE html>'),
	]

	Code: 
	    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
             k = self.rawdata.find(']]>', i)
             if k == -1:
                 k = len(self.rawdata)
             data = self.rawdata[i+9:k]
             j = k+3
             self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                # Could not parse the DOCTYPE declaration
                # Try to just skip the actual declaration
                match = re.search(r'<!DOCTYPE([^>]*?)>', self.rawdata,
                re.MULTILINE)
                if match:
                    toHandle = self.rawdata[i:match.end()]
                else:
                    toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j
	
  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | 
| 
			
			
			
			 creator of calibre 
			
			Posts: 45,609 
				Karma: 28549044 
				Join Date: Oct 2006 
				Location: Mumbai, India 
				
				
				Device: Various 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Try this 
		
	
		
		
		
		
		
		
		
		
		
		
	
	preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), '')]

	and note that you can also define preprocess_raw_html() in your recipe to remove the doctype programmatically if you have trouble with regexps.  | 
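A minimal sketch of the preprocess_raw_html() route, in case the regexp list gives trouble. The method name and its (raw_html, url) arguments are calibre's; `ExampleRecipe` and `strip_doctype` are hypothetical names, and the class is a plain stand-in so the snippet runs without calibre. In a real recipe the method would go on your BasicNewsRecipe subclass.

```python
import re

# Matches any DOCTYPE declaration, well-formed or not, up to the first '>'.
DOCTYPE_RE = re.compile(r'<!DOCTYPE[^>]+>', re.IGNORECASE)


def strip_doctype(raw_html):
    """Return raw_html with every DOCTYPE declaration removed."""
    return DOCTYPE_RE.sub('', raw_html)


class ExampleRecipe(object):
    """Stand-in for a recipe class (hypothetical, not part of calibre)."""

    def preprocess_raw_html(self, raw_html, url):
        # Called with the page source before it is parsed, so the
        # broken declaration never reaches the HTML parser.
        return strip_doctype(raw_html)
```

Note that this hook, like preprocess_regexps, only sees the pages the recipe downloads as articles.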
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | ||
| 
			
			
			
			 Enthusiast 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Thanks, Kovid! (for everything). 
		
	
		
		
		
		
		
		
		
		
		
		
	
	Maybe I should clarify that until today I had zero experience with recipes, and only know something about HTML and JavaScript. But I managed to make the recipe work with a local file by manually removing the DOCTYPE declaration from the index file.

	BTW, here's the recipe: Spoiler: 

	If you could tell me where and how to try your suggestions... Meanwhile, I wrote to the newspaper's webmaster about the mistake. No answer as of today. ;-(  | 
||
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#4 | 
| 
			
			
			
			 creator of calibre 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Just stick the regexp in your recipe as 
		
	
		
		
		
		
		
		
		
		
		
		
	
	Code: 
	preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]  | 
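For someone new to recipes, this is roughly where the line goes: as a class attribute of the recipe, next to the title. A sketch only; `ExampleNewspaper` and its title are placeholders, and the ImportError fallback exists just so the snippet can be run outside calibre.

```python
import re

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be imported outside calibre.
    BasicNewsRecipe = object


class ExampleNewspaper(BasicNewsRecipe):  # hypothetical recipe name
    title = 'Example Newspaper'  # placeholder title

    # Each rule is (compiled pattern, callable taking the match object).
    # This one replaces any DOCTYPE declaration with an empty string.
    preprocess_regexps = [
        (re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m: ''),
    ]
```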
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#5 | 
| 
			
			
			
			 Enthusiast 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Didn't work. Does "downloaded HTML" include the index file? 'Cause that's the one causing the problem, in fact.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#6 | 
| 
			
			
			
			 creator of calibre 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			It includes all downloaded html.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#7 | 
| 
			
			
			
			 creator of calibre 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Sorry, if you mean the index page, as in the page used in parse_index, then no, it doesn't apply. In that case you have to do it manually. 
		
	
		
		
		
		
		
		
		
		
		
		
	
	Code: 
	raw = self.index_to_soup(index_url, raw=True)
	raw = re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)
	soup = self.index_to_soup(raw)  | 
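The three lines above could be wrapped in a small helper and called from parse_index. This is a sketch under assumptions: `index_soup_without_doctype` is a hypothetical name, and the bytes check is there because index_to_soup(..., raw=True) may hand back bytes rather than text on some calibre versions.

```python
import re

DOCTYPE_RE = re.compile(r'<!DOCTYPE[^>]+>', re.IGNORECASE)


def index_soup_without_doctype(recipe, index_url, encoding='utf-8'):
    """Fetch the index page, drop its DOCTYPE, then parse what is left."""
    # Step 1: get the unparsed page source instead of a soup.
    raw = recipe.index_to_soup(index_url, raw=True)
    if isinstance(raw, bytes):
        # Assumption: decode so the text pattern below can be applied.
        raw = raw.decode(encoding, 'replace')
    # Step 2: remove the declaration that trips up the parser.
    raw = DOCTYPE_RE.sub('', raw)
    # Step 3: hand the cleaned markup back for parsing.
    return recipe.index_to_soup(raw)
```

Inside parse_index you would then write `soup = index_soup_without_doctype(self, index_url)`.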
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#8 | 
| 
			
			
			
			 Enthusiast 
			
				
				
				 | 
	
	
	
		
		
		
		
		SOLVED! I'll post the new recipe in the appropriate section.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 