|  09-03-2011, 05:56 PM | #1 | |
| Enthusiast            Posts: 25 Karma: 1896 Join Date: Aug 2011 Device: Kindle 3 | 
				
				Bad DOCTYPE declaration causes BS to crash
			 
			
			After some investigation, I discovered that this DOCTYPE declaration is causing my recipe to fail:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
So far, I've tried to solve the matter with this:
Code:
preprocess_regexps = [
    (re.compile(r'<!DOCTYPE html .*strict.dtd">', re.DOTALL|re.IGNORECASE), lambda match: '<!DOCTYPE html>'),
]
Code:
    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
             k = self.rawdata.find(']]>', i)
             if k == -1:
                 k = len(self.rawdata)
             data = self.rawdata[i+9:k]
             j = k+3
             self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                # Could not parse the DOCTYPE declaration
                # Try to just skip the actual declaration
                match = re.search(r'<!DOCTYPE([^>]*?)>', self.rawdata,
                re.MULTILINE)
                if match:
                    toHandle = self.rawdata[i:match.end()]
                else:
                    toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j
 | |
|   |   | 
|  09-03-2011, 07:44 PM | #2 | 
| creator of calibre            Posts: 45,604 Karma: 28548974 Join Date: Oct 2006 Location: Mumbai, India Device: Various | 
			
			Try this preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), '')] and note that you can also define preprocess_raw_html() in your recipe to remove the doctype programmatically if you have trouble with regexps. | 
|   |   | 
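The preprocess_raw_html() route mentioned above could look roughly like the sketch below. It is only an illustration: RecipeSketch is a hypothetical stand-in class (a real recipe would subclass calibre's BasicNewsRecipe), strip_doctype is a made-up helper name, and the example URL is a placeholder. The method signature (raw_html, url) is assumed to match what calibre passes to the hook.

```python
import re

# Matches any DOCTYPE declaration up to the first '>', case-insensitively.
DOCTYPE_RE = re.compile(r'<!DOCTYPE[^>]+>', re.IGNORECASE)


def strip_doctype(raw_html):
    """Return raw_html with every DOCTYPE declaration removed."""
    return DOCTYPE_RE.sub('', raw_html)


class RecipeSketch(object):
    # Hypothetical stand-in for a calibre recipe class. In a real recipe
    # this method would be defined on the BasicNewsRecipe subclass.
    def preprocess_raw_html(self, raw_html, url):
        return strip_doctype(raw_html)


# The malformed declaration from the first post (note the missing quote).
broken = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN '
          '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
          '<html><body>ok</body></html>')
cleaned = RecipeSketch().preprocess_raw_html(broken, 'http://example.com/')
```

Because the pattern stops at the first '>', it does not depend on the quotes inside the declaration being balanced, which is what makes it usable on this particular page.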
|  09-04-2011, 02:26 AM | #3 | ||
| Enthusiast            Posts: 25 Karma: 1896 Join Date: Aug 2011 Device: Kindle 3 | 
			
			Thanks, Kovid! (for everything). Maybe I should clarify that until today I had zero experience with recipes, and only know something about HTML and JavaScript. But I managed to make the recipe work with a local file by manually removing the DOCTYPE declaration from the index file. BTW, here's the recipe: Spoiler: 
 If you could tell me where and how to try your suggestions... Meanwhile, I wrote to the newspaper's webmaster about the mistake. No answer as of today. ;-( | ||
|   |   | 
|  09-04-2011, 02:30 AM | #4 | 
| creator of calibre            Posts: 45,604 Karma: 28548974 Join Date: Oct 2006 Location: Mumbai, India Device: Various | 
			
			Just stick the regexp in your recipe as Code: preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')] | 
|   |   | 
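For readers unsure what the (pattern, callable) pair does, here is a small standalone sketch. The apply_rules helper is hypothetical and only approximates how such rules might be run over downloaded HTML; calibre's own processing is not reproduced here. The sample string is a shortened, made-up version of the broken declaration.

```python
import re

# Same rule as in the post above: a compiled pattern paired with a callable
# that receives the match object and returns the replacement text.
preprocess_regexps = [
    (re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m: ''),
]


def apply_rules(html, rules):
    # Hypothetical helper: run each (pattern, callable) pair in order.
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


sample = ('<!doctype html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN "x.dtd">'
          '\n<html></html>')
result = apply_rules(sample, preprocess_regexps)
```

The rule replaces the whole declaration with an empty string, so nothing malformed is left for the parser to trip over.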
|  09-04-2011, 02:30 PM | #5 | 
| Enthusiast            Posts: 25 Karma: 1896 Join Date: Aug 2011 Device: Kindle 3 | 
			
			Didn't work. Does "Downloaded HTML" include the index file? 'Cause that's the one causing the problem, in fact.
		 | 
|   |   | 
|  09-04-2011, 02:44 PM | #6 | 
| creator of calibre            Posts: 45,604 Karma: 28548974 Join Date: Oct 2006 Location: Mumbai, India Device: Various | 
			
			It includes all downloaded html.
		 | 
|   |   | 
|  09-04-2011, 03:01 PM | #7 | 
| creator of calibre            Posts: 45,604 Karma: 28548974 Join Date: Oct 2006 Location: Mumbai, India Device: Various | 
			
			Sorry, if you mean the index page as in the page used in parse_index, then no, it doesn't apply. In that case you have to do it manually.
Code:
raw = self.index_to_soup(index_url, raw=True)
raw = re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)
soup = self.index_to_soup(raw)
		 | 
|   |   | 
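One caveat about the manual approach above, offered as an assumption rather than documented behaviour: depending on the calibre and Python version, the raw index markup may come back as bytes rather than text, in which case a text pattern would raise a TypeError. The sketch below uses a hypothetical helper, strip_index_doctype, that accepts either type.

```python
import re

# One pattern per input type; both match a DOCTYPE declaration up to the
# first '>' and ignore case.
_DOCTYPE_TEXT = re.compile(r'(?i)<!DOCTYPE[^>]+>')
_DOCTYPE_BYTES = re.compile(br'(?i)<!DOCTYPE[^>]+>')


def strip_index_doctype(raw):
    """Remove DOCTYPE declarations from raw index markup (text or bytes)."""
    if isinstance(raw, bytes):
        return _DOCTYPE_BYTES.sub(b'', raw)
    return _DOCTYPE_TEXT.sub('', raw)


# Inside parse_index of a calibre recipe, following the post above:
#     raw = self.index_to_soup(index_url, raw=True)
#     soup = self.index_to_soup(strip_index_doctype(raw))
```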
|  09-04-2011, 03:41 PM | #8 | 
| Enthusiast            Posts: 25 Karma: 1896 Join Date: Aug 2011 Device: Kindle 3 |  SOLVED! I'll post the new recipe in the appropriate section. | 
|   |   | 