#1
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Bad DOCTYPE declaration causes BS to crash
After some investigation, I discovered that this DOCTYPE declaration is causing my recipe to fail:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
So far, I've tried to solve the matter with this:
Code:
preprocess_regexps = [
    (re.compile(r'<!DOCTYPE html .*strict.dtd">', re.DOTALL|re.IGNORECASE),
     lambda match: '<!DOCTYPE html>'),
]
Code:
def parse_declaration(self, i):
    """Treat a bogus SGML declaration as raw data. Treat a CDATA declaration as a CData object."""
    j = None
    if self.rawdata[i:i+9] == '<![CDATA[':
        k = self.rawdata.find(']]>', i)
        if k == -1:
            k = len(self.rawdata)
        data = self.rawdata[i+9:k]
        j = k+3
        self._toStringSubclass(data, CData)
    else:
        try:
            j = SGMLParser.parse_declaration(self, i)
        except SGMLParseError:
            # Could not parse the DOCTYPE declaration
            # Try to just skip the actual declaration
            match = re.search(r'<!DOCTYPE([^>]*?)>', self.rawdata, re.MULTILINE)
            if match:
                toHandle = self.rawdata[i:match.end()]
            else:
                toHandle = self.rawdata[i:]
            self.handle_data(toHandle)
            j = i + len(toHandle)
    return j
#2
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Try this:
Code:
preprocess_regexps = [(re.compile(r'<!DOCTYPE[^>]+>', re.I), '')]
Note that you can also define preprocess_raw_html() in your recipe to remove the doctype programmatically if you have trouble with regexps.
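As a rough sketch of the programmatic route (not code from this thread): the same pattern can be wrapped in a plain function and tried against the malformed declaration from the first post. The helper name `strip_doctype` is hypothetical. In a recipe, its body could live inside `preprocess_raw_html(self, raw_html, url)`, assuming the downloaded HTML arrives there as a text string.

```python
import re

# Same pattern as the suggested regexp: '<!DOCTYPE', then everything up to the first '>'.
DOCTYPE_RE = re.compile(r'<!DOCTYPE[^>]+>', re.I)


def strip_doctype(raw_html):
    """Return raw_html with any DOCTYPE declaration removed (hypothetical helper)."""
    return DOCTYPE_RE.sub('', raw_html)


# The malformed declaration from the first post (note the missing quote after "EN").
bad = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN '
       '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')
sample = bad + '\n<html><body><p>Hello</p></body></html>'
cleaned = strip_doctype(sample)
```

Because the pattern only looks for the closing `>`, it does not care that the quotes inside the declaration are unbalanced.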
#3
Enthusiast
Thanks, Kovid! (for everything).
Maybe I should clarify that until today I had zero experience with recipes, and I only know a little about HTML and JavaScript. But I managed to make the recipe work with a local file by manually removing the DOCTYPE declaration from the index file. BTW, here's the recipe:
Spoiler:
If you could tell me where and how to try your suggestions... Meanwhile, I wrote to the newspaper's webmaster about the mistake. No answer as of today. ;-(
#4
creator of calibre
Just stick the regexp in your recipe as
Code:
preprocess_regexps = [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m: '')]
#5
Enthusiast
Didn't work. Does "downloaded HTML" include the index file? Because that's the one causing the problem, in fact.
#6
creator of calibre
It includes all downloaded html.
#7
creator of calibre
Sorry, if you mean the index page as in the page used in parse_index, then no, it doesn't apply. In that case you have to do it manually.
Code:
raw = self.index_to_soup(index_url, raw=True)
raw = re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)
soup = self.index_to_soup(raw)
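For anyone who wants to try the middle step outside calibre, here is a minimal standalone sketch. The helper name `clean_index_markup` is hypothetical, and the bytes handling is an assumption: whether a raw fetch hands back text or bytes may depend on the calibre version, which this thread does not state.

```python
import re


def clean_index_markup(raw):
    """Strip DOCTYPE declarations from index-page markup (hypothetical helper).

    Accepts str or bytes. Supporting bytes is an assumption made for this
    sketch, not something confirmed in the thread.
    """
    if isinstance(raw, bytes):
        # Bytes input needs a bytes pattern and a bytes replacement.
        return re.sub(br'(?i)<!DOCTYPE[^>]+>', b'', raw)
    return re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)


text_result = clean_index_markup('<!doctype html>\n<html></html>')
bytes_result = clean_index_markup(b'<!DOCTYPE html><html></html>')
```

Inside parse_index, a helper like this would sit between the two index_to_soup calls, in place of the bare re.sub line.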
#8
Enthusiast
SOLVED! I'll post the new recipe in the appropriate section.