#1
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Bad DOCTYPE declaration causes BS to crash
After some investigation, I discovered that this DOCTYPE declaration is causing my recipe to fail:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
So far, I've tried to solve the matter with this:
Code:
preprocess_regexps = [
    (re.compile(r'<!DOCTYPE html .*strict.dtd">', re.DOTALL|re.IGNORECASE),
     lambda match: '<!DOCTYPE html>'),
]
Code:
def parse_declaration(self, i):
    """Treat a bogus SGML declaration as raw data. Treat a CDATA
    declaration as a CData object."""
    j = None
    if self.rawdata[i:i+9] == '<![CDATA[':
        k = self.rawdata.find(']]>', i)
        if k == -1:
            k = len(self.rawdata)
        data = self.rawdata[i+9:k]
        j = k+3
        self._toStringSubclass(data, CData)
    else:
        try:
            j = SGMLParser.parse_declaration(self, i)
        except SGMLParseError:
            # Could not parse the DOCTYPE declaration
            # Try to just skip the actual declaration
            match = re.search(r'<!DOCTYPE([^>]*?)>', self.rawdata,
                              re.MULTILINE)
            if match:
                toHandle = self.rawdata[i:match.end()]
            else:
                toHandle = self.rawdata[i:]
            self.handle_data(toHandle)
            j = i + len(toHandle)
    return j
#2
creator of calibre
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Try this:
preprocess_regexps = [(re.compile(r'<!DOCTYPE[^>]+>', re.I), '')]
Also note that you can define preprocess_raw_html() in your recipe to remove the doctype programmatically if you have trouble with regexps.
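A minimal sketch of the preprocess_raw_html() route, for anyone who prefers it to the regexp list. The class here is a hypothetical stand-in so the snippet runs without calibre; in a real recipe the method would go on your BasicNewsRecipe subclass, and the (raw_html, url) arguments are assumed from calibre's recipe API.

```python
import re

# [^>]+ stops at the first '>', so the unbalanced quote inside the
# declaration does not matter to the pattern.
DOCTYPE_RE = re.compile(r'<!DOCTYPE[^>]+>', re.IGNORECASE)


class DoctypeStrippingRecipe(object):
    """Hypothetical stand-in for a BasicNewsRecipe subclass."""

    def preprocess_raw_html(self, raw_html, url):
        # Receives the raw text of a downloaded page and returns the
        # cleaned text. Only the first DOCTYPE declaration is removed.
        return DOCTYPE_RE.sub('', raw_html, count=1)


recipe = DoctypeStrippingRecipe()
page = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
        '<html><body>Hi</body></html>')
cleaned = recipe.preprocess_raw_html(page, 'http://example.com/')
```

The url argument is unused in this sketch; it is there only to match the assumed method signature.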
#3
Enthusiast
Thanks, Kovid! (for everything).
Maybe I should clarify that until today I had zero experience with recipes, and only know something about HTML and JavaScript. But I managed to make the recipe work with a local file by manually removing the DOCTYPE declaration from the index file. BTW, here's the recipe:
Spoiler:
If you could tell me where and how to try your suggestions... Meanwhile, I wrote to the newspaper's webmaster about the mistake. No answer as of today. ;-(
#4
creator of calibre
Just stick the regexp in your recipe as
Code:
preprocess_regexps = [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m: '')]
#5
Enthusiast
Didn't work. Does "downloaded HTML" include the index file? Because that's the one causing the problem, in fact.
#6
creator of calibre
It includes all downloaded html.
#7
creator of calibre
Sorry, if you mean the index page, as in the page used in parse_index, then no, it doesn't apply. In that case you have to do it manually:
Code:
raw = self.index_to_soup(index_url, raw=True)
raw = re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)
soup = self.index_to_soup(raw)
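The substitution step can also be written as a self-contained sketch. strip_doctype is a hypothetical helper, not part of calibre. It assumes the raw index page may arrive as either bytes or text (which one you get may depend on the calibre and Python version), and picks a matching pattern type so the substitution does not raise a TypeError.

```python
import re

_DOCTYPE_TEXT = re.compile(r'<!DOCTYPE[^>]+>', re.IGNORECASE)
_DOCTYPE_BYTES = re.compile(br'<!DOCTYPE[^>]+>', re.IGNORECASE)


def strip_doctype(raw):
    """Remove the first DOCTYPE declaration from raw HTML (bytes or text)."""
    if isinstance(raw, bytes):
        return _DOCTYPE_BYTES.sub(b'', raw, count=1)
    return _DOCTYPE_TEXT.sub('', raw, count=1)


# Inside parse_index() this would stand in for the re.sub line, e.g.:
#   raw = self.index_to_soup(index_url, raw=True)
#   soup = self.index_to_soup(strip_doctype(raw))
sample = b'<!doctype html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN "x.dtd"><html></html>'
stripped = strip_doctype(sample)
```

Only the first declaration is removed, on the assumption that a page carries at most one DOCTYPE.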
#8
Enthusiast
SOLVED! I'll post the new recipe in the appropriate section.