After some investigation, I discovered that this DOCTYPE declaration is causing my recipe to fail:
Code:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
As you can see, the quote opened after PUBLIC is never closed: the closing quote after Strict//EN is missing.
So far, I've tried to solve the matter with this:
Code:
preprocess_regexps = [
    (re.compile(r'<!DOCTYPE html .*strict.dtd">', re.DOTALL|re.IGNORECASE),
     lambda match: '<!DOCTYPE html>'),
]
and this:
Code:
def parse_declaration(self, i):
    """Treat a bogus SGML declaration as raw data. Treat a CDATA
    declaration as a CData object."""
    j = None
    if self.rawdata[i:i+9] == '<![CDATA[':
        k = self.rawdata.find(']]>', i)
        if k == -1:
            k = len(self.rawdata)
        data = self.rawdata[i+9:k]
        j = k+3
        self._toStringSubclass(data, CData)
    else:
        try:
            j = SGMLParser.parse_declaration(self, i)
        except SGMLParseError:
            # Could not parse the DOCTYPE declaration
            # Try to just skip the actual declaration
            match = re.search(r'<!DOCTYPE([^>]*?)>', self.rawdata,
                              re.MULTILINE)
            if match:
                toHandle = self.rawdata[i:match.end()]
            else:
                toHandle = self.rawdata[i:]
            self.handle_data(toHandle)
            j = i + len(toHandle)
    return j
But the result's the same:
Quote:
Python function terminated unexpectedly
No articles found, aborting (Error Code: 1)
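In case it helps narrow things down, here is a rough standalone check of the regex attempt, run outside calibre. It is only a sketch: I'm assuming the page source has the same line breaks as the declaration pasted above, the trailing `<html>` line is made-up filler, and `LOOSER` and `strip_doctype` are names invented for this test, not anything from calibre.

```python
import re

# The declaration as pasted in this post. Whether the real page has a
# newline or a space right after "html" is an assumption.
BROKEN_DOCTYPE = (
    '<!DOCTYPE html\n'
    'PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html><head><title>t</title></head><body></body></html>'
)

# Pattern from the recipe: requires a literal space after "html".
ORIGINAL = re.compile(r'<!DOCTYPE html .*strict.dtd">',
                      re.DOTALL | re.IGNORECASE)

# Hypothetical looser variant: any whitespace after "html", a non-greedy
# body, and an escaped dot before "dtd".
LOOSER = re.compile(r'<!DOCTYPE\s+html\s.*?strict\.dtd">',
                    re.DOTALL | re.IGNORECASE)


def strip_doctype(pattern, raw):
    """Replace the first matching DOCTYPE with the short HTML5 form."""
    return pattern.sub('<!DOCTYPE html>', raw, count=1)


if __name__ == '__main__':
    # Does the recipe's pattern match the sample at all?
    print(ORIGINAL.search(BROKEN_DOCTYPE) is not None)
    # First line of the sample after applying the looser pattern.
    print(strip_doctype(LOOSER, BROKEN_DOCTYPE).splitlines()[0])
```

If the real page does have a newline right after `html`, the recipe's pattern never matches, because it expects a literal space there. That might be why the first attempt has no effect, though it would not explain why the parse_declaration override fails too.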
Any ideas?