Old 09-04-2011, 02:30 AM   #4
kovidgoyal
creator of calibre
Just stick the regexp in your recipe as

Code:
preprocess_regexps = [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m: '')]
That should strip any doctype declarations from downloaded HTML.
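
If you want to check the pattern outside calibre first, here is a rough standalone sketch. The helper apply_preprocess_regexps and the sample HTML are made up for illustration; the helper only imitates running each (pattern, function) pair over the page in order, and is not calibre's own code. In an actual recipe you would also need import re at the top of the file.

```python
import re

# Same (pattern, replacement function) pair as in the recipe attribute above.
preprocess_regexps = [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m: '')]


def apply_preprocess_regexps(html, rules):
    """Hypothetical helper: run each (compiled pattern, function) pair in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Made-up sample page with a lowercase doctype, to exercise the re.I flag.
sample = (
    '<!doctype html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html><body><p>Hi</p></body></html>'
)

cleaned = apply_preprocess_regexps(sample, preprocess_regexps)
print(cleaned)
```

Note that the pattern stops at the first >, so a doctype with an internal subset (one containing its own > characters) would only be partly removed.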