Old 10-07-2014, 11:37 PM   #8
kovidgoyal
creator of calibre
Looking at input/feed_0/article_0/index.html from your attachment, I see

Code:
<div id="hatFacebook" style="border: none;">&lt;h4&gt;WSJ on Facebook&lt;/h4&gt;&lt;div style=&quot;border: none; padding: 2px 3px;&quot; class=&quot;fb-like&quot; data-href=&quot;http://www.facebook.com/wsj&quot; data-send=&quot;false&quot; data-layout=&quot;button_count&quot; data-width=&quot;250&quot; data-show-faces=&quot;false&quot; data-action=&quot;recommend&quot;&gt;&lt;/div&gt;</div>
This is most likely because the HTML parser parsed something incorrectly, so fixing the HTML in preprocess_raw_html might do the trick. The easiest way to fix it is like this:

Code:
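def preprocess_raw_html(self, html, url):
    import html5lib
    from lxml import etree
    # Re-parse with the lenient html5lib parser; the lxml tree builder is
    # needed so that lxml can serialize the result
    root = html5lib.parse(html, treebuilder='lxml', namespaceHTMLElements=False)
    # Serialize back to a unicode string for the rest of the pipeline
    return etree.tostring(root, encoding='unicode')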
Alternatively, you can use regexps to nuke <meta>, <script> and <style> tags and comments, which are most often the cause of parse errors.
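A rough sketch of the regexp approach, in case it helps. The helper name strip_troublesome_markup and the ExampleRecipe class are made up for illustration (in a real recipe the method would live on your BasicNewsRecipe subclass), and the patterns are deliberately simple, so they can misfire on unusual markup:

```python
import re

# Hypothetical patterns for illustration; not part of calibre itself.
# Comments are stripped first so commented-out tags disappear with them.
_STRIP_PATTERNS = [
    re.compile(r'<!--.*?-->', re.DOTALL),
    re.compile(r'<script\b[^>]*>.*?</script\s*>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<style\b[^>]*>.*?</style\s*>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<meta\b[^>]*>', re.IGNORECASE),
]


def strip_troublesome_markup(html):
    """Remove comments and <script>, <style>, <meta> tags from an HTML string."""
    for pat in _STRIP_PATTERNS:
        html = pat.sub('', html)
    return html


class ExampleRecipe(object):  # stand-in for a BasicNewsRecipe subclass
    def preprocess_raw_html(self, html, url):
        # Clean the raw page text before calibre's parser sees it
        return strip_troublesome_markup(html)
```

This only removes the usual suspects; if the page is broken in some other way, re-parsing with html5lib is the more robust option.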