MobileRead Forums - View Single Post - Wall Street Journal--feedparser error?

kovidgoyal · 10-07-2014, 11:37 PM

Looking at input/feed_0/article_0/index.html from your attachment, I see

Code:

<div id="hatFacebook" style="border: none;">&lt;h4&gt;WSJ on Facebook&lt;/h4&gt;&lt;div style=&quot;border: none; padding: 2px 3px;&quot; class=&quot;fb-like&quot; data-href=&quot;http://www.facebook.com/wsj&quot; data-send=&quot;false&quot; data-layout=&quot;button_count&quot; data-width=&quot;250&quot; data-show-faces=&quot;false&quot; data-action=&quot;recommend&quot;&gt;&lt;/div&gt;</div>

This is most likely because the HTML parser parsed something incorrectly, leaving the markup inside the div escaped as text. So fixing the HTML in preprocess_raw_html might do the trick. The easiest way to fix it is like this:

Code:

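def preprocess_raw_html(self, html, url):
    import html5lib
    from lxml import etree
    # Build an lxml tree so that lxml's etree.tostring can serialize it;
    # html5lib's default treebuilder returns xml.etree elements instead
    root = html5lib.parse(html, treebuilder='lxml', namespaceHTMLElements=False)
    # 'unicode' makes lxml return text rather than encoded bytes
    return etree.tostring(root, encoding='unicode')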

Alternatively, you can use regexps to nuke <meta>, <script> and <style> tags and comments, which are most often the cause of parse errors.
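
A rough sketch of the regexp approach, in case it helps. The helper name strip_problem_markup and the exact patterns are my own illustration, not anything calibre ships, and they assume the raw HTML has already been decoded to text. Regexps like these are blunt and will miss unusual markup, so treat this as a starting point only.

```python
import re

# Illustrative patterns only (assumptions, not calibre's own code).
# Comments go first so that a commented-out <script> cannot confuse
# the later patterns.
_STRIP_PATTERNS = [
    re.compile(r'<!--.*?-->', re.DOTALL),
    re.compile(r'<script\b[^>]*>.*?</script\s*>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<style\b[^>]*>.*?</style\s*>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<meta\b[^>]*>', re.IGNORECASE),
]


def strip_problem_markup(html):
    """Remove comments, <script>, <style> and <meta> tags from an HTML string."""
    for pattern in _STRIP_PATTERNS:
        html = pattern.sub('', html)
    return html
```

In a recipe, preprocess_raw_html(self, html, url) would then just return strip_problem_markup(html).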
