Looking at input/feed_0/article_0/index.html from your attachment, I see:
Code:
<div id="hatFacebook" style="border: none;"><h4>WSJ on Facebook</h4><div style="border: none; padding: 2px 3px;" class="fb-like" data-href="http://www.facebook.com/wsj" data-send="false" data-layout="button_count" data-width="250" data-show-faces="false" data-action="recommend"></div></div>
This is most likely because the HTML parser mis-parsed something, so fixing the HTML in preprocess_raw_html might do the trick. The easiest way to fix it is like this:
Code:
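def preprocess_raw_html(self, html, url):
    import html5lib
    from lxml import etree
    # Re-parse with html5lib, which tolerates broken markup. treebuilder='lxml'
    # is needed so that lxml's etree.tostring can serialize the result.
    root = html5lib.parse(html, treebuilder='lxml', namespaceHTMLElements=False)
    # encoding=unicode returns a unicode string (Python 2; use 'unicode' on Python 3)
    return etree.tostring(root, encoding=unicode)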
Alternatively, you can use regexps to nuke <meta>, <script> and <style> tags and comments, which are most often the cause of parse errors.
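A rough sketch of the regexp approach, in case it helps. The helper name strip_problem_tags and the patterns are my own and not part of calibre; they assume reasonably conventional markup and will not cope with every malformed page, so treat this as a starting point rather than a complete fix.

```python
import re

# Hypothetical patterns for illustration; not part of calibre's API.
# Scripts and styles go first so that comment markers inside them
# do not confuse the comment pattern.
_STRIP_PATTERNS = [
    re.compile(r'<script\b.*?</script\s*>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<style\b.*?</style\s*>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<!--.*?-->', re.DOTALL),
    re.compile(r'<meta\b[^>]*>', re.IGNORECASE),
]


def strip_problem_tags(html):
    """Remove <script>, <style>, <meta> tags and comments from an HTML string."""
    for pattern in _STRIP_PATTERNS:
        html = pattern.sub('', html)
    return html


# In a recipe this could be called from the hook, for example:
#     def preprocess_raw_html(self, html, url):
#         return strip_problem_tags(html)
```

Regexps are cruder than re-parsing with html5lib, but they need no extra parser and leave the rest of the page untouched.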