#1
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Wall Street Journal--feedparser error?
Thought I'd start a new thread for this... it appears to be a different issue than the previous WSJ thread.
I started getting errors on WSJ starting Saturday. parse_index() works correctly, but when articles are parsed, each returns an error of "Initial parse failed, using more forgiving parsers", resulting in an epub with only empty articles. A quick search suggests that error message originates with feedparser... I'm guessing the solution is then to alter the downloaded html in some manner to conform to what feedparser expects, but I'm not sure how to do this. Logfile attached. Any advice would be greatly appreciated.

Thanks, Dale
#2
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If the output has empty articles it means the website's markup has changed and the recipe's keep_only_tags most likely needs to be adjusted. "Initial parse failed" comes from an html parser and is not relevant; the fallback parser is perfectly capable of parsing whatever markup WSJ throws at it.
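For anyone following along: keep_only_tags is a list of tag specs on the recipe class, and adjusting it means matching whatever the site's new markup looks like. A minimal sketch (the id value is only illustrative, borrowed from later in this thread, not a guaranteed match for WSJ's current pages):
Code:
keep_only_tags = [
    dict(name='h1'),
    dict(name='article', attrs={'id': 'article-contents'}),
]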
#3
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I tried it and it seems to be returning full articles for me; see the attached extract, which contains only the first four articles. Given the other thread on parse_index() failures, I'm guessing the WSJ is in the middle of some kind of phased rollout of changes to their website.
#4
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Thanks for the reply. Yeah, I got similar results on my run... you're right, they must be making some changes.
The only issue I had was the opinion section still not downloading. I know you put in a fix for that a few weeks ago, and the articles in question contain the <article> tag with the "article-contents" id, but for some reason it's not working. I tried a few other combinations of tags for the keep_only_tags list, but still couldn't get it to work (except by removing keep_only_tags entirely, which pulls in too much extra material). A sample parse_index() function with two articles (one that works, one that doesn't) is below. Do you get similar results?
Code:
def parse_index(self):
    feeds = []
    articles = []

    # will parse
    title1 = 'HP Article WSJ'
    desc1 = 'about Hewlett Packard'
    url1 = 'http://online.wsj.com/articles/hewlett-packard-split-comes-as-more-investors-say-big-isnt-better-1412643100'
    articles.append({'title':title1, 'url':url1, 'description':desc1, 'date':''})

    # won't parse
    title = "Stephens Article in WSJ"
    desc = 'china bubble story'
    url = 'http://online.wsj.com/articles/bret-stephens-hong-kong-pops-the-china-bubble-1412636585'
    articles.append({'title':title, 'url':url, 'description':desc, 'date':''})

    for article in articles:
        print "title:", article['title']

    section = "This Sample Section"
    feeds.append((section, articles))
    return feeds
#5
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
No, both work for me; see attached. Use the --debug-pipeline option to see exactly what HTML is being downloaded with no keep_only_tags applied. That should help you figure out why it is not working.
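For reference, a typical command line run with that option (file and directory names here are illustrative):
Code:
ebook-convert wsj.recipe wsj.epub --debug-pipeline /tmp/wsj-debug

The debug directory then contains the html at each stage of conversion, including the input/ files referenced later in this thread.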
#6
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Hmmm... still can’t get it to work. Attached is the zip file with keep_only_tags removed, plus a log file and the raw html. I’ve successfully set up the development environment (Windows) with the intent of getting some detail on keep_only_tags usage (it should be in the RecursiveFetcher class, right?), but can’t get a basic print statement to work from that class. Beyond that, I know:
(1) Even narrowing keep_only_tags to dict(name='article', id='article-contents') didn’t work.
(2) Whatever the problem is, it occurs before the ‘input’ stage.
(3) For the article that parsed correctly, the input (with keep_only_tags removed) shows a <div> tag replacing the raw html’s <article> tag, with the same attributes. For the article that didn’t parse, there’s no corresponding <div> tag.

Probably worth noting that Notepad++ recognizes the <article> tag in the raw file of the one that parsed, but not in the other. That’s about all I have been able to figure out; sorry if I'm missing something obvious... it’s hard to see what machine or implementation issues may be at work here. The logfile has a JSBrowser statement (below) that I’m not familiar with, but other than that, any advice you could give on getting some more detail on the ‘pre-input’ processing would be helpful.

JSBrowser msg(): https://a248.e.akamai.net/f/248/6767...11505143897:1: Porthole: Using built-in browser support

Thanks, Dale
#7
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can look at the raw html, before it is parsed by the RecursiveFetcher class, by implementing this in your recipe class:
Code:
def preprocess_raw_html(self, raw_html, url):
    # dump the raw downloaded html for inspection
    # ('/some/temp/file' is a placeholder path)
    open('/some/temp/file', 'wb').write(raw_html)
    return raw_html

Then you can see the html after keep_only_tags etc. have run by implementing:
Code:
def preprocess_html(self, soup):
    # dump the cleaned-up html, i.e. after keep_only_tags/remove_tags have run
    open('/some/temp/file2', 'wb').write(str(soup))
    return soup

JSBrowser comes from the use of a full WebKit browser with javascript support to do the login, which the WSJ requires. It is only used for login, nothing else.
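With both dumps in place, diffing the two files shows exactly what the keep_only_tags/remove_tags pass kept and discarded, which usually makes a selector mismatch obvious.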
#8
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Looking at input/feed_0/article_0/index.html from your attachment, I see
Code:
<div id="hatFacebook" style="border: none;"><h4>WSJ on Facebook</h4><div style="border: none; padding: 2px 3px;" class="fb-like" data-href="http://www.facebook.com/wsj" data-send="false" data-layout="button_count" data-width="250" data-show-faces="false" data-action="recommend"></div></div> Code:
def preprocess_raw_html(self, html, url):
    import html5lib
    from lxml import etree
    # build an lxml tree so lxml's serializer can be used on it
    root = html5lib.parse(html, treebuilder='lxml', namespaceHTMLElements=False)
    return etree.tostring(root, encoding=unicode)
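html5lib implements the same error-recovery algorithm browsers use, so round-tripping the page through it and re-serializing should normalize markup that stricter parsers mishandle.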
#9
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
#10
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
The last update gives me the same results... the only article that shows in the opinion section is "Corrections & Amplifications". I think the way to go is the preprocess route; it appears to me that the new format must be invalid html in some way. I'll give that a try today.
#11
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
The plot thickens... back to the two articles, hereafter 'HP_Article' and 'Opinion_Article'. I tried html5lib in preprocess: HP_Article downloaded and Opinion_Article did not (there was an error in html5lib's ihatexml.py file; not sure whether that was related).
So I tried parsing the raw data with lxml, isolating the <article> tag, reconstituting the html, and passing it out... same result. Not sure if there's further cleaning required here or something else. It seems to me that if the html comes directly from lxml (as in this case), it ought to work, but clearly that's wrong. Recipe below; the attached zipfile has logs, raw html, reprocessed html, and the epub file.
Code:
#!/usr/bin/env python

__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe
import html5lib
from lxml import etree, html
from lxml.html import builder as E
import copy

# http://online.wsj.com/page/us_in_todays_paper.html

class WallStreetJournal(BasicNewsRecipe):

    title = 'The Wall Street Journal'
    __author__ = 'Kovid Goyal and Joshua Oster-Morris'
    description = 'News and current affairs'
    needs_subscription = True
    language = 'en'
    compress_news_images = True
    compress_news_images_auto_size = 5
    max_articles_per_feed = 1000
    timefmt = ' [%a, %b %d, %Y]'
    no_stylesheets = True
    ignore_duplicate_articles = {'url'}

    # map url suffix -> name used for the debug dump files
    suffix_dict = {'1412643100': 'HP_ARTICLE', '1412636585': 'Opinion_Article'}
    print_files = True
    print_file_loc = 'E:\\Temp\\wsjTest\\'

    keep_only_tags = [
        dict(name='h1'),
        dict(name='h2', attrs={'class':['subhead', 'subHed deck']}),
        dict(name='span', itemprop='author', rel='author'),
        dict(name='article', id=['article-contents', 'articleBody']),
        dict(name='div', id='article_story_body'),
        dict(name='div', attrs={'class':'snippet-ad-login'}),
        dict(name='div', attrs={'data-module-name':'resp.module.article.articleBody'}),
    ]

    def preprocess_raw_html(self, raw_html, url):
        # root = html5lib.parse(raw_html, treebuilder='lxml', namespaceHTMLElements=False)
        html_parser = etree.HTMLParser()
        html_parsed = etree.fromstring(raw_html, parser=html_parser)
        # note: the xpath originally read //article[@id=('article-contents' or 'articleBody')],
        # which XPath 1.0 evaluates as @id = true(), i.e. any <article> with an id attribute
        selected = html_parsed.xpath("//article[@id='article-contents' or @id='articleBody']")
        html_out = E.HTML(E.BODY(selected[0]))
        self.log("Preprocessing URL:", url)
        name = self.suffix_dict[url.split("-")[-1]]
        output = etree.tostring(html_out)
        if self.print_files:
            open(self.print_file_loc + name + '-raw.html', 'wb').write(raw_html)
            open(self.print_file_loc + name + '-preprocessed.html', 'wb').write(output)
        return output

    remove_tags = [
        dict(attrs={'class':['insetButton', 'insettipBox']}),
        dict(name='span', attrs={'data-country-code':True, 'data-ticker-code':True}),
    ]

    use_javascript_to_login = True

    def javascript_login(self, br, username, password):
        br.visit('https://id.wsj.com/access/pages/wsj/us/login_standalone.html?mg=com-wsj', timeout=120)
        f = br.select_form(nr=0)
        f['username'] = username
        f['password'] = password
        br.submit(timeout=120)

    def populate_article_metadata(self, article, soup, first):
        if first and hasattr(self, 'add_toc_thumbnail'):
            picdiv = soup.find('img')
            if picdiv is not None:
                self.add_toc_thumbnail(article, picdiv['src'])

    def preprocess_html(self, soup):
        # Remove thumbnail for zoomable images
        for div in soup.findAll('div', attrs={'class':lambda x: x and 'insetZoomTargetBox' in x.split()}):
            img = div.find('img')
            if img is not None:
                img.extract()
        return soup

    def parse_index(self):
        feeds = []
        articles = []

        # will parse
        title1 = 'HP_Article'
        desc1 = 'A News Article about Hewlett Packard'
        url1 = 'http://online.wsj.com/articles/hewlett-packard-split-comes-as-more-investors-say-big-isnt-better-1412643100'
        articles.append({'title':title1, 'url':url1, 'description':desc1, 'date':''})

        # won't parse
        title = "Opinion_Article"
        desc = 'An Opinion Article about China Bubble'
        url = 'http://online.wsj.com/articles/bret-stephens-hong-kong-pops-the-china-bubble-1412636585'
        articles.append({'title':title, 'url':url, 'description':desc, 'date':''})

        # bundle and return
        section = "This Sample Section"
        feeds.append((section, articles))
        return feeds

    def cleanup(self):
        self.browser.open('http://online.wsj.com/logout?url=http://online.wsj.com')
#12
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
There are lots of invalid comments in that raw html, for example,
Code:
<!--[if lte IE 8]>
<div data-module-id="6" data-module-name="article.app/lib/module/ieWarning" data-module-zone="ie_warning" class="zonedModule">
  <div class="ie-flag">
    <button class="ie-button"></button>
    <div class="ie-warning-wrapper">
      <p><span id="warning-label">BROWSER UPDATE</span> To gain access to the full experience, please upgrade your browser: </p>
      <ul>
        <li><a href="https://www.google.com/intl/en_us/chrome/browser/">Chrome</a> | </li>
        <li><a href="http://support.apple.com/downloads/#safari">Safari</a> | </li>
        <li><a href="https://www.mozilla.org/en-US/firefox/new/">Firefox</a> | </li>
        <li><a href="http://windows.microsoft.com/en-us/internet-explorer/download-ie">Internet Explorer</a></li>
      </ul><br>
      <p><span id="warning-note">Note: If you are running Internet Explorer 9 and above, make sure it is not in compatibility mode</span></p>
    </div>
  </div>
</div>
<!-- data-module-name="article.app/lib/module/ieWarning" -->
<![endif]-->

and non-standard downlevel-revealed ones like:
Code:
<![if ! lte IE 8]>
<span class="image-enlarge"> ENLARGE </span>
<![endif]>

These malformed comments are likely what breaks the parsing. You can strip them by adding this to the recipe:
Code:
preprocess_regexps = [
    # requires "import re" at the top of the recipe
    (re.compile(r'<!--\[if lte IE 8\]>.+?<!\[endif\]-->', re.DOTALL), lambda m: ''),
    (re.compile(r'<!\[if ! lte IE 8\]>.+?<!\[endif\]>', re.DOTALL), lambda m: ''),
]
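To sanity-check patterns like these outside calibre, a quick standalone test (the sample string is made up):
Code:
import re

sample = 'before<!--[if lte IE 8]>IE-only junk<![endif]-->after'
pat = re.compile(r'<!--\[if lte IE 8\]>.+?<!\[endif\]-->', re.DOTALL)
print pat.sub('', sample)  # prints 'beforeafter'

preprocess_regexps substitutions run on the downloaded html before it is parsed, so the broken comments never reach the parser.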
#13
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I have committed code to strip those broken comments and improve processing of the new-style markup found in Opinion_Article-raw.html:
https://github.com/kovidgoyal/calibr...fe6b51dfa131a0
#14
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
That works! Thanks for the fix; that cleaned it up nicely. Guess I learned something about the limits of lxml here... sorry I wasn't able to add more value on this one.
#15
Enthusiast
Posts: 42
Karma: 20
Join Date: Jan 2012
Device: Kindle Paperwhite
Thank you, Kovid. The new recipe works for me as well.
Tags: recipe, wall street journal, wsj