#1
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Wall Street Journal--feedparser error?
Thought I'd start a new thread for this... it appears to be a different issue than the previous WSJ thread.
I started getting errors on WSJ starting Saturday. parse_index() works correctly, but when articles are parsed, each returns an error of "Initial parse failed, using more forgiving parsers", resulting in an epub with only empty articles. A quick search suggests that error message originates with feedparser... I'm guessing the solution is then to alter the downloaded html in some manner to conform to what feedparser expects, but I'm not sure how to do this. Logfile attached. Any advice would be greatly appreciated.

Thanks, Dale
#2
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If the output has empty articles it means the website's markup has changed and the recipe's keep_only_tags most likely needs to be adjusted. "Initial parse failed" comes from an html parser and is not relevant; the fallback parser is perfectly capable of parsing whatever markup WSJ throws at it.
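For anyone following along: keep_only_tags is a list of tag specs on the recipe class, and adjusting it means matching whatever the site's new markup looks like. A minimal sketch (the id value is only illustrative, borrowed from later in this thread, not a guaranteed match for WSJ's current pages):
Code:
keep_only_tags = [
    dict(name='h1'),
    dict(name='article', attrs={'id': 'article-contents'}),
]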
#3
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I tried it and it seems to be returning full articles for me; see the attached extract, which contains only the first four articles. Given the other thread on parse_index() failures, I'm guessing the WSJ is in the middle of some kind of phased rollout of changes to their website.
#4
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Thanks for the reply. Yeah, I got similar results on my run... you're right, they must be making some changes.
The only issue I had was the opinion section still not downloading. I know you put in a fix for that a few weeks ago, and the articles in question contain the <article> tag with the "article-contents" id, but for some reason it's not working. I tried a few other combinations of tags for the keep_only_tags list, but still couldn't get it to work (except by removing keep_only_tags entirely, which pulls in too much extra material). A sample parse_index() function with two articles (one that works, one that doesn't) is below. Do you get similar results?
Code:
def parse_index(self):
    feeds = []
    articles = []

    # will parse
    title1 = 'HP Article WSJ'
    desc1 = 'about Hewlett Packard'
    url1 = 'http://online.wsj.com/articles/hewlett-packard-split-comes-as-more-investors-say-big-isnt-better-1412643100'
    articles.append({'title':title1, 'url':url1, 'description':desc1, 'date':''})

    # won't parse
    title = "Stephens Article in WSJ"
    desc = 'china bubble story'
    url = 'http://online.wsj.com/articles/bret-stephens-hong-kong-pops-the-china-bubble-1412636585'
    articles.append({'title':title, 'url':url, 'description':desc, 'date':''})

    for article in articles:
        print "title:", article['title']

    section = "This Sample Section"
    feeds.append((section, articles))
    return feeds
#5
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
No, both work for me; see attached. Use the --debug-pipeline option to see exactly what HTML is being downloaded with no keep_only_tags applied. That should help you figure out why it is not working.
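For reference, a typical command line run with that option (file and directory names here are illustrative):
Code:
ebook-convert wsj.recipe wsj.epub --debug-pipeline /tmp/wsj-debug

The debug directory then contains the html at each stage of conversion, including the input/ files referenced later in this thread.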
#6
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Hmmm... still can’t get it to work. Attached is the zip file with keep_only_tags removed, plus a log file and the raw html. I’ve successfully set up the development environment (Windows) with the intent of getting some detail on keep_only_tags usage (it should be in the RecursiveFetcher class, right?), but can’t get a basic print statement to work from that class. Beyond that, I know:
(1) Even narrowing keep_only_tags to dict(name='article', id='article-contents') didn’t work.
(2) Whatever the problem is, it occurs before the ‘input’ stage.
(3) For the article that parsed correctly, the input (with keep_only_tags removed) shows a <div> tag replacing the raw html’s <article> tag, with the same attributes. For the article that didn’t parse, there’s no corresponding <div> tag.

Probably worth noting that Notepad++ recognizes the <article> tag in the raw file of the one that parsed, but not in the other. That’s about all I have been able to figure out; sorry if I'm missing something obvious... it’s hard to see what machine or implementation issues may be at work here. The logfile has a JSBrowser statement (below) that I’m not familiar with, but other than that, any advice you could give on getting some more detail on the ‘pre-input’ processing would be helpful.

JSBrowser msg(): https://a248.e.akamai.net/f/248/6767...11505143897:1: Porthole: Using built-in browser support

Thanks, Dale
#7
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can look at the raw html, before it is parsed by the RecursiveFetcher class, by implementing this in your recipe class:
Code:
def preprocess_raw_html(self, raw_html, url):
    # dump the raw downloaded html for inspection
    # ('/some/temp/file' is a placeholder path)
    open('/some/temp/file', 'wb').write(raw_html)
    return raw_html

Then you can see the html after keep_only_tags etc. have run by implementing:
Code:
def preprocess_html(self, soup):
    # dump the cleaned-up html, i.e. after keep_only_tags/remove_tags have run
    open('/some/temp/file2', 'wb').write(str(soup))
    return soup

JSBrowser comes from the use of a full WebKit browser with javascript support to do the login, which the WSJ requires. It is only used for login, nothing else.
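With both dumps in place, diffing the two files shows exactly what the keep_only_tags/remove_tags pass kept and discarded, which usually makes a selector mismatch obvious.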
#8
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Looking at input/feed_0/article_0/index.html from your attachment, I see
Code:
<div id="hatFacebook" style="border: none;"><h4>WSJ on Facebook</h4><div style="border: none; padding: 2px 3px;" class="fb-like" data-href="http://www.facebook.com/wsj" data-send="false" data-layout="button_count" data-width="250" data-show-faces="false" data-action="recommend"></div></div> Code:
def preprocess_raw_html(self, html, url):
    import html5lib
    from lxml import etree
    # build an lxml tree so lxml's serializer can be used on it
    root = html5lib.parse(html, treebuilder='lxml', namespaceHTMLElements=False)
    return etree.tostring(root, encoding=unicode)
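html5lib implements the same error-recovery algorithm browsers use, so round-tripping the page through it and re-serializing should normalize markup that stricter parsers mishandle.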
#9
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
#10
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
The last update gives me the same results... the only article that shows in the opinion section is "Corrections & Amplifications". I think the way to go is the preprocess route; it appears to me that the new format must be invalid html in some way. I'll give that a try today.
#11
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
The plot thickens... back to the two articles, hereafter 'HP_Article' and 'Opinion_Article'. I tried html5lib in preprocess: HP_Article downloaded and Opinion_Article did not (there was an error in html5lib's ihatexml.py file; not sure whether that was related).
So I tried parsing the raw data with lxml, isolating the <article> tag, reconstituting the html, and passing it out... same result. Not sure if there's further cleaning required here or something else. It seems to me that if the html comes directly from lxml (as in this case), it ought to work, but clearly that's wrong. Recipe below; the attached zipfile has logs, raw html, reprocessed html, and the epub file.
Code:
#!/usr/bin/env python

__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe
import html5lib
from lxml import etree, html
from lxml.html import builder as E
import copy

# http://online.wsj.com/page/us_in_todays_paper.html

class WallStreetJournal(BasicNewsRecipe):

    title = 'The Wall Street Journal'
    __author__ = 'Kovid Goyal and Joshua Oster-Morris'
    description = 'News and current affairs'
    needs_subscription = True
    language = 'en'
    compress_news_images = True
    compress_news_images_auto_size = 5
    max_articles_per_feed = 1000
    timefmt = ' [%a, %b %d, %Y]'
    no_stylesheets = True
    ignore_duplicate_articles = {'url'}

    # map url suffix -> name used for the debug dump files
    suffix_dict = {'1412643100': 'HP_ARTICLE', '1412636585': 'Opinion_Article'}
    print_files = True
    print_file_loc = 'E:\\Temp\\wsjTest\\'

    keep_only_tags = [
        dict(name='h1'),
        dict(name='h2', attrs={'class':['subhead', 'subHed deck']}),
        dict(name='span', itemprop='author', rel='author'),
        dict(name='article', id=['article-contents', 'articleBody']),
        dict(name='div', id='article_story_body'),
        dict(name='div', attrs={'class':'snippet-ad-login'}),
        dict(name='div', attrs={'data-module-name':'resp.module.article.articleBody'}),
    ]

    def preprocess_raw_html(self, raw_html, url):
        # root = html5lib.parse(raw_html, treebuilder='lxml', namespaceHTMLElements=False)
        html_parser = etree.HTMLParser()
        html_parsed = etree.fromstring(raw_html, parser=html_parser)
        # note: the xpath originally read //article[@id=('article-contents' or 'articleBody')],
        # which XPath 1.0 evaluates as @id = true(), i.e. any <article> with an id attribute
        selected = html_parsed.xpath("//article[@id='article-contents' or @id='articleBody']")
        html_out = E.HTML(E.BODY(selected[0]))
        self.log("Preprocessing URL:", url)
        name = self.suffix_dict[url.split("-")[-1]]
        output = etree.tostring(html_out)
        if self.print_files:
            open(self.print_file_loc + name + '-raw.html', 'wb').write(raw_html)
            open(self.print_file_loc + name + '-preprocessed.html', 'wb').write(output)
        return output

    remove_tags = [
        dict(attrs={'class':['insetButton', 'insettipBox']}),
        dict(name='span', attrs={'data-country-code':True, 'data-ticker-code':True}),
    ]

    use_javascript_to_login = True

    def javascript_login(self, br, username, password):
        br.visit('https://id.wsj.com/access/pages/wsj/us/login_standalone.html?mg=com-wsj', timeout=120)
        f = br.select_form(nr=0)
        f['username'] = username
        f['password'] = password
        br.submit(timeout=120)

    def populate_article_metadata(self, article, soup, first):
        if first and hasattr(self, 'add_toc_thumbnail'):
            picdiv = soup.find('img')
            if picdiv is not None:
                self.add_toc_thumbnail(article, picdiv['src'])

    def preprocess_html(self, soup):
        # Remove thumbnail for zoomable images
        for div in soup.findAll('div', attrs={'class':lambda x: x and 'insetZoomTargetBox' in x.split()}):
            img = div.find('img')
            if img is not None:
                img.extract()
        return soup

    def parse_index(self):
        feeds = []
        articles = []

        # will parse
        title1 = 'HP_Article'
        desc1 = 'A News Article about Hewlett Packard'
        url1 = 'http://online.wsj.com/articles/hewlett-packard-split-comes-as-more-investors-say-big-isnt-better-1412643100'
        articles.append({'title':title1, 'url':url1, 'description':desc1, 'date':''})

        # won't parse
        title = "Opinion_Article"
        desc = 'An Opinion Article about China Bubble'
        url = 'http://online.wsj.com/articles/bret-stephens-hong-kong-pops-the-china-bubble-1412636585'
        articles.append({'title':title, 'url':url, 'description':desc, 'date':''})

        # bundle and return
        section = "This Sample Section"
        feeds.append((section, articles))
        return feeds

    def cleanup(self):
        self.browser.open('http://online.wsj.com/logout?url=http://online.wsj.com')
#12
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
There are lots of invalid comments in that raw html, for example,
Code:
<!--[if lte IE 8]>
<div data-module-id="6" data-module-name="article.app/lib/module/ieWarning" data-module-zone="ie_warning" class="zonedModule">
  <div class="ie-flag">
    <button class="ie-button"></button>
    <div class="ie-warning-wrapper">
      <p><span id="warning-label">BROWSER UPDATE</span> To gain access to the full experience, please upgrade your browser: </p>
      <ul>
        <li><a href="https://www.google.com/intl/en_us/chrome/browser/">Chrome</a> | </li>
        <li><a href="http://support.apple.com/downloads/#safari">Safari</a> | </li>
        <li><a href="https://www.mozilla.org/en-US/firefox/new/">Firefox</a> | </li>
        <li><a href="http://windows.microsoft.com/en-us/internet-explorer/download-ie">Internet Explorer</a></li>
      </ul><br>
      <p><span id="warning-note">Note: If you are running Internet Explorer 9 and above, make sure it is not in compatibility mode</span></p>
    </div>
  </div>
</div>
<!-- data-module-name="article.app/lib/module/ieWarning" -->
<![endif]-->

and non-standard downlevel-revealed ones like:
Code:
<![if ! lte IE 8]>
<span class="image-enlarge"> ENLARGE </span>
<![endif]>

These malformed comments are likely what breaks the parsing. You can strip them by adding this to the recipe:
Code:
preprocess_regexps = [
    # requires "import re" at the top of the recipe
    (re.compile(r'<!--\[if lte IE 8\]>.+?<!\[endif\]-->', re.DOTALL), lambda m: ''),
    (re.compile(r'<!\[if ! lte IE 8\]>.+?<!\[endif\]>', re.DOTALL), lambda m: ''),
]
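To sanity-check patterns like these outside calibre, a quick standalone test (the sample string is made up):
Code:
import re

sample = 'before<!--[if lte IE 8]>IE-only junk<![endif]-->after'
pat = re.compile(r'<!--\[if lte IE 8\]>.+?<!\[endif\]-->', re.DOTALL)
print pat.sub('', sample)  # prints 'beforeafter'

preprocess_regexps substitutions run on the downloaded html before it is parsed, so the broken comments never reach the parser.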
#13
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I have committed code to strip those broken comments and improve processing of the new-style markup found in Opinion_Article-raw.html:
https://github.com/kovidgoyal/calibr...fe6b51dfa131a0
#14
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
That works! Thanks for the fix; that cleaned it up nicely. Guess I learned something about the limits of lxml here... sorry I wasn't able to add more value on this one.
#15
Enthusiast
Posts: 42
Karma: 20
Join Date: Jan 2012
Device: Kindle Paperwhite
Thank you, Kovid. The new recipe works for me as well.
Tags: recipe, wall street journal, wsj