Old 10-06-2014, 02:11 PM   #1
dkfurrow
Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Wall Street Journal--feedparser error?

Thought I'd start a new thread for this... it appears to be a different issue than the previous WSJ thread.

I started getting errors from the WSJ recipe on Saturday. parse_index works correctly, but when the articles are parsed, each one returns the error "Initial parse failed, using more forgiving parsers", resulting in an epub containing only empty articles.

A quick search revealed that the error message originates with feedparser... I'm guessing the solution is to alter the downloaded html in some way so that it conforms to what feedparser expects, but I'm not sure how to do this.

Logfile attached. Any advice would be greatly appreciated.

Thanks,
Dale
Attached Files
File Type: txt logfile.txt (144.5 KB, 254 views)
Old 10-07-2014, 12:00 AM   #2
kovidgoyal
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If the output has empty articles it means the website's markup has changed and the recipe's keep_only_tags most likely needs to be adjusted. "Initial parse failed" comes from an html parser and is not relevant; the fallback parser is perfectly capable of parsing whatever markup the WSJ throws at it.
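For example, something along these lines (the tag name and id below are only placeholders; take the real values from the current page source):

Code:
    keep_only_tags = [
        # placeholder selector -- inspect the downloaded html and point this
        # at whatever container actually holds the article body now
        dict(name='div', attrs={'id': 'article_story_body'}),
    ]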
Old 10-07-2014, 12:02 AM   #3
kovidgoyal
creator of calibre
I tried it and it seems to be returning full articles for me; see the attached extract, which contains only the first four articles. Given the other thread on parse_index() failures, I'm guessing the WSJ is in the middle of some kind of phased rollout of changes to their website.
Attached Files
File Type: epub wsj.epub (273.8 KB, 230 views)
Old 10-07-2014, 07:52 AM   #4
dkfurrow
Member
Thanks for the reply. Yeah, I got similar results on my run... you're right, they must be making some changes.

The only issue I had was the Opinion section still not downloading. I know you put in a fix for that a few weeks ago, and the articles in question do contain the tag with the "article-contents" id, but for some reason it's not working. I tried a few other combinations of tags for the keep_only_tags list, but still couldn't get it to work (except by removing keep_only_tags entirely, which puts too much extra stuff in). A sample parse_index function with two articles (one that works, one that doesn't) is below. Do you get similar results?

Code:
def parse_index(self):
        feeds = []
        articles = []
        # will parse
        title1 = 'HP Article WSJ'
        desc1 = 'about Hewlett Packard'
        url1 = 'http://online.wsj.com/articles/hewlett-packard-split-comes-as-more-investors-say-big-isnt-better-1412643100'
        articles.append({'title':title1, 'url':url1, 'description':desc1, 'date':''})

        # won't parse
        title = "Stephens Article in WSJ"
        desc = 'china bubble story'
        url = 'http://online.wsj.com/articles/bret-stephens-hong-kong-pops-the-china-bubble-1412636585'
        articles.append({'title':title, 'url':url, 'description':desc, 'date':''})


        for article in articles:
            print "title:", article['title']
        section = "This Sample Section"
        feeds.append((section, articles))
        return feeds
Old 10-07-2014, 08:21 AM   #5
kovidgoyal
creator of calibre
No, both work for me, see attached. Use the --debug-pipeline option to see exactly what HTML is being downloaded with no keep_only_tags. That should help you figure out why it is not working.
Attached Files
File Type: epub wsj.epub (126.2 KB, 258 views)
Old 10-07-2014, 05:25 PM   #6
dkfurrow
Member
Hmmm... still can’t get it to work. Attached is the zip file with keep_only_tags removed, plus a log file and the raw html. I’ve successfully set up the development environment (Windows) with the intent of getting some detail on keep_only_tags usage (that should be in the RecursiveFetcher class, right?), but can’t get a basic print statement to work from that class. Beyond that, I know:

(1) Even narrowing keep_only to dict(name='article', id='article-contents') didn’t work.

(2) Whatever the problem is, it occurs before the ‘input’ stage.

(3) I see that, for the article which parsed correctly, in the 'input' stage output (with keep_only_tags removed), a <div> tag with the same attributes replaces the raw html <article> tag. For the article which didn’t parse, there’s no corresponding <div> tag. Probably worth noting that Notepad++ recognizes the <article> tag in the raw file of the one that parsed, but not in the other.

That’s about all I have been able to figure out; sorry if I'm missing something obvious... it's hard to see what machine or implementation issues may be at work here. The logfile has a JSBrowser statement (below) that I’m not familiar with, but other than that, any advice you could give on getting more detail on the ‘pre-input’ processing would be helpful.

JSBrowser msg():https://a248.e.akamai.net/f/248/6767...11505143897:1: Porthole: Using built-in browser support

Thanks,
Dale
Attached Files
File Type: zip wsjTest.zip (1.85 MB, 209 views)
Old 10-07-2014, 11:29 PM   #7
kovidgoyal
creator of calibre
You can look at the raw html, before it is parsed by the RecursiveFetcher class, by implementing this in your recipe class:

Code:
def preprocess_raw_html(self, raw_html, url):
    open('/some/temp/file', 'wb').write(raw_html)
    return raw_html
The raw html will then be saved to the temp file you chose above.

Then you can see the html after keep_only_tags etc. have run by implementing:

Code:
def preprocess_html(self, soup):
   open('/some/temp/file2', 'wb').write(str(soup))
   return soup
If you want to debug the operation of keep_only_tags, add some lines like self.log('whatever you want') to the get_soup() method in fetch/simple.py.

JSBrowser comes from the use of a full WebKit browser with javascript support to do the login, which the WSJ requires. It is only used for login, nothing else.
Old 10-07-2014, 11:37 PM   #8
kovidgoyal
creator of calibre
Looking at input/feed_0/article_0/index.html from your attachment, I see

Code:
<div id="hatFacebook" style="border: none;">&lt;h4&gt;WSJ on Facebook&lt;/h4&gt;&lt;div style=&quot;border: none; padding: 2px 3px;&quot; class=&quot;fb-like&quot; data-href=&quot;http://www.facebook.com/wsj&quot; data-send=&quot;false&quot; data-layout=&quot;button_count&quot; data-width=&quot;250&quot; data-show-faces=&quot;false&quot; data-action=&quot;recommend&quot;&gt;&lt;/div&gt;</div>
This is most likely because the html parser incorrectly parsed something, so fixing the html in preprocess_raw_html might do the trick. The easiest way to fix it is like this:

Code:
def preprocess_raw_html(self, html, url):
    import html5lib
    from lxml import etree
    # re-parse with html5lib's forgiving parser and serialize back to clean markup
    root = html5lib.parse(html, treebuilder='lxml', namespaceHTMLElements=False)
    return etree.tostring(root, encoding=unicode)
Alternatively you can use regexps to nuke <meta>, <script>, <style> tags and comments which are most often the cause of parse errors.
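For example, something like this (untested; adjust the patterns as needed):

Code:
    # these need "import re" at the top of the recipe
    preprocess_regexps = [
        (re.compile(r'<meta[^>]+>', re.IGNORECASE), lambda m: ''),
        (re.compile(r'<script[^>]*>.*?</script>', re.DOTALL|re.IGNORECASE), lambda m: ''),
        (re.compile(r'<style[^>]*>.*?</style>', re.DOTALL|re.IGNORECASE), lambda m: ''),
        (re.compile(r'<!--.*?-->', re.DOTALL), lambda m: ''),
    ]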
Old 10-07-2014, 11:42 PM   #9
kovidgoyal
creator of calibre
See if this commit helps:

https://github.com/kovidgoyal/calibr...d199d96fb31be8
Old 10-08-2014, 09:38 AM   #10
dkfurrow
Member
The last update gives me the same results... the only article that shows up in the Opinion section is "Corrections & Amplifications". I think the way to go is the preprocess route; it appears to me that the new format must be invalid html in some way. I'll give that a try today.
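For example, a quick standalone check along these lines (outside calibre; the filename is just a placeholder for wherever I save the raw html) should show whether the stricter parser is losing the <article> element:

Code:
# standalone sanity check, not part of the recipe
import html5lib
from lxml import html

raw = open('raw_opinion_article.html', 'rb').read()  # placeholder filename

lxml_root = html.fromstring(raw)
h5_tree = html5lib.parse(raw, treebuilder='lxml', namespaceHTMLElements=False)

print 'lxml found', len(lxml_root.findall('.//article')), '<article> tags'
print 'html5lib found', len(h5_tree.findall('.//article')), '<article> tags'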
Old 10-08-2014, 11:10 PM   #11
dkfurrow
Member
The plot thickens... back to the two articles, hereafter 'HP_Article' and 'Opinion_Article'. I tried html5lib in preprocess, and HP_Article downloaded while Opinion_Article did not (there was an error in the ihatexml.py file in html5lib... not sure whether that was related).

So I tried parsing the raw data with lxml, isolating the <article> tag, reconstituting the html and passing it out... same result. Not sure if there's further cleaning required here or something else; it seems to me that if the html comes directly from lxml (as in this case), it ought to work... clearly that's wrong. The recipe is below; the attached zipfile has logs, raw html, reprocessed html and the epub file.

Code:
#!/usr/bin/env  python
__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe
import html5lib
from lxml import etree, html
from lxml.html import builder as E
import copy

# http://online.wsj.com/page/us_in_todays_paper.html

class WallStreetJournal(BasicNewsRecipe):

    title = 'The Wall Street Journal'
    __author__ = 'Kovid Goyal and Joshua Oster-Morris'
    description = 'News and current affairs'
    needs_subscription = True
    language = 'en'

    compress_news_images = True
    compress_news_images_auto_size = 5
    max_articles_per_feed = 1000
    timefmt  = ' [%a, %b %d, %Y]'
    no_stylesheets = True
    ignore_duplicate_articles = {'url'}
    suffix_dict = {'1412643100': 'HP_ARTICLE', '1412636585': 'Opinion_Article'}
    print_files = True
    print_file_loc = 'E:\\Temp\\wsjTest\\'

    keep_only_tags = [
        dict(name='h1'), dict(name='h2', attrs={'class':['subhead', 'subHed deck']}),
        dict(name='span', itemprop='author', rel='author'),
        dict(name='article', id=['article-contents', 'articleBody']),
        dict(name='div', id='article_story_body'),
        dict(name='div', attrs={'class':'snippet-ad-login'}),
        dict(name='div', attrs={'data-module-name':'resp.module.article.articleBody'}),
    ]

    def preprocess_raw_html(self, raw_html, url):
        # root = html5lib.parse(raw_html, treebuilder='lxml', namespaceHTMLElements=False)
        html_parser = etree.HTMLParser()
        html_parsed = etree.fromstring(raw_html, parser=html_parser)
        selected = html_parsed.xpath("//article[@id='article-contents' or @id='articleBody']")
        html_out = E.HTML(E.BODY(selected[0]))
        self.log( "Preprocessing URL:",  url)
        name = self.suffix_dict[url.split("-")[-1:][0]]
        output = etree.tostring(html_out)
        if self.print_files:
            open(self.print_file_loc + name + '-raw.html', 'wb').write(raw_html)
            open(self.print_file_loc + name + '-preprocessed.html', 'wb').write(output)
        return output

    remove_tags = [
        dict(attrs={'class':['insetButton', 'insettipBox']}),
        dict(name='span', attrs={'data-country-code':True, 'data-ticker-code':True}),
    ]

    use_javascript_to_login = True

    def javascript_login(self, br, username, password):
        br.visit('https://id.wsj.com/access/pages/wsj/us/login_standalone.html?mg=com-wsj', timeout=120)
        f = br.select_form(nr=0)
        f['username'] = username
        f['password'] = password
        br.submit(timeout=120)

    def populate_article_metadata(self, article, soup, first):
        if first and hasattr(self, 'add_toc_thumbnail'):
            picdiv = soup.find('img')
            if picdiv is not None:
                self.add_toc_thumbnail(article,picdiv['src'])

    def preprocess_html(self, soup):
        # Remove thumbnail for zoomable images
        for div in soup.findAll('div', attrs={'class':lambda x: x and 'insetZoomTargetBox' in x.split()}):
            img = div.find('img')
            if img is not None:
                img.extract()
        return soup

    def parse_index(self):
        feeds = []
        articles = []
        # will parse
        title1 = 'HP_Article'
        desc1 = 'A News Article about Hewlett Packard'
        url1 = 'http://online.wsj.com/articles/hewlett-packard-split-comes-as-more-investors-say-big-isnt-better-1412643100'
        articles.append({'title':title1, 'url':url1, 'description':desc1, 'date':''})
        # won't parse
        title = "Opinion_Article"
        desc = 'An Opinion Article about China Bubble'
        url = 'http://online.wsj.com/articles/bret-stephens-hong-kong-pops-the-china-bubble-1412636585'
        articles.append({'title':title, 'url':url, 'description':desc, 'date':''})
        # bundle and return
        section = "This Sample Section"
        feeds.append((section, articles))
        return feeds

    def cleanup(self):
        self.browser.open('http://online.wsj.com/logout?url=http://online.wsj.com')
Attached Files
File Type: zip wsjTest.zip (1.84 MB, 190 views)
Old 10-09-2014, 12:26 AM   #12
kovidgoyal
creator of calibre
There are lots of invalid comments in that raw html, for example,

Code:
    <!--[if lte IE 8]>
  
<div data-module-id="6" data-module-name="article.app/lib/module/ieWarning" data-module-zone="ie_warning" class="zonedModule">
<div class="ie-flag">
  <button class="ie-button"></button>

  <div class="ie-warning-wrapper">
    <p><span id="warning-label">BROWSER UPDATE</span> To gain access to the full experience, please upgrade your browser: </p>
    <ul>
      <li><a href="https://www.google.com/intl/en_us/chrome/browser/">Chrome</a> | </li>
      <li><a href="http://support.apple.com/downloads/#safari">Safari</a> | </li>
      <li><a href="https://www.mozilla.org/en-US/firefox/new/">Firefox</a> | </li>
      <li><a href="http://windows.microsoft.com/en-us/internet-explorer/download-ie">Internet Explorer</a></li>
    </ul><br>
    <p><span id="warning-note">Note: If you are running Internet Explorer 9 and above, make sure it is not in compatibility mode</span></p>
  </div>
</div>

</div> <!-- data-module-name="article.app/lib/module/ieWarning" -->

    <![endif]-->
Note the improper nesting of comments. And then this:

Code:
         <![if ! lte IE 8]>
        <span class="image-enlarge">
          ENLARGE
        </span>
        <![endif]>
The following should take care of it:

Code:
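    # remember to add "import re" to the recipe's imports for these patterns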
    preprocess_regexps = [
        (re.compile(r'<!--\[if lte IE 8\]>.+?<!\[endif\]-->', re.DOTALL), lambda m: ''),
        (re.compile(r'<!\[if ! lte IE 8\]>.+?<!\[endif\]>', re.DOTALL), lambda m:''),
    ]

Last edited by kovidgoyal; 10-09-2014 at 12:29 AM.
Old 10-09-2014, 12:50 AM   #13
kovidgoyal
creator of calibre
I have committed code to strip those broken comments and improve processing of the new style markup found in Opinion_Article-raw.html

https://github.com/kovidgoyal/calibr...fe6b51dfa131a0

Last edited by kovidgoyal; 10-11-2014 at 06:15 AM.
Old 10-09-2014, 01:41 PM   #14
dkfurrow
Member

That works! Thanks for the fix, that cleaned it up nicely. Guess I learned something about the limits of lxml here... sorry I wasn't able to add more value on this one.
Attached Files
File Type: epub 20141009-10-30-41-wsj.epub (10.76 MB, 208 views)
Old 10-09-2014, 09:39 PM   #15
BobbyVan
Enthusiast
Posts: 42
Karma: 20
Join Date: Jan 2012
Device: Kindle Paperwhite
Thank you, Kovid. The new recipe works for me as well.

Tags
recipe, wall street journal, wsj



