10-07-2014, 05:25 PM   #6
dkfurrow
Member
dkfurrow began at the beginning.
 
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Hmmm....still can’t get it to work. Attached is the zip file with ‘keep_only’ removed, plus a log file and the raw html. I’ve successfully set up the development environment (Windows), intending to get some detail on how keep_only is used (should be in the RecursiveFetcher class, right?), but I can’t get a basic print statement to work from that class. Beyond that, I know:

(1) Even narrowing keep_only to dict(name='article', id='article-contents') didn’t work.

(2) Whatever the problem is, it occurs before the ‘input’ stage.

(3) With keep_only removed, the ‘input’ stage output for the article that parsed correctly shows a <div> tag in place of the raw html <article> tag, with the same attributes. For the article that didn’t parse, there’s no corresponding <div> tag. Probably worth noting that Notepad++ recognizes the <article> tag in the raw file of the one that parsed, but not in the raw file of the other.
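
For what it's worth, the check in (3) can be done outside the editor. Below is a minimal sketch, assuming only Python's standard-library html.parser; it is not calibre's own parsing, and the names TagFinder and find_tags are made up for illustration. It lists the start tags of a given name that the parser actually sees in a raw html string, optionally filtered by attribute, which should show whether the <article id="article-contents"> tag is recognizable in each raw file.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Collect the attributes of every start tag with a given name.

    Hypothetical helper for illustration; not part of calibre.
    """

    def __init__(self, tag_name):
        super().__init__(convert_charrefs=True)
        self.tag_name = tag_name.lower()
        self.matches = []  # one dict of attributes per matching start tag

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case.
        if tag == self.tag_name:
            self.matches.append(dict(attrs))


def find_tags(raw_html, tag_name, **required_attrs):
    """Return the attribute dicts of matching tags found in raw_html.

    Only tags whose attributes equal every key/value pair in
    required_attrs are returned.
    """
    finder = TagFinder(tag_name)
    finder.feed(raw_html)
    finder.close()
    return [
        attrs for attrs in finder.matches
        if all(attrs.get(key) == value for key, value in required_attrs.items())
    ]
```

Calling find_tags(raw, 'article', id='article-contents') on the contents of each raw html file would return a non-empty list only where the parser finds that tag. An empty list for the article that fails would suggest the markup ahead of the tag is malformed enough to hide it, though calibre's parser may well behave differently from this one.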

That’s about all I have been able to figure out, sorry if I'm missing something obvious...hard to see what machine or implementation issues may be at work here. The logfile has a JSBrowser statement (below) that I’m not familiar with, but other than that, any advice you could give on getting some more detail on the ‘pre-input’ processing would be helpful.

JSBrowser msg():https://a248.e.akamai.net/f/248/6767...11505143897:1: Porthole: Using built-in browser support

Thanks,
Dale
Attached Files
File Type: zip wsjTest.zip (1.85 MB, 213 views)