Hmm, still can’t get it to work. Attached is the zip file with ‘keep_only’ removed, plus a log file and the raw html. I’ve successfully set up the development environment (Windows), intending to get some detail on how keep_only is used (it should be in the RecursiveFetcher class, right?), but I can’t get even a basic print statement to produce output from that class. Beyond that, I know:
(1) Even narrowing keep_only to dict(name='article', id='article-contents') didn’t work.
(2) Whatever the problem is, it occurs before the ‘input’ stage.
(3) With keep_only removed, in the ‘input’ stage output for the article that parsed correctly, a <div> tag replaces the raw html <article> tag and carries the same attributes. For the article that didn’t parse, there’s no corresponding <div> tag. Probably worth noting that Notepad++ recognizes the <article> tag in the raw file of the article that parsed, but not in the other one.
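For anyone who wants to repeat the check in (3) without relying on Notepad++, here is a minimal sketch using only the Python standard library. It is not calibre code: `TagFinder` and `find_tag` are names I made up, and the two sample strings are invented stand-ins for the raw files, not their real contents. Python's `html.parser` is also not the parser calibre uses, so a match here only shows that the tag is visible to one tolerant parser.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Record every start tag with a given name and id (hypothetical helper)."""

    def __init__(self, name, tag_id):
        super().__init__()
        self.name = name
        self.tag_id = tag_id
        self.matches = []  # (line, column) of each matching start tag

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == self.name and dict(attrs).get("id") == self.tag_id:
            self.matches.append(self.getpos())


def find_tag(raw_html, name="article", tag_id="article-contents"):
    """Return the positions of matching start tags in raw_html."""
    finder = TagFinder(name, tag_id)
    finder.feed(raw_html)
    finder.close()
    return finder.matches


# Invented sample: the tag is present as ordinary markup.
good = ('<html><body><article id="article-contents" class="x">'
        '<p>hi</p></article></body></html>')

# Invented sample: the same text sits inside a comment, so it is not a tag.
# This is only one assumed way a tag can be in the file yet not recognized.
bad = ('<html><body><!-- <article id="article-contents"> -->'
       '<p>hi</p></body></html>')

print(len(find_tag(good)), len(find_tag(bad)))
```

To run it on the real files, read each raw html file into a string and pass it to `find_tag`. An empty list for the article that fails would suggest the problem is already in the raw markup rather than in keep_only.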
That’s about all I have been able to figure out; sorry if I’m missing something obvious. It’s hard to see what machine or implementation issues may be at work here. The log file has a JSBrowser message (below) that I’m not familiar with, but other than that, any advice you could give on getting more detail on the ‘pre-input’ processing would be helpful.
JSBrowser msg():
https://a248.e.akamai.net/f/248/6767...11505143897:1: Porthole: Using built-in browser support
Thanks,
Dale