Using get_obfuscated_article is a bit overkill, I think. I've been calling self.log(soup.prettify()) in preprocess_html() to see the contents. The problem is that I need the URL so I can re-fetch the page after signing in. The advantage of get_obfuscated_article is that it is passed the URL, but I didn't want to deal with the output file. Instead, I overrode fetch_article() to hold onto the URL so I could grab it inside preprocess_html(). I imagine this forces me to a single thread, but the performance is fine (it is a daily download at 4am). I'm attaching my solution, but I'll continue to tweak it. As for access to the URL and other article attributes, I'm going to start another thread to ask about that. Thanks for the help so far.
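For anyone following along, the idea boils down to roughly the following. This is a minimal sketch rather than the attached recipe: the class name UrlKeepingRecipe and the attribute current_url are made up for illustration, the extra fetch_article() arguments are simply passed through untouched, and the small stand-in base class exists only so the snippet can be run outside calibre. Setting simultaneous_downloads to 1 is my assumption about how to keep downloads sequential so the stored URL always matches the page being processed.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in used only so this sketch runs outside calibre.
    # It is NOT the real class; it just calls the preprocess hook.
    class BasicNewsRecipe(object):
        def log(self, *args):
            print(*args)

        def preprocess_html(self, soup):
            return soup

        def fetch_article(self, url, *args, **kwargs):
            return self.preprocess_html('<html><!-- %s --></html>' % url)


class UrlKeepingRecipe(BasicNewsRecipe):  # hypothetical recipe name
    title = 'URL keeping example'

    # Assumption: one download at a time, so current_url cannot be
    # overwritten by another article while this one is processed.
    simultaneous_downloads = 1

    current_url = None  # hypothetical attribute holding the last URL

    def fetch_article(self, url, *args, **kwargs):
        # Remember the URL, then let the base class do the real work.
        self.current_url = url
        return BasicNewsRecipe.fetch_article(self, url, *args, **kwargs)

    def preprocess_html(self, soup):
        # The URL saved above is available here, e.g. for logging or
        # for re-fetching the page after signing in.
        self.log('preprocess_html called for', self.current_url)
        return soup
```

Storing the URL on the recipe object is only safe when articles are fetched one at a time. With several download threads, one article's URL could be replaced by another's before preprocess_html() reads it.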