Changes to nytimes recipe:
- Fix 404 error and crash for non-existent index pages (Web edition). Non-existent sections are now silently ignored.
- Fix crash when articles are preceded by ad pages (all editions). A five-second delay is inserted before trying to re-serve an article that served an ad page; otherwise the ad is frequently served again.
The handling of the ad has been moved to preprocess_html, since skip_ad_pages as implemented in the recipe didn't work (it failed with an obscure XML decoding crash) and probably never did.
Note: there is still an intermittent problem here. Sometimes a fragment of the ad page appears as the article, and the article itself is loaded as an inline link from the ad page. I'll work on this as time permits, but in the meantime, as long as recursions=1, you will still get the article (it will follow the ad fragment).
- Include tech blog articles (all editions; turn this off using getTechBlogs=False)
- Include related articles and inline links to NYTimes articles (all editions; turn this off using recursions=0)
- Screen article age via URL instead of downloading the article and looking at the dateline (Web edition; ignore article age by setting oldest_web_article=None). This speeds up the Web edition recipe a lot, since it no longer has to download articles just to discover they are too old.
- Remove the login requirement, which is no longer necessary (all editions)
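To illustrate the URL-based age screening described above, here is a minimal sketch. It is not the recipe's actual code: the helper names url_date and too_old are hypothetical, and it assumes article URLs carry the publication date as a /YYYY/MM/DD/ path segment. URLs without a date are kept rather than dropped.

```python
import re
from datetime import date, timedelta

# Assumption: article URLs embed the publication date as /YYYY/MM/DD/.
_URL_DATE = re.compile(r'/(\d{4})/(\d{2})/(\d{2})/')


def url_date(url):
    """Return the date embedded in the URL path, or None if there is none."""
    m = _URL_DATE.search(url)
    if m is None:
        return None
    try:
        return date(*map(int, m.groups()))
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None


def too_old(url, oldest_web_article, today=None):
    """Decide from the URL alone whether an article should be skipped.

    oldest_web_article is a number of days, or None to accept everything.
    """
    if oldest_web_article is None:
        return False
    published = url_date(url)
    if published is None:
        return False  # no date in the URL: keep the article
    today = today or date.today()
    return published < today - timedelta(days=oldest_web_article)
```

With oldest_web_article=1 this keeps articles dated yesterday and today, which matches the meaning given for that setting in the usage notes.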
- The standard recipe is Today's Paper.
- For the Today's Headlines issue, set headlinesOnly=True
- For the Web version, set webEdition=True and set oldest_web_article to the maximum age (in days) of the articles you want to download. If you set oldest_web_article=None you will get everything; otherwise set it to a number (e.g., 7 for a week, 1 for yesterday and today).
- The technology blogs are attached to each version unless you set getTechBlogs=False. You can control the oldest article (tech_oldest_article) and the maximum number of articles per feed (tech_max_articles_per_feed).
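As a rough illustration, the options above might be set like this for a one-week Web edition. The option names come from the notes above; where they are defined in the recipe file is an assumption, and the two tech_* values are placeholders, not the recipe's defaults.

```python
# Illustrative settings only; placement within the recipe is an assumption.
headlinesOnly = False            # True selects the Today's Headlines issue
webEdition = True                # True selects the Web version
oldest_web_article = 7           # maximum age in days; None downloads everything
getTechBlogs = True              # False leaves out the technology blogs
tech_oldest_article = 14         # placeholder value, not the recipe default
tech_max_articles_per_feed = 25  # placeholder value, not the recipe default
recursions = 1                   # 0 turns off related articles and inline links
```

Leaving both headlinesOnly and webEdition set to False gives the standard Today's Paper recipe.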
Here are typical file sizes for various recipe options. Run time is proportional to file size, so, for example, the Web version with all articles downloaded can take several hours.
Headlines only: 6MB
Today's Paper: 9MB
Web, 1 day: 14MB
Web, 7 days: 27MB
Web, all: 40MB