Hi, I found out why the New York Times (Web) recipe failed to fetch: the NYT servers block your IP if you fetch articles too fast! So the solution is to add a delay between fetching each HTML page. I set the delay to 15 seconds and it works. You can use a shorter delay, but the longer the delay, the safer. I have pasted my recipe settings below for your reference. Pay attention to the "delay = 15" setting:
class NewYorkTimes(BasicNewsRecipe):
    if is_web_edition:
        title = 'The New York Times (Web)'
        description = 'New York Times (Web). You can edit the recipe to remove sections you are not interested in.'
        delay = 15
    else:
        title = 'The New York Times'
        description = 'Today\'s New York Times'
    encoding = 'utf-8'
    __author__ = 'Kovid Goyal'
    language = 'en'
    ignore_duplicate_articles = {'title', 'url'}
    no_stylesheets = True
    compress_news_images = True
    compress_news_images_auto_size = 5
    conversion_options = {'flow_size': 0}
    delay = 15
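
In case it helps to see what a delay between downloads amounts to, here is a minimal sketch in plain Python. It is not calibre's actual code, only an illustration of the idea: wait a fixed number of seconds before every request after the first. The names fetch_all, fetch and sleep are my own, made up for the example.

```python
import time


def fetch_all(urls, fetch, delay=15, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds between requests.

    `fetch` is any callable that takes a URL and returns the page content
    (hypothetical here; a real one would do the HTTP request).
    `sleep` can be swapped out so the pause can be tested without waiting.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


# Stand-in fetcher so the example runs without network access.
demo_pages = fetch_all(
    ['page1', 'page2'],
    fetch=lambda url: '<html>%s</html>' % url,
    delay=0,  # no waiting in the demo
)
print(demo_pages)
```

With delay = 15 and N pages, this adds about 15 * (N - 1) seconds to the total download time, so a longer delay is safer but makes the fetch slower.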