|
|
#1 |
|
Junior Member
Posts: 4
Karma: 10
Join Date: Nov 2025
Device: Samsung Galaxy Tab S8
|
New York Times recipe blocked as Bot
This just started this morning: the NYT appears to be classifying Calibre's news download as a bot.
InputFormatPlugin: Recipe Input running
Downloading recipe urn: builtin:nytimes_sub
Trying to get latest version of recipe: nytimes_sub
Using user agent: User-Agent: Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)
Recipe specific options:
  web = Todays Paper
  days = 7
  comp = no
Traceback (most recent call last):
  File "runpy.py", line 198, in _run_module_as_main
  File "runpy.py", line 88, in _run_code
  File "site.py", line 83, in <module>
  File "site.py", line 78, in main
  File "site.py", line 50, in run_entry_point
  File "calibre\utils\ipc\worker.py", line 213, in main
  File "calibre\gui2\convert\gui_conversion.py", line 32, in gui_convert_recipe
  File "calibre\gui2\convert\gui_conversion.py", line 26, in gui_convert
  File "calibre\ebooks\conversion\plumber.py", line 1089, in run
  File "calibre\customize\conversion.py", line 242, in __call__
  File "calibre\ebooks\conversion\plugins\recipe_input.py", line 153, in convert
  File "calibre\web\feeds\news.py", line 1122, in download
  File "calibre\web\feeds\news.py", line 1300, in build_index
  File "<string>", line 226, in parse_index
  File "<string>", line 199, in parse_todays_page
  File "calibre\web\feeds\news.py", line 752, in index_to_soup
  File "mechanize\_mechanize.py", line 241, in open_novisit
  File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Not Allowed, Forbidden, Bot Blocked
|
|
|
|
|
|
#2 |
|
creator of calibre
Posts: 46,198
Karma: 29626604
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yeah, that started today. I don't see an easy workaround, however. They have decided to start blocking "bots". For a while, pretending to be the Wayback Machine got past it, but that doesn't work anymore.
|
|
|
|
|
|
|
|
#3 |
|
Member
Posts: 11
Karma: 10
Join Date: Nov 2011
Device: Kindle Paperwhite
|
The New York Times Book Review recipe unfortunately is blocked, too.
Hoping you can find a workaround, as usual.
|
|
|
|
|
|
#4 |
|
onlinenewsreader.net
Posts: 336
Karma: 10143
Join Date: Dec 2009
Location: Kelowna BC
Device: Various
|
Interestingly, the New York Times articles are all available via archive.is. I'm pretty sure this is a result of archive.is scraping the nytimes website because it seems unlikely that individual users are archiving articles. Alternatively, perhaps there is a bot that isn't seen as a bot because it has an nytimes subscription and uploads everything daily.
|
|
|
|
|
|
#5 |
|
Junior Member
Posts: 8
Karma: 10
Join Date: Feb 2021
Device: iPad mini
|
Also hoping for an eventual solution.
|
|
|
|
|
|
|
|
#6 |
|
Wizard
Posts: 1,325
Karma: 1515835
Join Date: Mar 2009
Location: New Jersey, USA
Device: Kobo Libra Colour, Kindle Paperwhite Signature Edition (2021)
|
Hoping for a workaround here, too. I'll use Instapaper for now, but it would be nice to be able to download the whole paper in one shot.
|
|
|
|
|
|
#7 |
|
Junior Member
Posts: 8
Karma: 10
Join Date: Jan 2021
Device: Kindle app on android tablet
|
Any updates or workarounds for this?
|
|
|
|
|
|
#8 |
|
Member
Posts: 23
Karma: 190
Join Date: Nov 2017
Device: Kindle paperwhite
|
For the past week, the recipe has been pulling section headers (and only headers) without failing. Might this be an opening?
|
|
|
|
|
|
#9 |
|
creator of calibre
Posts: 46,198
Karma: 29626604
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Do you mean it's downloading page titles, or that it's downloading the list of articles?
|
|
|
|
|
|
#10 |
|
Member
Posts: 23
Karma: 190
Join Date: Nov 2017
Device: Kindle paperwhite
|
Page titles: 'The Front Page', 'International', 'National', etc., as headers for separate (otherwise blank) pages.
|
|
|
|
|
|
#11 |
|
creator of calibre
Posts: 46,198
Karma: 29626604
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah no, that just means that nytimes is returning CAPTCHAs after the initial index download. Look at the download job log and you will see error messages about CAPTCHAs.
|
|
|
|
|
|
#12 |
|
Junior Member
Posts: 2
Karma: 10
Join Date: Apr 2026
Device: Kindle Oasis
|
Bypassing CAPTCHAs
OK, after the latest changes at NYTimes.com I was also getting only the front page and section indexes with no article content. But last night I managed to create a scheme and a recipe that got the current articles from the NYTimes successfully downloaded (and then transferred to my Kindle Oasis.)
The scheme first requires logging into NYTimes.com with my subscription and then manually extracting the session cookie "NYT-S" and the anti-scraper DataDome cookie "datadome". I use the Account pane in the recipe to hold these cookies and inject them in place of my login and password. These cookies should not need refreshing for a while: "NYT-S" is only recreated on a new browser or after a logout, and "datadome" stays valid as long as I don't arouse the suspicions of the Times' "intelligent" DataDome firewall and cause it to throw up a CAPTCHA.
The NYTimes also does not like Calibre's built-in headless browser, so I had to spin up a FlareSolverr instance in Docker on my server, which exposes a Chrome browser to use as a proxy; I point the recipe to that browser on one of my server ports. To get this to work I also had to restrict the recipe to only 1 download at a time to avoid arousing the suspicions of DataDome's anti-scraper algorithms. I also commented out a number of NYTimes content sections that I'm not interested in and asked only for the articles from the last 24 hours to keep the download time reasonable.
Even so, it took 2 hours to download the entire edition of the paper, and it puts a fair CPU load on my Celeron-based NAS server. To reduce the download time I may try altering the way the FlareSolverr Chrome browser is instantiated so that it remains running between fetches instead of being restarted for each download.
Has anybody else tried any of these techniques to get past the current NYTimes.com roadblocks?
|
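For readers who want to experiment with the cookie-plus-FlareSolverr idea described in this post, here is a minimal sketch of what a single fetch might look like. It is not the poster's recipe, which was not shared. It assumes FlareSolverr is listening on its default `http://localhost:8191/v1` endpoint and that the two cookie values have already been copied out of a logged-in browser; the function names `build_payload` and `fetch_html` are made up for illustration.

```python
import json
import urllib.request

# Assumption: FlareSolverr runs locally on its default port.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_payload(url, nyt_s, datadome, timeout_ms=60000):
    """Build a FlareSolverr `request.get` command carrying both cookies."""
    return {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": timeout_ms,
        # Cookie values are the ones extracted by hand from the browser.
        "cookies": [
            {"name": "NYT-S", "value": nyt_s},
            {"name": "datadome", "value": datadome},
        ],
    }


def fetch_html(url, nyt_s, datadome, endpoint=FLARESOLVERR_URL):
    """POST the command to FlareSolverr and return the rendered page HTML."""
    body = json.dumps(build_payload(url, nyt_s, datadome)).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)
    if reply.get("status") != "ok":
        raise RuntimeError("FlareSolverr error: %s" % reply.get("message"))
    return reply["solution"]["response"]
```

Inside a calibre recipe, the two cookie values would presumably come from the Account fields (`self.username` and `self.password`), and the one-at-a-time restriction would normally be expressed with the `simultaneous_downloads = 1` class attribute; how the poster actually wired these together is not stated.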
|
|
|
|
|
#13 |
|
Junior Member
Posts: 2
Karma: 10
Join Date: Apr 2026
Device: Kindle Oasis
|
I've made a few changes to my recipe and how it accesses FlareSolverr, and I'm now getting a day and a half's worth of NYTimes articles (145 total) downloaded in an hour and 10 minutes. (I'm on a 300Mb/s connection.)
Big improvement on my original scheme! Just as a comparison point, what rates were people getting from the NYTimes before the last round of anti-scraper algorithms was deployed? I am going to tweak it a little more, but at this point I think my fetches are rate-limited by the Celeron processor in my Synology NAS, which is hitting 75%-85% CPU load during these downloads.
Last edited by audio_inside; 04-05-2026 at 04:18 PM.
|
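The post does not say what was changed in the FlareSolverr access. One option FlareSolverr itself offers for keeping the browser alive between fetches is a named session, so the sketch below shows that approach as an assumption, not as the poster's method. The endpoint URL and the helper names `post_command` and `fetch_all` are illustrative.

```python
import json
import urllib.request

# Assumption: FlareSolverr runs locally on its default port.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def post_command(payload, endpoint=FLARESOLVERR_URL):
    """Send one JSON command to FlareSolverr and return the decoded reply."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def fetch_all(urls, post=post_command):
    """Fetch every URL through a single FlareSolverr browser session."""
    session = post({"cmd": "sessions.create"})["session"]
    pages = {}
    try:
        for url in urls:  # strictly one request at a time
            reply = post({
                "cmd": "request.get",
                "url": url,
                "session": session,
                "maxTimeout": 60000,
            })
            if reply.get("status") == "ok":
                pages[url] = reply["solution"]["response"]
    finally:
        # Always close the browser so it does not linger on the server.
        post({"cmd": "sessions.destroy", "session": session})
    return pages
```

For scale, the figures quoted in the post (145 articles in an hour and 10 minutes) work out to roughly 29 seconds per article, so any per-fetch browser startup cost that a session avoids would be a noticeable share of the total.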
|
|
|