|
|
#16 |
|
Guru
Posts: 646
Karma: 85520
Join Date: May 2021
Device: kindle
|
The WSJ Magazine and WSJ News recipes will no longer work. We have been extremely lucky for some time, as I found a workaround for WSJ with GraphQL.
archive.is is being served a CAPTCHA page when it fetches content from WSJ, and we can't do anything to fix that. Maybe wait for archive.is to update. |
|
|
|
|
|
#17 |
|
onlinenewsreader.net
Posts: 336
Karma: 10143
Join Date: Dec 2009
Location: Kelowna BC
Device: Various
|
archive.is screening
It appears that archive.is is screening traffic from wifi networks. Accessing archive.is via mobile networks (probably determined from carrier IP address) doesn’t attract the screening. I’m assuming this is an anti-scraping strategy, and it’s not confined to WSJ.
|
|
|
|
|
|
|
|
#18 |
|
Guru
Posts: 646
Karma: 85520
Join Date: May 2021
Device: kindle
|
Someone who is facing this issue should try adding a delay to the recipe and tell us whether it works.
I don't use this recipe much, and I could not replicate the issue for testing. |
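For anyone willing to test, here is a minimal sketch of what adding a delay could look like. `delay` and `simultaneous_downloads` are standard `BasicNewsRecipe` attributes; the class name, title, and the 5-second value are placeholders and are not taken from the actual WSJ recipe. The fallback base class is only there so the snippet can be loaded outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be imported outside calibre; not the real base class.
    class BasicNewsRecipe:
        pass


class WSJDelayTest(BasicNewsRecipe):  # hypothetical name, not the shipped recipe
    title = 'WSJ (delay test)'
    # Seconds to wait between consecutive downloads (assumed value; tune as needed).
    delay = 5
    # Fetch one article at a time so requests do not arrive in bursts.
    simultaneous_downloads = 1
```

In the real recipe, only the two attribute lines would need to be added to the existing class.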
|
|
|
|
|
#19 |
|
onlinenewsreader.net
Posts: 336
Karma: 10143
Join Date: Dec 2009
Location: Kelowna BC
Device: Various
|
Delay doesn’t affect this: I get screened on the first attempt to load an article via wifi, but if I switch to mobile data and try again, the article loads. Note that I’m not using Calibre for this.
|
|
|
|
|
|
#20 |
|
Junior Member
Posts: 1
Karma: 10
Join Date: Nov 2025
Device: Kindle
|
I’m also getting this error. Mobile data didn’t change it.
|
|
|
|
|
|
|
|
#21 |
|
onlinenewsreader.net
Posts: 336
Karma: 10143
Join Date: Dec 2009
Location: Kelowna BC
Device: Various
|
archive.is behaviour
archive.is uses a combination of web browser detection and geolocation to determine if a captcha challenge should be presented.
Any access from a web browser is challenged. Any access from a USA IP address is challenged. From an IP address outside of the USA, mobile apps using iOS or Android HTTP access are not challenged. However, VPNs don’t help; it seems archive.is detects them and issues a challenge. If the difference between web browser access and iOS/Android app access could be determined, it might be possible to modify the Python mechanize apparatus to mimic the native apps and get around the captcha challenge. However, it would only work for users outside of the USA. So, unless someone can figure out how to successfully respond to a captcha challenge, it looks like the end of the line for recipes that depend on archive.is. |
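A rough sketch of the "mimic the native apps" idea, for illustration only. It uses the standard library's `urllib.request` rather than calibre's mechanize browser so that it runs on its own, and it only builds the request without sending it. The header values (including the `CFNetwork`-style User-Agent string) and the URL are illustrative assumptions; what archive.is actually checks is not known, so this may well make no difference.

```python
from urllib.request import Request

# Illustrative values only: what distinguishes a native app request is unknown.
APP_LIKE_HEADERS = {
    'User-Agent': 'ExampleReader/1.0 CFNetwork/1490.0.4 Darwin/23.2.0',
    'Accept': '*/*',
    'Accept-Language': 'en-CA,en;q=0.9',
}


def build_app_like_request(url, headers=APP_LIKE_HEADERS):
    """Return an unsent Request carrying the given headers."""
    return Request(url, headers=dict(headers))


req = build_app_like_request('https://archive.is/example')  # placeholder URL
# urllib stores header names capitalized as 'User-agent'.
print(req.get_header('User-agent'))
```

In a recipe, the equivalent would presumably be setting the same headers on the browser returned by `get_browser()`.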
|
|
|
|
|
#22 |
|
onlinenewsreader.net
Posts: 336
Karma: 10143
Join Date: Dec 2009
Location: Kelowna BC
Device: Various
|
More on url blocking
It looks like Cloudflare is being used widely as an anti-scraping and bot blocking service.
Cloudflare has developed a mechanism called "Private Access Tokens" which is subscribed to by iOS and Android to provide validation that a network request is originating from an actual user device. This mechanism is invoked both by web browsers and native apps using iOS or Android network requests. Private Access Tokens are intended to reduce (or even eliminate) the need for captcha challenges to block scrapers and bots, and it seems to be very successful. It looks like archive.is is using Cloudflare and its own mechanisms (see my previous message) to repel scrapers and bots. Interestingly, archive.is issues captcha challenges for access from the iOS Safari browser but not for native apps using iOS URLSession. None of this suggests a way to get around Cloudflare: calibre is a web scraper, and Cloudflare is doing what it is designed to do by blocking it. But it does shed some light on why native apps that access network resources on demand (as opposed to batch scraping them) continue to work. |
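A recipe cannot answer these challenges, but it could at least recognise one and stop with a clear message instead of saving a captcha page as an article. A small sketch, assuming the response exposes its status code and headers; the function name is made up, and the two signals checked (a `cf-mitigated: challenge` response header and a `WWW-Authenticate: PrivateToken` challenge) are heuristics that may not match what archive.is actually sends.

```python
def looks_like_challenge(status, headers):
    """Heuristic: True if a response looks like a bot challenge, not content."""
    # Header names are case-insensitive, so normalise them first.
    h = {k.lower(): v for k, v in headers.items()}
    if h.get('cf-mitigated', '').strip().lower() == 'challenge':
        return True
    auth = h.get('www-authenticate', '')
    if status == 401 and auth.lower().startswith('privatetoken'):
        return True
    return False


print(looks_like_challenge(403, {'cf-mitigated': 'challenge'}))
print(looks_like_challenge(200, {'Content-Type': 'text/html'}))
```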
|
|
|
| Tags |
| wsj |
|