The Spectator - only title and synopsis

nano5 · 10-12-2023, 08:36 PM

The last two downloads of "The Spectator" fetched only the title and synopsis; this has been happening since October. The article body content is missing.

unkn0wn · 10-14-2023, 11:11 AM

You can use the attached recipe. It will load all articles, but it is still a temporary solution (it might fail due to too many requests).

Time for someone to figure out and add login code to the recipe.
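A rough sketch of one way to soften "too many requests" failures: retry the download with an increasing pause between attempts. This is not taken from the attached recipe; `fetch_with_backoff`, `fetch`, and the delay values are all hypothetical names and numbers chosen for illustration.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on failure.

    Hypothetical helper: `fetch` is any callable that returns the page
    content or raises an exception when the request fails.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, surface the last error
            # wait base_delay, then 2x, then 4x, ... before trying again
            sleep(base_delay * 2 ** attempt)
```

Whether the site tolerates retries at all is something that would need testing; the pause lengths here are guesses.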

unkn0wn · 10-15-2023, 01:43 AM

@kovidgoyal how can I make use of the Wayback Machine? Is it NYTimes exclusive?

kovidgoyal · 10-15-2023, 02:49 AM

Yes, I would need to add support for Spectator to it. What is the URL scheme for Spectator? If it has a decent URL scheme I might be able to do it.

unkn0wn · 10-15-2023, 03:13 AM

https://web.archive.org/web/20231013...support-hamas/
Looks like the Wayback Machine doesn't have access to these articles.

https://archive.today/ works but has a different URL scheme and captcha checks. Can we do something for archive.today?
https://archive.ph/K6f5r

kovidgoyal · 10-15-2023, 03:39 AM

The archive.org entries are paywalled as well, so no point there. As for archive.today, no idea, I have never used it.

kovidgoyal · 10-15-2023, 03:58 AM

I took a brief look at archive.is. Changing the recipe to use it should be as simple as replacing the article URLs with URLs of the form

https://archive.is/latest/original_url

I don't know what their rate limiting and captcha policies are; that will require experimentation.
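As a minimal sketch of the URL rewrite described above: the archive.is/latest/ prefix comes from this post, while the function name `archive_latest_url` and the example article URL are made up for illustration.

```python
ARCHIVE_PREFIX = 'https://archive.is/latest/'


def archive_latest_url(original_url):
    """Return the archive.is 'latest snapshot' URL for an article URL.

    Hypothetical helper: it only prepends the prefix and leaves the
    original URL unchanged, matching the form shown in the post.
    """
    return ARCHIVE_PREFIX + original_url


# Example with a placeholder article URL
print(archive_latest_url('https://example.com/article/some-story/'))
```

In a recipe this would presumably be applied to each article URL before fetching; how the site responds to automated requests is still untested here.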

unkn0wn · 10-15-2023, 07:05 AM

Although it loads content in the browser, there's no response in calibre for these URLs.

Code:

Traceback (most recent call last):
  File "mechanize\_urllib2_fork.py", line 1238, in do_open
  File "http\client.py", line 1374, in getresponse
  File "http\client.py", line 318, in begin
  File "http\client.py", line 287, in _read_status
http.client.RemoteDisconnected: Remote end closed connection without response

I tried using print_version with 'https://archive.is/latest/' + url

If we can get a response, we can also fix the WSJ recipe.

kovidgoyal · 10-15-2023, 10:34 AM

Does it work if you use the read_url() function from calibre.scraper.simple?

unkn0wn · 10-15-2023, 01:41 PM

Code:

from calibre.scraper.simple import read_url
from calibre.ptempfile import PersistentTemporaryFile
...
    # list passed to read_url on every call
    storage = []

    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        # fetch the archived copy instead of the original article URL
        raw = read_url(self.storage, 'https://archive.is/latest/' + url)
        # write the HTML to a temp file and return its path
        pt = PersistentTemporaryFile('.html')
        pt.write(raw.encode('utf-8'))
        pt.close()
        return pt.name

I used the get_obfuscated_article method.

It works, but is there a simpler way?
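For anyone trying the temp-file step outside calibre, here is a standard-library sketch of the same pattern: write the HTML to a file that survives being closed and return its path. It uses Python's `tempfile` module as a stand-in for `PersistentTemporaryFile`, which is an assumption made for illustration and not how calibre implements it; `save_html_to_temp_file` is a hypothetical name.

```python
import tempfile


def save_html_to_temp_file(html):
    """Write HTML text to a temporary .html file and return its path.

    Hypothetical stand-in for the PersistentTemporaryFile step above.
    """
    # delete=False keeps the file on disk after it is closed,
    # so the caller can open it later by path
    with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as f:
        f.write(html.encode('utf-8'))
        return f.name
```

The caller is responsible for deleting the file once it has been read.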

kovidgoyal · 10-15-2023, 10:40 PM

Easier in what sense?

unkn0wn · 10-15-2023, 11:01 PM

idk, is this the right method though? I am a noob here.

Can we do it without writing into a temp file through get_obfuscated_article?

kovidgoyal · 10-16-2023, 01:58 AM

The cost of creating a temp file is insignificant compared to actually downloading, so it doesn't matter, but I added some code to allow avoiding the temp file: https://github.com/kovidgoyal/calibr...6689de07213fbe

unkn0wn · 10-16-2023, 12:13 PM

Will be able to use this in the next update, I guess. Thanks.
