#1 |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
Self-built RSS recipe crawl fails after calibre 6.12 update
After importing the OPML file, every RSS feed from Google News fails: only the headline is extracted, with no content. Please take a look, thank you very much!
I have tested all the RSS feeds imported from the OPML file, and every feed from news.google.com fails the same way: headline only, no content. This problem did not occur before the 6.12 update; it appeared with today's update. Other RSS feeds that are not from news.google.com/rss are extracted normally. I reinstalled version 6.11 and still have the problem.
Last edited by fengli; 02-03-2023 at 09:48 PM.
#2 |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
For example:
    #!/usr/bin/env python
    # vim:fileencoding=utf-8
    from calibre.web.feeds.news import BasicNewsRecipe


    class AdvancedUserRecipe1675479003(BasicNewsRecipe):
        title = 'Google新闻-科技巨头Eng'
        oldest_article = 1
        max_articles_per_feed = 100
        auto_cleanup = True

        feeds = [
            ('"ASML" - Google News', 'https://news.google.com/news/rss/search?q=ASML&hl=en'),
            ('"twitter" - Google News', 'https://news.google.com/news/rss/search?q=twitter&hl=en'),
            ('"intel" - Google News', 'https://news.google.com/news/rss/search?q=intel&hl=en'),
            ('tencent - Google News', 'http://news.google.com/news?hl=en&gl=us&q=tencent&um=1&ie=UTF-8&output=rss'),
            ('amazon - Google News', 'http://news.google.com/news?hl=en&gl=us&q=amazon&um=1&ie=UTF-8&output=rss'),
            ('twitter - Google News', 'https://news.google.com/news/rss/search/section/q/twitter/twitter?hl=en&gl=US'),
            ('Ubuntu - Google News', 'http://news.google.com/news?hl=en&gl=us&q=Ubuntu&um=1&ie=UTF-8&output=rss'),
            ('TSMC - Google News', 'https://news.google.com/news/rss/search/section/q/TSMC/TSMC?hl=en&gl=US'),
            ('Google - Google News', 'https://news.google.com/news/rss/search/section/q/Google/Google?hl=en&gl=US'),
            ('alibaba - Google News', 'https://news.google.com/news/rss/search/section/q/alibaba/alibaba?hl=en&gl=US'),
            ('Apple - Google News', 'https://news.google.com/news/rss/search/section/q/Apple/Apple?hl=en&gl=US'),
            ('"tiktok" - Google News', 'https://news.google.com/news/rss/search/section/q/tiktok/tiktok?hl=en&gl=US&ned=us'),
            ('huawei - Google News', 'https://news.google.com/news/rss/search/section/q/huawei/huawei?hl=en&gl=US'),
            ('Amazon - Google News', 'https://news.google.com/news/rss/search/section/q/Amazon/Amazon?hl=en&gl=US'),
            ('space x - Google News', 'http://news.google.com/news?hl=en&gl=us&q=space%20x&um=1&ie=UTF-8&output=rss'),
            ('"AMD" - Google News', 'https://news.google.com/news/rss/search?q=AMD&hl=en'),
            ('"Nvidia" - Google News', 'https://news.google.com/news/rss/search?q=Nvidia&hl=en'),
            ('"STMicroelectronics" - Google News', 'https://news.google.com/news/rss/search?q=STMicroelectronics&hl=en'),
            ('"Broadcom" - Google News', 'https://news.google.com/news/rss/search?q=Broadcom&hl=en'),
            ('qualcomm - Google News', 'https://news.google.com/news/rss/search/section/q/qualcomm/qualcomm?hl=en&gl=US'),
            ('"MediaTek" - Google News', 'https://news.google.com/news/rss/search?q=MediaTek&hl=en'),
            ('"ZTE" - Google News', 'https://news.google.com/news/rss/search?q=ZTE&hl=en'),
            ('"huawei" - Google News', 'https://news.google.com/news/rss/search?q=huawei&hl=en'),
            ('"TSMC" - Google News', 'https://news.google.com/news/rss/search?q=TSMC&hl=en'),
            ('"Samsung" - Google News', 'https://news.google.com/news/rss/search?q=Samsung&&hl=en-US&gl=US&ceid=US:en'),
            ('"meta" - Google News', 'https://news.google.com/news/rss/search?q=meta&hl=en'),
            ('google新闻', 'https://news.google.com/news/rss/headlines/section/topic/TECHNOLOGY?ned=us&hl=en&gl=US'),
            ('microsoft', 'https://news.google.com/news/rss/search/section/q/microsoft/microsoft?hl=en&gl=US&ned=us'),
            ('amazone', 'https://news.google.com/news/rss/search/section/q/amazone/amazone?hl=en&gl=US&ned=us'),
            ('Google', 'https://news.google.com/news/rss/search/section/q/Google/Google?hl=en&gl=US&ned=us'),
            ('facebook', 'https://news.google.com/news/rss/search/section/q/facebook/facebook?hl=en&gl=US&ned=us'),
            ('apple', 'https://news.google.com/news/rss/search/section/q/apple/apple?hl=en&gl=US&ned=us'),
        ]

Last edited by fengli; 02-03-2023 at 09:52 PM.
#3 |
Guru
Posts: 644
Karma: 85520
Join Date: May 2021
Device: kindle
|
Code:
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        try:
            br.open(url)
        except Exception as e:
            url = e.hdrs.get('location')
        soup = self.index_to_soup(url)
        link = soup.find('a', href=True)
        html = br.open(link['href']).read()
        pt = PersistentTemporaryFile('.html')
        pt.write(html)
        pt.close()
        return pt.name

Last edited by unkn0wn; 02-04-2023 at 01:50 AM.
#4 | |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
Quote:
Test recipe:

    #!/usr/bin/env python
    # vim:fileencoding=utf-8
    from calibre.web.feeds.news import BasicNewsRecipe


    class AdvancedUserRecipe1675504328(BasicNewsRecipe):
        title = 'Google news-ceshi'
        oldest_article = 1
        max_articles_per_feed = 100
        auto_cleanup = True
        articles_are_obfuscated = True

        def get_obfuscated_article(self, url):
            br = self.get_browser()
            try:
                br.open(url)
            except Exception as e:
                url = e.hdrs.get('location')
            soup = self.index_to_soup(url)
            link = soup.find('a', href=True)
            html = br.open(link['href']).read()
            pt = PersistentTemporaryFile('.html')
            pt.write(html)
            pt.close()
            return pt.name

        feeds = [
            ('"ASML" - Google News', 'https://news.google.com/news/rss/search?q=ASML&hl=en'),
            ('"intel" - Google News', 'https://news.google.com/news/rss/search?q=intel&hl=en'),
            ('amazon - Google News', 'http://news.google.com/news?hl=en&gl=us&q=amazon&um=1&ie=UTF-8&output=rss'),
            ('Ubuntu - Google News', 'http://news.google.com/news?hl=en&gl=us&q=Ubuntu&um=1&ie=UTF-8&output=rss'),
        ]

Google news-ceshi.recipe
Last edited by fengli; 02-04-2023 at 05:07 AM.
|
#5 |
Guru
Posts: 644
Karma: 85520
Join Date: May 2021
Device: kindle
|
Add these two imports at the top of the recipe.
Code:

    from calibre import browser
    from calibre.ptempfile import PersistentTemporaryFile
#6 |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
|
#7 | |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
Crawling Google News RSS has suddenly started failing. Please help, thank you very much.
Quote:
N001-economic.recipe
Error message:

    Using user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
    Using proxies: {'http': '127.0.0.1:7890', 'https': '127.0.0.1:7890', 'ftp': 'http://127.0.0.1:7890'}
    Failed to download article: Umstieg auf E-Autos: Autoindustrie: Mehr als jede zweite Firma plant Stellenabbau - Zeit Online from https://news.google.com/rss/articles...iYmF10gEA?oc=5
    Traceback (most recent call last):
      File "calibre\utils\threadpool.py", line 100, in run
      File "calibre\web\feeds\news.py", line 1201, in fetch_obfuscated_article
      File "<string>", line 23, in get_obfuscated_article
    TypeError: 'NoneType' object is not subscriptable
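
The `TypeError` in this traceback is consistent with `soup.find('a', href=True)` returning `None`, that is, the page Google served contained no link to follow, so `link['href']` fails. A minimal sketch of that situation, using only the Python standard library as a stand-in for calibre's soup object; the names `first_link` and `_FirstLinkFinder` are hypothetical and not part of calibre:

```python
from html.parser import HTMLParser
from typing import Optional


class _FirstLinkFinder(HTMLParser):
    """Remember the href of the first <a> tag that has one (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and self.href is None:
            # attrs is a list of (name, value) pairs; a bare attribute has value None
            self.href = dict(attrs).get('href')


def first_link(html: str) -> Optional[str]:
    """Return the first anchor href found in html, or None when there is none."""
    finder = _FirstLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.href


# A page with a link yields its href; a page without one yields None,
# which is the case that would break `link['href']` in the recipe.
with_link = first_link('<p><a href="https://example.com/article">read</a></p>')
without_link = first_link('<html><body>no anchors here</body></html>')
```

In the recipe itself, the equivalent check would be testing `link is None` before using `link['href']`. That only produces a clearer error; it would not make the download succeed.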
|
#8 |
Guru
Posts: 644
Karma: 85520
Join Date: May 2021
Device: kindle
|
Yeah, it looks like the Google feeds won't work anymore.
They've made it harder: the publisher URL now seems to come back from a POST request like this one.
Code:

    {
      "POST": {
        "scheme": "https",
        "host": "news.google.com",
        "filename": "/_/DotsSplashUi/data/batchexecute",
        "query": {
          "rpcids": "Fbv4je",
          "source-path": "/rss/articles/CBMiX2h0dHBzOi8vd3d3LnplaXQuZGUvbmV3cy8yMDI0LTA3LzE4L2F1dG9pbmR1c3RyaWUtbWVoci1hbHMtamVkZS16d2VpdGUtZmlybWEtcGxhbnQtc3RlbGxlbmFiYmF10gEA",
          "f.sid": "-5052485330158874245",
          "bl": "boq_dotssplashserver_20240715.12_p1",
          "hl": "en-IN",
          "gl": "IN",
          "soc-app": "140",
          "soc-platform": "1",
          "soc-device": "1",
          "_reqid": "123443",
          "rt": "c"
        },
        "remote": {
          "Address": ""
        }
      }
    }

    response

    )]}'

    221
    [["wrb.fr","Fbv4je","[\"garturlres\",\"https://www.zeit.de/news/2024-07/18/autoindustrie-mehr-als-jede-zweite-firma-plant-stellenabbau\"]",null,null,null,"generic"],["di",13],["af.httprm",13,"-7658855237455742109",108]]
    25
    [["e",4,null,null,257]]
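
For anyone experimenting with this, the response above can be picked apart with the standard library alone. The sketch below is based only on the single response shown in this post: it assumes the body starts with a `)]}'` guard, that the JSON arrays are separated by length lines, and that the publisher URL is the second element of the string carried by the `wrb.fr` entry. The endpoint is undocumented, so all of that may change; `extract_article_url` is a hypothetical helper name, and the line breaks in `SAMPLE` are a guess at the original layout.

```python
import json
from typing import Optional

# The response body from the post above, with line breaks restored as an assumption.
SAMPLE = r''')]}'

221
[["wrb.fr","Fbv4je","[\"garturlres\",\"https://www.zeit.de/news/2024-07/18/autoindustrie-mehr-als-jede-zweite-firma-plant-stellenabbau\"]",null,null,null,"generic"],["di",13],["af.httprm",13,"-7658855237455742109",108]]
25
[["e",4,null,null,257]]
'''


def extract_article_url(body: str, rpc_id: str = 'Fbv4je') -> Optional[str]:
    """Return the URL carried by the matching 'wrb.fr' entry, or None if absent."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip the guard prefix and the length lines by jumping to the next '['.
        start = body.find('[', pos)
        if start == -1:
            return None
        try:
            chunk, pos = decoder.raw_decode(body, start)
        except json.JSONDecodeError:
            pos = start + 1
            continue
        for entry in chunk:
            if (isinstance(entry, list) and len(entry) >= 3
                    and entry[0] == 'wrb.fr' and entry[1] == rpc_id
                    and isinstance(entry[2], str)):
                try:
                    # The third element is itself a JSON document stored as a string.
                    payload = json.loads(entry[2])
                except json.JSONDecodeError:
                    continue
                if isinstance(payload, list) and len(payload) >= 2:
                    return payload[1]


url = extract_article_url(SAMPLE)
```

This only covers reading the response. Building the POST request itself (the `f.sid`, `bl`, and `_reqid` values in particular) is not addressed here.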
#9 | |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
Quote:
|
|
#10 |
Resident Curmudgeon
Posts: 80,655
Karma: 150249619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
@fengli is there any reason you cannot update to the latest version 7 of calibre?
|
#11 |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
|
#12 |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
The post that helped me fix this before is here, but that fix no longer works. Please help me fix it. Thank you very much.
|
#13 |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
|
#14 |
Connoisseur
Posts: 98
Karma: 10
Join Date: Aug 2022
Device: PC
|
I suspect that, with the development of AI, Google has strengthened its anti-scraping measures.
|