Help to finish the recipe of my favorite news site

martencarlos · 04-17-2024, 02:52 AM

Hi all,

I am trying to create a working news recipe for elcorreo.com, a Spanish news site. There is a built-in recipe, but it only downloads the links and nothing else, not even the cover.

Link to the official news site: www.elcorreo.com

I found that if you open any article and replace the ".html" with "_amp.html", you can open the 'immersive reader' in Edge to read the full article. Inspecting the "_amp.html" page, you can find a script tag holding JSON with the content of the article.

So I started to create a custom recipe, but I can only get so far. I managed to retrieve the cover, title, subtitle and main image and to delete everything else that is not relevant. I need help adding the content of the article, which means replacing the article URLs the recipe fetches and finding the script tag that contains the JSON with the article content.
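As a rough illustration of the script-tag step described above, here is a minimal standard-library sketch that pulls the article text out of a `<script type="application/ld+json">` block. The `articleBody` key is an assumption (it is the usual schema.org NewsArticle property; the thread only says the JSON holds the article content), and `LdJsonCollector`, `extract_article_body` and `SAMPLE` are hypothetical names and data made up for this example.

```python
import json
from html.parser import HTMLParser


class LdJsonCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content can arrive in several chunks, so accumulate it.
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_article_body(html):
    """Return the first 'articleBody' found in a JSON-LD block, or None."""
    collector = LdJsonCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip blocks that are not valid JSON
        items = data if isinstance(data, list) else [data]
        for item in items:
            # ASSUMPTION: the article text lives under 'articleBody'.
            if isinstance(item, dict) and item.get("articleBody"):
                return item["articleBody"]
    return None


# Hypothetical page, not real elcorreo.com markup.
SAMPLE = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "NewsArticle", "headline": "Headline", '
    '"articleBody": "Full article text."}'
    "</script></head><body><h1>Headline</h1></body></html>"
)

print(extract_article_body(SAMPLE))
```

Inside a recipe, something along these lines would most likely run on the raw page source before the script tags are stripped; where exactly to hook it in is left open here.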

This is the code that I have so far:

Spoiler:

I really tried with the calibre API and by reviewing other recipes, but I cannot manage to do it. I think I need to do something in the preprocess_html function, but I have no clue, really.

Can someone with extensive recipe knowledge help?

martencarlos · 04-17-2024, 06:24 AM

I think I found how to replace the desktop URL with the mobile URL by adding this code:

# replace desktop url with mobile url
def get_article_url(self, article):
    desktopUrl = BasicNewsRecipe.get_article_url(self, article)
    mobileUrl = desktopUrl.replace(".html", "_amp.html")
    return mobileUrl

But now I can't retrieve the article image, and I am still missing the article content from the JSON inside the script tag.
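The URL rewrite in the post above uses a plain `str.replace`, which would touch every ".html" in the string. A slightly more defensive sketch, assuming only the end of the URL path should change, could look like this. `to_amp_url` is a hypothetical helper and the example URLs in the usage note are made up.

```python
from urllib.parse import urlsplit, urlunsplit


def to_amp_url(url):
    """Rewrite '<path>.html' to '<path>_amp.html', leaving query and fragment alone."""
    parts = urlsplit(url)
    path = parts.path
    # Only touch paths that end in .html and are not already AMP pages.
    if path.endswith(".html") and not path.endswith("_amp.html"):
        path = path[: -len(".html")] + "_amp.html"
    return urlunsplit(parts._replace(path=path))


print(to_amp_url("https://www.elcorreo.com/bizkaia/some-article.html?ref=rss"))
```

In the recipe, the overridden `get_article_url` could pass the result of `BasicNewsRecipe.get_article_url(self, article)` through a helper like this before returning it.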

unkn0wn · 04-17-2024, 07:05 AM

The built-in recipe isn't working?

martencarlos · 04-17-2024, 07:49 AM

Quote:

Originally Posted by unkn0wn

The built-in recipe isn't working?

No, it only includes the links to the index and the link to each article.

unkn0wn · 04-18-2024, 02:59 AM

https://github.com/kovidgoyal/calibr...7b66c77715216f

I just tested this and the output is too large (>120 MB). Help me cut out some of the feeds.

There are so many articles from this website, even in just the past 24 hours.

Code:

feeds = [
        ('Portada', 'http://www.elcorreo.com/rss/atom/portada'),
        ('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Bizkaia', 'http://www.elcorreo.com/rss/atom/?section=bizkaia'),
        ('Opinión', 'https://www.elcorreo.com/rss/atom/?section=opinion'),
        ('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Ciencia', 'https://www.elcorreo.com/rss/atom/?section=ciencia'),
        ('Guipuzkoa', 'http://www.elcorreo.com/rss/atom/?section=gipuzkoa'),
        ('Araba', 'http://www.elcorreo.com/rss/atom/?section=araba'),
        ('La Rioja', 'http://www.elcorreo.com/rss/atom/?section=larioja'),
        ('Miranda', 'http://www.elcorreo.com/rss/atom/?section=miranda'),
        ('Economía', 'http://www.elcorreo.com/rss/atom/?section=economia'),
        ('Culturas', 'http://www.elcorreo.com/rss/atom/?section=culturas'),
        ('Politica', 'http://www.elcorreo.com/rss/atom/?section=politica'),
        ('De tiendas', 'https://www.elcorreo.com/rss/atom/?section=de-tiendas'),
        ('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Elecciones', 'https://www.elcorreo.com/rss/atom/?section=elecciones'),
        ('Sociedad', 'https://www.elcorreo.com/rss/atom/?section=sociedad'),
        ('Vivir', 'https://www.elcorreo.com/rss/atom/?section=vivir'),
        ('Tecnología', 'http://www.elcorreo.com/rss/atom/?section=tecnologia'),
        ('Gente - Estilo', 'http://www.elcorreo.com/rss/atom/?section=gente-estilo'),
        ('Planes', 'http://www.elcorreo.com/rss/atom/?section=planes'),
        ('Athletic', 'http://www.elcorreo.com/rss/atom/?section=athletic'),
        ('Alavés', 'http://www.elcorreo.com/rss/atom/?section=alaves'),
        ('Bilbao Basket', 'http://www.elcorreo.com/rss/atom/?section=bilbaobasket'),
        ('Baskonia', 'http://www.elcorreo.com/rss/atom/?section=baskonia'),
        ('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Jaiak', 'http://www.elcorreo.com/rss/atom/?section=jaiak'),
        ('La Blanca', 'http://www.elcorreo.com/rss/atom/?section=la-blanca-vitoria'),
        ('Aste Nagusia', 'http://www.elcorreo.com/rss/atom/?section=aste-nagusia-bilbao'),
        ('Semana Santa', 'http://www.elcorreo.com/rss/atom/?section=semana-santa'),
        ('Festivales', 'http://www.elcorreo.com/rss/atom/?section=festivales')
    ]

martencarlos · 04-18-2024, 09:55 AM

Hi,

I left the most important ones. There can't be that many articles in 24 hours. I guess a lot are duplicates that appear in more than one feed. This is not a big newspaper.
Also, maybe the images are not optimized.

feeds = [
        ('Portada', 'http://www.elcorreo.com/rss/atom/portada'),
        ('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Bizkaia', 'http://www.elcorreo.com/rss/atom/?section=bizkaia'),
        ('Opinión', 'https://www.elcorreo.com/rss/atom/?section=opinion'),
        ('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Ciencia', 'https://www.elcorreo.com/rss/atom/?section=ciencia'),
        ('Economía', 'http://www.elcorreo.com/rss/atom/?section=economia'),
        ('Politica', 'http://www.elcorreo.com/rss/atom/?section=politica'),
        ('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Tecnología', 'http://www.elcorreo.com/rss/atom/?section=tecnologia'),
        ('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
    ]
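Some entries in the feed lists in this thread point at the same section: 'Mundo' and 'Internacional' both use `section=internacional`, and 'Deportes' appears twice, differing only in http vs https. A small sketch for dropping such repeats before handing the list to the recipe, assuming the scheme makes no difference to the feed content (`dedupe_feeds` is a hypothetical helper, not part of calibre):

```python
from urllib.parse import urlsplit


def dedupe_feeds(feeds):
    """Keep the first (title, url) pair for each feed, ignoring http vs https."""
    seen = set()
    unique = []
    for title, url in feeds:
        parts = urlsplit(url)
        # ASSUMPTION: host + path + query identify a feed; the scheme does not.
        key = (parts.netloc.lower(), parts.path.rstrip("/"), parts.query)
        if key in seen:
            continue
        seen.add(key)
        unique.append((title, url))
    return unique


feeds = [
    ('Portada', 'http://www.elcorreo.com/rss/atom/portada'),
    ('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
    ('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
    ('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
    ('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
]

for title, url in dedupe_feeds(feeds):
    print(title, url)
```

The recipe's own `ignore_duplicate_articles = {'title', 'url'}` setting works at the article level; this only trims repeated feed entries up front.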

martencarlos · 04-18-2024, 10:35 AM

Btw, I just had a look at the new Recipe code and used it to download the newspaper. I don't know how you did it but it looks perfect. Thank you very much!

And you were right, there are a lot of articles and they are not duplicated.

Any ideas on how we could reduce the size so it can be sent via email to the Kindle?

Maybe optimize images?

Thanks again! Really appreciate it.

martencarlos · 04-18-2024, 11:20 AM

OK, by adding the following I managed to shrink the EPUB to a decent size to send via email:

max_articles_per_feed = 10 #articles
compress_news_images = True

Spoiler (recipe code from martencarlos's first post, 04-17-2024, 02:52 AM):

#!/usr/bin/env python
__license__ = 'GPL v3'
__author__ = 'Carlos Marten based on Kovid Goyal official version'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
description = 'Elcorreo Newspaper (Spain) - v1.0 16.04.2022'
__docformat__ = 'restructuredtext en'

'''
Elcorreo.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from html5_parser import parse
import datetime
from datetime import date


class Elcorreo(BasicNewsRecipe):
    __author__ = 'Carlos Marten'
    description = 'Elcorreo'
    now = datetime.datetime.now()
    title = u'El Correo ['+str(date.today())+']'
    publisher = u'Ediciones El Pa\xeds SL'
    category = 'News, politics, culture, economy, general interest'
    language = 'es'
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 5
    max_articles_per_feed = 4
    recursion = 2
    no_stylesheets = True
    remove_attributes = ['width', 'height','display','margin','padding', 'position','border']
    remove_javascript = True
    use_embedded_content = False
    ignore_duplicate_articles = {'title', 'url'}
    compress_news_images = False
    #auto_cleanup = True
    #scale_news_images_to_device = True

    def getcoverurl():
        now = datetime.datetime.now()
        return 'https://portada.iperiodico.es/'+str(now.year)+'/0'+str(now.month)+'/'+str(now.day)+'_elcorreo.750.jpg'

    cover_url = getcoverurl()

    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    extra_css = '''
        img{ all: initial; width: 100% }
        h1 { font-size: 22px }
        h2 { font-size: 20px }
    '''

    keep_only_tags = [
        dict(name='h1', attrs={'class': [
            'v-a-t', #title
        ]}),
        dict(name='h2', attrs={'class': [
            'v-a-sub-t', #subtitle
        ]}),
        dict(name='script', attrs={'type': 'application/ld+json',}), #json with article (closed)
        dict(name='article', attrs={'class': [
            'v-a v-a--d v-a--d-bs v-a--p-b', #article
        ]}),
        dict(name='div', attrs={'class': [
            'amp-access-hide', #article (closed)
        ]}),
    ]

    remove_tags = [
        dict(attrs={'class': [
            'v-drpw__w', #social
            'v-mdl-tpc', #section topics related
            'content-exclusive-bg', #paywall
            'v-d__btn-c', #comment and report error
            'v-i-b', #share
            'v-pill-m', #play icon and enlarge image
            'v-mdl-ath__c', #comments
        ]},),
        dict(attrs={'class': [
            'v-a-img', #image
        ]},),
    ]

    def postprocess_html(self, soup, first):
        return soup

    feeds = [
        (u'Portada', u'https://www.elcorreo.com/rss/2.0/portada/'),
    ]

    calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
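One detail in the recipe above: `getcoverurl` builds the month as `'/0' + str(now.month)`, which would give three digits from October to December. A sketch of the same URL pattern with explicit zero-padding follows. Whether the site also zero-pads the day is not stated in the thread, so that is an assumption exposed as the `pad_day` parameter; `cover_url_for` is a hypothetical helper name.

```python
import datetime


def cover_url_for(day, pad_day=True):
    """Build the cover URL for a date using the pattern from the recipe above."""
    month = f"{day.month:02d}"  # always two digits: 04, 11, ...
    # ASSUMPTION: the day may or may not be zero-padded on the server.
    dom = f"{day.day:02d}" if pad_day else str(day.day)
    return f"https://portada.iperiodico.es/{day.year}/{month}/{dom}_elcorreo.750.jpg"


print(cover_url_for(datetime.date(2024, 4, 17)))
```

In the recipe, `cover_url = cover_url_for(datetime.date.today())` would take the place of the `getcoverurl()` call.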
