04-17-2024, 02:52 AM   #1
martencarlos
Junior Member
Posts: 6
Karma: 10
Join Date: Apr 2024
Device: Kindle paperwhite 2022
Help to finish the recipe of my favorite news site

Hi all,

I am trying to create a working news recipe for elcorreo.com, a Spanish news site. There is a built-in recipe, but it only downloads the article links and nothing else, not even the cover.

Link to the official news site: www.elcorreo.com

I found that if you open any article and replace ".html" with "_amp.html" in its URL, you can use the 'immersive reader' in Edge to read the full article. Inspecting the "_amp.html" page, you can also find a script tag holding JSON with the content of the article.

So I started to write a custom recipe, but I can only get so far. I managed to retrieve the cover, title, subtitle and main image, and to delete everything else that is not relevant. I still need help adding the article content: rewriting the article URLs to the "_amp.html" ones and finding the script tag that contains the JSON with the article content.

This is the code that I have so far:

#!/usr/bin/env python
__license__ = 'GPL v3'
__author__ = 'Carlos Marten based on Kovid Goyal official version'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
description = 'Elcorreo Newspaper (Spain) - v1.0 16.04.2022'
__docformat__ = 'restructuredtext en'

'''
Elcorreo.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from html5_parser import parse
import datetime
from datetime import date


class Elcorreo(BasicNewsRecipe):
    __author__ = 'Carlos Marten'
    description = 'Elcorreo'
    now = datetime.datetime.now()
    title = u'El Correo ['+str(date.today())+']'
    publisher = u'Ediciones El Pa\xeds SL'
    category = 'News, politics, culture, economy, general interest'

    language = 'es'
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 5
    max_articles_per_feed = 4
    recursion = 2

    no_stylesheets = True
    remove_attributes = ['width', 'height', 'display', 'margin', 'padding', 'position', 'border']
    remove_javascript = True
    use_embedded_content = False
    ignore_duplicate_articles = {'title', 'url'}
    compress_news_images = False

    # auto_cleanup = True
    # scale_news_images_to_device = True

    # Called once while the class body is evaluated, hence no `self`.
    def getcoverurl():
        now = datetime.datetime.now()
        # %02d zero-pads the month, so October to December also work.
        return 'https://portada.iperiodico.es/%d/%02d/%d_elcorreo.750.jpg' % (now.year, now.month, now.day)
    cover_url = getcoverurl()

    # Replace every link that wraps plain text with that text.
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    extra_css = '''
        img{
            all: initial;
            width: 100%
        }
        h1 { font-size: 22px }
        h2 { font-size: 20px }

    '''

    keep_only_tags = [
        dict(name='h1', attrs={'class': [
            'v-a-t',  # title
        ]}),
        dict(name='h2', attrs={'class': [
            'v-a-sub-t',  # subtitle
        ]}),

        dict(name='script', attrs={'type': 'application/ld+json'}),  # json with article (closed)

        dict(name='article', attrs={'class': [
            'v-a v-a--d v-a--d-bs v-a--p-b',  # article
        ]}),
        dict(name='div', attrs={'class': [
            'amp-access-hide',  # article (closed)
        ]}),

    ]

    remove_tags = [
        dict(attrs={'class': [
            'v-drpw__w',  # social
            'v-mdl-tpc',  # section topics related
            'content-exclusive-bg',  # paywall
            'v-d__btn-c',  # comment and report-an-error buttons
            'v-i-b',  # share
            'v-pill-m',  # play and enlarge-image icons
            'v-mdl-ath__c',  # comments
        ]},),
        dict(attrs={'class': [
            'v-a-img',  # image

        ]},),

    ]

    # Currently a no-op; kept as a hook for later changes.
    def postprocess_html(self, soup, first):
        return soup

    feeds = [
        (u'Portada', u'https://www.elcorreo.com/rss/2.0/portada/'),

    ]

    calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
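A note on the cover helper in the posted recipe: building the month as `'/0' + str(now.month)` only gives a two-digit month from January to September, and in October it would produce `/010/`. The sketch below shows a padded alternative next to the posted expression. The names `cover_url_for` and `original_cover_url` are hypothetical, and the day is left unpadded exactly as posted, because the thread does not say which form the cover server expects.

```python
from datetime import date

BASE = 'https://portada.iperiodico.es'  # host taken from the posted recipe


def cover_url_for(day):
    """Build the cover URL for a date (hypothetical helper name)."""
    # {:02d} pads the month to two digits for every month, not just 1-9.
    # The day is left unpadded, mirroring the posted code.
    return '{}/{}/{:02d}/{}_elcorreo.750.jpg'.format(
        BASE, day.year, day.month, day.day)


def original_cover_url(day):
    """The string concatenation from the posted recipe, for comparison."""
    return (BASE + '/' + str(day.year) + '/0' + str(day.month)
            + '/' + str(day.day) + '_elcorreo.750.jpg')


if __name__ == '__main__':
    print(cover_url_for(date(2024, 10, 7)))
    print(original_cover_url(date(2024, 10, 7)))
```

Both functions agree for April 2024; they only diverge once the month has two digits.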



I really tried with the calibre API and by reviewing other recipes, but I cannot manage to do it. I think I need to do something in the preprocess_html function, but I have no clue, really.

Can someone with extensive recipe knowledge help?
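For the missing step described above (pulling the article text out of the JSON script tag), here is a minimal standard-library sketch. It assumes the tag is the `application/ld+json` one named in the posted `keep_only_tags`, and that the text sits under the schema.org `articleBody` key; the thread confirms neither the key nor the JSON layout. `JsonLdCollector` and `extract_article_body` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._in_jsonld:
            self.blocks.append(''.join(self._buffer))
            self._in_jsonld = False


def extract_article_body(html):
    """Return the first 'articleBody' string found in JSON-LD, else None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed JSON instead of aborting
        # JSON-LD may be a single object or a list of objects.
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and isinstance(item.get('articleBody'), str):
                return item['articleBody']
    return None
```

In a recipe, logic like this would plausibly be called from `preprocess_raw_html(self, raw_html, url)`, which receives the page source as a string before it is parsed; how the extracted text is then placed back into the page is left open here.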
04-17-2024, 06:24 AM   #2
martencarlos
I think I found how to replace the desktop URL with the mobile URL by adding this code:

# replace desktop url with mobile url
def get_article_url(self, article):
    desktopUrl = BasicNewsRecipe.get_article_url(self, article)
    mobileUrl = desktopUrl.replace(".html", "_amp.html")
    return mobileUrl

But now I can't retrieve the article image, and I am still missing the article content from the JSON inside the script tag.
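One caveat with the `str.replace` call above: it rewrites every ".html" anywhere in the URL, including the query string, and it would turn an already rewritten URL into `..._amp_amp.html`. A minimal sketch of a stricter variant follows; `to_amp_url` is a hypothetical helper name, and the URLs in the usage lines are made-up examples, not real articles.

```python
from urllib.parse import urlsplit, urlunsplit


def to_amp_url(url):
    """Swap a trailing '.html' in the URL path for '_amp.html'."""
    parts = urlsplit(url)
    path = parts.path
    # Only touch the end of the path, and leave AMP URLs alone.
    if path.endswith('.html') and not path.endswith('_amp.html'):
        path = path[:-len('.html')] + '_amp.html'
    return urlunsplit(parts._replace(path=path))


if __name__ == '__main__':
    # Made-up example URLs, for illustration only.
    print(to_amp_url('https://www.elcorreo.com/bizkaia/nota.html?ref=rss'))
    print(to_amp_url('https://www.elcorreo.com/bizkaia/nota_amp.html'))
```

Inside `get_article_url`, the helper would simply wrap the value returned by `BasicNewsRecipe.get_article_url(self, article)`.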
04-17-2024, 07:05 AM   #3
unkn0wn
Evangelist
Posts: 448
Karma: 82686
Join Date: May 2021
Device: kindle
Is the built-in recipe not working?
04-17-2024, 07:49 AM   #4
martencarlos
Quote:
Originally Posted by unkn0wn
Is the built-in recipe not working?
No, its output only contains the index and a link to each article, not the article text.
04-18-2024, 02:59 AM   #5
unkn0wn
https://github.com/kovidgoyal/calibr...7b66c77715216f

I just tested this and the output is too large (>120 MB). Help me decide which feeds to hash out (comment out).

There are so many articles from this website, just in the past 24 hours.
Code:
feeds = [
        ('Portada', 'http://www.elcorreo.com/rss/atom/portada'),
        ('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Bizkaia', 'http://www.elcorreo.com/rss/atom/?section=bizkaia'),
        ('Opinión', 'https://www.elcorreo.com/rss/atom/?section=opinion'),
        ('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Ciencia', 'https://www.elcorreo.com/rss/atom/?section=ciencia'),
        ('Gipuzkoa', 'http://www.elcorreo.com/rss/atom/?section=gipuzkoa'),
        ('Araba', 'http://www.elcorreo.com/rss/atom/?section=araba'),
        ('La Rioja', 'http://www.elcorreo.com/rss/atom/?section=larioja'),
        ('Miranda', 'http://www.elcorreo.com/rss/atom/?section=miranda'),
        ('Economía', 'http://www.elcorreo.com/rss/atom/?section=economia'),
        ('Culturas', 'http://www.elcorreo.com/rss/atom/?section=culturas'),
        ('Politica', 'http://www.elcorreo.com/rss/atom/?section=politica'),
        ('De tiendas', 'https://www.elcorreo.com/rss/atom/?section=de-tiendas'),
        ('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Elecciones', 'https://www.elcorreo.com/rss/atom/?section=elecciones'),
        ('Sociedad', 'https://www.elcorreo.com/rss/atom/?section=sociedad'),
        ('Vivir', 'https://www.elcorreo.com/rss/atom/?section=vivir'),
        ('Tecnología', 'http://www.elcorreo.com/rss/atom/?section=tecnologia'),
        ('Gente - Estilo', 'http://www.elcorreo.com/rss/atom/?section=gente-estilo'),
        ('Planes', 'http://www.elcorreo.com/rss/atom/?section=planes'),
        ('Athletic', 'http://www.elcorreo.com/rss/atom/?section=athletic'),
        ('Alavés', 'http://www.elcorreo.com/rss/atom/?section=alaves'),
        ('Bilbao Basket', 'http://www.elcorreo.com/rss/atom/?section=bilbaobasket'),
        ('Baskonia', 'http://www.elcorreo.com/rss/atom/?section=baskonia'),
        ('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Jaiak', 'http://www.elcorreo.com/rss/atom/?section=jaiak'),
        ('La Blanca', 'http://www.elcorreo.com/rss/atom/?section=la-blanca-vitoria'),
        ('Aste Nagusia', 'http://www.elcorreo.com/rss/atom/?section=aste-nagusia-bilbao'),
        ('Semana Santa', 'http://www.elcorreo.com/rss/atom/?section=semana-santa'),
        ('Festivales', 'http://www.elcorreo.com/rss/atom/?section=festivales')
    ]
04-18-2024, 09:55 AM   #6
martencarlos
Hi,

I left the most important ones. There can't be that many articles in 24 hours. I guess a lot are duplicates that appear in more than one feed. This is not a big newspaper.
Also, maybe the images are not optimized.

feeds = [
    ('Portada', 'http://www.elcorreo.com/rss/atom/portada'),
    ('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
    ('Bizkaia', 'http://www.elcorreo.com/rss/atom/?section=bizkaia'),
    ('Opinión', 'https://www.elcorreo.com/rss/atom/?section=opinion'),
    ('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
    ('Ciencia', 'https://www.elcorreo.com/rss/atom/?section=ciencia'),
    ('Economía', 'http://www.elcorreo.com/rss/atom/?section=economia'),
    ('Politica', 'http://www.elcorreo.com/rss/atom/?section=politica'),
    ('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
    ('Tecnología', 'http://www.elcorreo.com/rss/atom/?section=tecnologia'),
    ('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
]
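The list above still contains entries that point at the same section and differ only in http versus https: 'Mundo' and 'Internacional' both use `section=internacional`, and 'Deportes' appears twice. A small sketch for filtering such repeats before handing the list to the recipe follows. `dedupe_feeds` is a hypothetical helper, and treating http and https as the same feed is an assumption about this site.

```python
from urllib.parse import urlsplit


def dedupe_feeds(feeds):
    """Keep the first feed for each URL, ignoring the http/https scheme."""
    seen = set()
    unique = []
    for title, url in feeds:
        parts = urlsplit(url)
        # Host, path and query identify the feed; the scheme is ignored.
        key = (parts.netloc.lower(), parts.path.rstrip('/'), parts.query)
        if key not in seen:
            seen.add(key)
            unique.append((title, url))
    return unique


feeds = [
    ('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
    ('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
    ('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
    ('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
]
print([title for title, _ in dedupe_feeds(feeds)])  # ['Mundo', 'Deportes']
```

The first title seen for a section wins, so reorder the input if 'Internacional' is the preferred label.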
04-18-2024, 10:35 AM   #7
martencarlos
Btw, I just had a look at the new recipe code and used it to download the newspaper. I don't know how you did it, but it looks perfect. Thank you very much!

And you were right, there are a lot of articles and they are not duplicated.

Any ideas on how we could reduce the size so it can be sent via email to the Kindle?

Maybe optimize the images?

Thanks again! I really appreciate it.
04-18-2024, 11:20 AM   #8
martencarlos
OK, by adding the following I managed to shrink the EPUB to a decent size for sending via email:

max_articles_per_feed = 10 #articles
compress_news_images = True
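If the two settings above are not enough, calibre's `BasicNewsRecipe` has a few more image-related attributes that may help. The sketch below groups them in a plain holder class so it runs outside calibre; `SizeLimits` is a hypothetical name, and the values for `compress_news_images_max_size` and `scale_news_images` are assumed starting points, not tested figures for this site.

```python
class SizeLimits:
    """Hypothetical holder for size-related recipe attributes."""

    max_articles_per_feed = 10   # from the post above
    compress_news_images = True  # from the post above

    # Assumed values below; tune them for your device and mail limit.
    compress_news_images_max_size = 50  # target size per image, in KB
    scale_news_images = (800, 800)      # maximum (width, height) in pixels


if __name__ == '__main__':
    for name in ('max_articles_per_feed', 'compress_news_images',
                 'compress_news_images_max_size', 'scale_news_images'):
        print(name, '=', getattr(SizeLimits, name))
```

In practice the same attribute lines would be copied into the body of the recipe class itself.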
Tags
elcorreo, elcorreo.com, recipe, recipe broken, recipe request

