Le Monde recipe: how to keep only the URLs of the print edition?

Villard · 02-13-2023, 03:29 AM

Hello,
I'm using the recipe "Le Monde : édition abonnés" created by Sylvain Durand.
The daily ebook is large, around 24 MB, and it also includes some articles that were already in the ebook created the day before.
I would therefore like to keep only the URLs that correspond to the print edition of the newspaper.

Each article's HTML page indicates the date of the print edition, for example "editionDate":"2023-02-11".
I would like to keep only the URLs whose editionDate is >= tomorrow, because the print newspaper is published in the afternoon and carries the following day's date.
As this "editionDate" text sits inside a long script block, I think the best approach is to treat it like a comment in the HTML page, i.e. as plain text to search.

Can you give me some hints to get this done?
Thanks
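
For reference, here is a minimal sketch of how the "editionDate" value could be pulled out of the page text and compared with tomorrow's date. It is plain Python, not part of any recipe. The names `EDITION_DATE_RE`, `extract_edition_date` and `is_upcoming_print_edition` are hypothetical, and the sketch assumes the date always appears in the form "editionDate":"YYYY-MM-DD", as in the example above.

```python
import re
from datetime import date, timedelta

# Assumed pattern: "editionDate":"YYYY-MM-DD", as in the example quoted above.
EDITION_DATE_RE = re.compile(r'"editionDate"\s*:\s*"(\d{4})-(\d{2})-(\d{2})"')


def extract_edition_date(raw_html):
    """Return the print edition date found in the page text, or None."""
    match = EDITION_DATE_RE.search(raw_html)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched but do not form a valid calendar date.
        return None


def is_upcoming_print_edition(raw_html, today):
    """True when the page carries an edition date of tomorrow or later."""
    edition = extract_edition_date(raw_html)
    if edition is None:
        return False
    return edition >= today + timedelta(days=1)
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check easy to try out on saved pages.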

unkn0wn · 02-14-2023, 01:01 AM

you can use

def preprocess_raw_html(self, raw, *a):

and run a regex search on raw (re.search) to check if it's the print edition, then capture the date in a regex group and parse that date by importing

from calibre.utils.date import parse_date
from datetime import datetime, timedelta

and check

if (today - date) > timedelta(1):
    self.abort_article('Skipping old article')

if articles are not print edition, or if they're older than a day, use self.abort_article to abort them

maybe there are other methods.. figure it out.
look for similar stuff in other recipes.
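
The check described above can be sketched in plain Python as follows. This is only an illustration under assumptions: `should_skip` is a hypothetical helper, and `found_date` in the commented wiring stands for whatever date the regex search produced (None when the page has no print edition date).

```python
from datetime import date, timedelta


def should_skip(edition_date, today, max_age=timedelta(days=1)):
    """Decide whether an article should be aborted.

    edition_date is None when no print edition date was found in the page.
    """
    if edition_date is None:
        return True  # not a print edition article
    # Same comparison as in the post: older than one day means skip.
    return (today - edition_date) > max_age


# Hypothetical wiring inside a recipe (not runnable outside calibre):
#
#     def preprocess_raw_html(self, raw, *a):
#         if should_skip(found_date, date.today()):
#             self.abort_article('Skipping old article')
#         return raw
```

An edition date in the future gives a negative difference, so articles dated tomorrow are kept by this comparison.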

Villard · 02-14-2023, 03:16 AM

Thank you for the help! I am going to try your suggestions.
Villard

Villard · 02-15-2023, 12:49 PM

Thanks to your suggestions, I was able to do it.
I used def preprocess_html(self, soup).
Thanks a lot.

I'll test the recipe for a while and then share it so it can be integrated into calibre.

Villard

Muller · 04-22-2023, 08:11 PM

Hello, I'm taking the liberty of joining your thread because I was wondering whether your modification of the "Le Monde : édition abonnés" recipe works. Amazon has announced that it will soon end its newspaper and magazine subscriptions, and I'm looking for a replacement.
Thanks in advance.

Villard · 01-06-2024, 02:58 AM

Quote:

Originally Posted by Muller

Hello, I'm taking the liberty of joining your thread because I was wondering whether your modification of the "Le Monde : édition abonnés" recipe works. Amazon has announced that it will soon end its newspaper and magazine subscriptions, and I'm looking for a replacement.
Thanks in advance.

Hello,
Sorry for not replying sooner. I only discovered your post today. Below is the recipe I use, which works well. I really do need to share it!

I have listed all of Le Monde's RSS feeds! You can remove the feeds that don't interest you.

#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals

__license__ = 'GPL v3'
__copyright__ = '2012'

from calibre.web.feeds.news import BasicNewsRecipe, classes
from datetime import date
import re


class LeMonde(BasicNewsRecipe):
    title = 'Le Monde'
    __author__ = 'Martin Villard'
    description = 'Les flux RSS du Monde.fr'
    publisher = 'Société Editrice du Monde'
    publication_type = 'newspaper'
    needs_subscription = 'optional'
    language = 'fr'

    oldest_article = 1
    no_stylesheets = False
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}
    reverse_article_order = True

    conversion_options = {
        'publisher': publisher
    }

    masthead_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Le_monde_logo.svg/800px-Le_monde_logo.svg.png'

    feeds = [
        ('International : Europe ', 'https://www.lemonde.fr/europe/rss_full.xml'),
        ('International : Amériques ', 'https://www.lemonde.fr/ameriques/rss_full.xml'),
        ('International : Afrique ', 'https://www.lemonde.fr/afrique/rss_full.xml'),
        ('International : Asie Pacifique', 'https://www.lemonde.fr/asie-pacifique/rss_full.xml'),
        ('International : Proche-Orient', 'https://www.lemonde.fr/proche-orient/rss_full.xml'),
        ('International : Royaume-Uni', 'https://www.lemonde.fr/royaume-uni/rss_full.xml'),
        ('International : Etats-Unis', 'https://www.lemonde.fr/etats-unis/rss_full.xml'),
        ('International : La une', 'https://www.lemonde.fr/international/rss_full.xml'),
        ('France : Politique ', 'https://www.lemonde.fr/politique/rss_full.xml'),
        ('France : Société ', 'https://www.lemonde.fr/societe/rss_full.xml'),
        ('France : Les décodeurs', 'https://www.lemonde.fr/les-decodeurs/rss_full.xml'),
        ('France : Justice ', 'https://www.lemonde.fr/justice/rss_full.xml'),
        ('France : Police ', 'https://www.lemonde.fr/police/rss_full.xml'),
        ('France : Campus ', 'https://www.lemonde.fr/campus/rss_full.xml'),
        ('France : Education', 'https://www.lemonde.fr/education/rss_full.xml'),
        ('Economie : Entreprises ', 'https://www.lemonde.fr/entreprises/rss_full.xml'),
        ('Economie : Argent ', 'https://www.lemonde.fr/argent/rss_full.xml'),
        ('Economie : Économie française', 'https://www.lemonde.fr/economie-francaise/rss_full.xml'),
        ('Economie : Industrie', 'https://www.lemonde.fr/industrie/rss_full.xml'),
        ('Economie : Emploi ', 'https://www.lemonde.fr/emploi/rss_full.xml'),
        ('Economie : Immobilier ', 'https://www.lemonde.fr/immobilier/rss_full.xml'),
        ('Economie : Médias', 'https://www.lemonde.fr/medias/rss_full.xml'),
        ('Economie : La une', 'https://www.lemonde.fr/economie/rss_full.xml'),
        ('Planète: Climat ', 'https://www.lemonde.fr/climat/rss_full.xml'),
        ('Planète: Agriculture ', 'https://www.lemonde.fr/agriculture/rss_full.xml'),
        ('Planète: Environnement', 'https://www.lemonde.fr/environnement/rss_full.xml'),
        ('Planète: La une', 'https://www.lemonde.fr/planete/rss_full.xml'),
        ('Sciences : Espace ', 'https://www.lemonde.fr/espace/rss_full.xml'),
        ('Sciences : Biologie ', 'https://www.lemonde.fr/biologie/rss_full.xml'),
        ('Sciences : Médecine ', 'https://www.lemonde.fr/medecine/rss_full.xml'),
        ('Sciences : Physique ', 'https://www.lemonde.fr/physique/rss_full.xml'),
        ('Sciences : Santé', 'https://www.lemonde.fr/sante/rss_full.xml'),
        ('Sciences : La une', 'https://www.lemonde.fr/sciences/rss_full.xml'),
        ('Culture : Cinéma ', 'https://www.lemonde.fr/cinema/rss_full.xml'),
        ('Culture : Musiques ', 'https://www.lemonde.fr/musiques/rss_full.xml'),
        ('Culture : Télévision et radio', 'https://www.lemonde.fr/televisions-radio/rss_full.xml'),
        ('Culture : Le Monde des livres', 'https://www.lemonde.fr/livres/rss_full.xml'),
        ('Culture : Arts ', 'https://www.lemonde.fr/arts/rss_full.xml'),
        ('Culture : Scènes', 'https://www.lemonde.fr/scenes/rss_full.xml'),
        ('Culture : La une', 'https://www.lemonde.fr/culture/rss_full.xml'),
        ('Opinions : La une', 'https://www.lemonde.fr/idees/rss_full.xml'),
        ('Opinions : éditoriaux', 'https://www.lemonde.fr/editoriaux/rss_full.xml'),
        ('Opinions : chroniques ', 'https://www.lemonde.fr/chroniques/rss_full.xml'),
        ('Opinions : tribunes', 'https://www.lemonde.fr/tribunes/rss_full.xml'),
        ('Pixels : Jeux vidéo', 'https://www.lemonde.fr/jeux-video/rss_full.xml'),
        ('Pixels : Culture web', 'https://www.lemonde.fr/cultures-web/rss_full.xml'),
        ('Pixels : La une', 'https://www.lemonde.fr/pixels/rss_full.xml'),
        ('Sport : Football ', 'https://www.lemonde.fr/football/rss_full.xml'),
        ('Sport : Rugby ', 'https://www.lemonde.fr/rugby/rss_full.xml'),
        ('Sport : Tennis ', 'https://www.lemonde.fr/tennis/rss_full.xml'),
        ('Sport : Cyclisme ', 'https://www.lemonde.fr/cyclisme/rss_full.xml'),
        ('Sport : Basket', 'https://www.lemonde.fr/basket/rss_full.xml'),
        ('Sport : La une', 'https://www.lemonde.fr/sport/rss_full.xml'),
        ('M le mag : L’époque ', 'https://www.lemonde.fr/m-perso/rss_full.xml'),
        ('M le mag : Styles ', 'https://www.lemonde.fr/m-styles/rss_full.xml'),
        ('M le mag : Gastronomie ', 'https://www.lemonde.fr/gastronomie/rss_full.xml'),
        ('M le mag : Recettes du Monde', 'https://www.lemonde.fr/les-recettes-du-monde/rss_full.xml'),
        ('M le mag : Sexo', 'https://www.lemonde.fr/sexo/rss_full.xml'),
        ('M le mag : La une', 'https://www.lemonde.fr/m-le-mag/rss_full.xml'),
        ('Actualités : A la une', 'https://www.lemonde.fr/rss/une.xml'),
        ('Actualités : En continu', 'https://www.lemonde.fr/rss/en_continu.xml'),
        ('Actualités : Vidéos ', 'https://www.lemonde.fr/videos/rss_full.xml'),
        ('Actualités : Portfolios', 'https://www.lemonde.fr/photo/rss_full.xml'),
    ]

    keep_only_tags = [
        classes('article__header'),
        dict(name='section', attrs={'class': ['article__cover', 'article__content', 'article__heading',
                                              'article__wrapper']})
    ]

    remove_tags = [
        classes('article__status meta__reading-time meta__social multimedia-embed'),
        dict(name=['footer', 'link']),
        dict(name='img', attrs={'class': ['article__author-picture']}),
        dict(name='section', attrs={'class': ['inread js-services-inread', 'catcher catcher--inline', 'inread inread--NL js-services-inread', 'article__reactions', 'author', 'catcher', 'portfolio', 'services-inread']})
    ]

    remove_attributes = [
        'data-sizes', 'height', 'sizes', 'width'
    ]

    preprocess_regexps = [
        # insert space between author name and description
        (re.compile(r'(<span class="[^"]*author__desc[^>]*>)([^<]*</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' ' + match.group(2)),
        # insert " | " between article type and description
        (re.compile(r'(<span class="[^"]*article__kicker[^>]*>[^<]*)(</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' | ' + match.group(2))
    ]

    extra_css = '''
        h2 { font-size: 1em; }
        h3 { font-size: 1em; }
        .article__desc { font-weight: bold; }
        .article__fact { font-weight: bold; text-transform: uppercase; }
        .article__kicker { text-transform: uppercase; }
        .article__legend { font-size: 0.6em; margin-bottom: 1em; }
        .article__title { margin-top: 0em; }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            try:
                br.open('https://secure.lemonde.fr/sfuser/connexion')
                br.select_form(nr=0)
                br['email'] = self.username
                br['password'] = self.password
                br.submit()
            except Exception as e:
                self.log('Login failed with error:', str(e))
        return br

    def get_cover_url(self):
        # today's date is a reasonable guess for the ID of the cover
        cover_id = date.today().strftime('%Y%m%d')
        soup = self.index_to_soup('https://www.lemonde.fr/')
        a = soup.find('a', {'id': 'jelec_link', 'style': True})
        if a and a['style']:
            url = a['style'].split('/')
            if len(url) > 5 and url[3].isdigit():
                # overwrite guess if actual cover ID was found
                cover_id = url[3]
        return 'https://www.lemonde.fr/thumbnail/journal/' + cover_id + '/1000/1490'

    def get_article_url(self, article):
        url = BasicNewsRecipe.get_article_url(self, article)
        # skip articles without relevant content (e.g., videos)
        for el in 'blog chat live podcasts portfolio video visuel'.split():
            if '/' + el + '/' in url:
                self.log('Skipping URL', url)
                self.abort_article()
        return url

    def preprocess_html(self, soup):
        # when an image is available in multiple sizes, select the smallest one
        for img in soup.find_all('img', {'data-srcset': True}):
            print("IMGDDPYL0 = ", img)
            data_srcset = img['data-srcset'].split()
            print("IMGDDPYL1 = ", data_srcset)
            if len(data_srcset) > 1:
                img['src'] = data_srcset[-2]
                print("IMGDDPYL2 = ", img['src'])
                del img['data-srcset']
        return soup

    def postprocess_html(self, soup, first_fetch):
        # remove local hyperlinks
        for a in soup.find_all('a', {'href': True}):
            if '.lemonde.fr/' in a['href']:
                a.replace_with(self.tag_to_string(a))
        # clean up header
        for ul in soup.find_all('ul', {'class': 'breadcrumb'}):
            div = soup.new_tag('div')
            category = ''
            for li in ul.find_all('li', {'class': True}):
                category += self.tag_to_string(li).strip().upper() + ' - '
                div.string = category[:-3]
            ul.replace_with(div)
        return soup


calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'

Teebob · 01-22-2024, 12:08 PM

Hello, and thanks for sharing the code. I tried many times in many different ways, but I am still facing the issue of the recipe producing a very large file (47 MB). Maybe it is still extracting old articles? I cannot even load the file onto my Kindle. I tried to locate the piece of code that takes out the old articles, but couldn't find it. Maybe you can give me a hint?

The other strange issue is that, after I run the recipe, the website lemonde.fr stops working for me for about an hour!! I get an error 406.

unkn0wn · 01-23-2024, 03:58 AM

@Villard, the recipe should have been shared in [ CODE ] tags.
Share your recipe file here and I'll try to fix it.

Teebob · 01-23-2024, 02:33 PM

Well, it actually does that when running the default recipe (the one called "Le Monde: Édition abonnés"; I have pasted a copy below).
I tried again just now and it crashed again, producing that weird error 406 on lemonde.fr.

However, I noticed that when I run the other Le Monde recipe (the basic one for non-subscribers), it works. And if I enter my password I get the full articles. But I think it only covers a limited number of feeds.

Code:

#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
from __future__ import absolute_import, division, print_function, unicode_literals

__author__ = 'S. Durand <sylvaindurand@users.noreply.github.com>'
__license__ = 'GPL v3'

'''
lemonde.fr
'''

from calibre.web.feeds.news import BasicNewsRecipe, classes
from datetime import date
import re


class LeMondeNumerique(BasicNewsRecipe):
    title = 'Le Monde: Édition abonnés'
    __author__ = 'Sylvain Durand'
    description = 'La version numérique du quotidien Le Monde'
    publisher = 'Société Editrice du Monde'
    publication_type = 'newspaper'
    needs_subscription = True
    language = 'fr'

    no_stylesheets = True
    ignore_duplicate_articles = {'title', 'url'}

    conversion_options = {
        'publisher': publisher
    }

    masthead_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Le_monde_logo.svg/800px-Le_monde_logo.svg.png'

    lm_sections = [
        'international:International',
        'politique:Politique',
        'societe:Société',
        'economie:Éco',
        'culture:Culture',
        'idees:Idées',
        'planete:Planète',
        'sport:Sport',
        'sciences:Sciences',
        'pixels:Pixels',
        'campus:Campus'
    ]

    keep_only_tags = [
        classes('article__header'),
        dict(name='section', attrs={'class': ['article__content', 'article__heading',
                                              'article__wrapper']})
    ]

    remove_tags = [
        classes('article__status meta__date meta__reading-time meta__social multimedia-embed'),
        dict(name=['footer', 'link']),
        dict(name='img', attrs={'class': ['article__author-picture']}),
        dict(name='section', attrs={'class': ['article__reactions', 'author', 'catcher',
                                              'portfolio', 'services-inread']})
    ]

    remove_attributes = [
        'data-sizes', 'height', 'sizes', 'width'
    ]

    preprocess_regexps = [
        # insert space between author name and description
        (re.compile(r'(<span class="[^"]*author__desc[^>]*>)([^<]*</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' ' + match.group(2)),
        # insert " | " between article type and description
        (re.compile(r'(<span class="[^"]*article__kicker[^>]*>[^<]*)(</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' | ' + match.group(2))
    ]

    extra_css = '''
        h2 { font-size: 1em; }
        h3 { font-size: 1em; }
        .article__desc { font-weight: bold; }
        .article__fact { font-weight: bold; text-transform: uppercase; }
        .article__kicker { text-transform: uppercase; }
        .article__legend { font-size: 0.6em; margin-bottom: 1em; }
        .article__title { margin-top: 0em; }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            try:
                br.open('https://secure.lemonde.fr/sfuser/connexion')
                br.select_form(nr=0)
                br['email'] = self.username
                br['password'] = self.password
                br.submit()
            except Exception as e:
                self.log('Login failed with error:', str(e))
        return br

    def get_cover_url(self):
        # today's date is a reasonable guess for the ID of the cover
        cover_id = date.today().strftime('%Y%m%d')
        soup = self.index_to_soup('https://www.lemonde.fr/')
        a = soup.find('a', {'id': 'jelec_link', 'style': True})
        if a and a['style']:
            url = a['style'].split('/')
            if len(url) > 5 and url[3].isdigit():
                # overwrite guess if actual cover ID was found
                cover_id = url[3]
        return 'https://www.lemonde.fr/thumbnail/journal/' + cover_id + '/1000/1490'

    def parse_index(self):
        ans = []
        for x in self.lm_sections:
            s, section_title = x.partition(':')[::2]
            self.log('Processing section', section_title, '...')
            articles = list(self.parse_section('https://www.lemonde.fr/%s/' % s))
            if articles:
                ans.append((section_title, articles))
        return ans

    def parse_section(self, url):
        soup = self.index_to_soup(url)
        for article in soup.find_all('section', {'class': 'teaser'}):
            # extract URL
            a = article.find('a', {'class': 'teaser__link', 'href': True})
            if a is None:
                continue
            url = a['href']
            # skip articles without relevant content (e.g., videos)
            for el in 'blog chat live newsletters podcasts portfolio video visuel'.split():
                if '/' + el + '/' in url:
                    url = None
                    break
            if url is None:
                continue
            # extract title
            h3 = article.find('h3', {'class': 'teaser__title'})
            if h3 is None:
                continue
            title = self.tag_to_string(h3)
            # extract description
            desc = ''
            p = article.find('p', {'class': 'teaser__desc'})
            if p is not None:
                desc = self.tag_to_string(p)
            self.log('\tFound article', title, 'at', url)
            yield {'title': title, 'url': url, 'description': desc}

    def preprocess_html(self, soup):
        # when an image is available in multiple sizes, select the smallest one
        for img in soup.find_all('img', {'data-srcset': True}):
            data_srcset = img['data-srcset'].split()
            if len(data_srcset) > 1:
                img['src'] = data_srcset[-2]
                del img['data-srcset']
        return soup

    def postprocess_html(self, soup, first_fetch):
        # remove local hyperlinks
        for a in soup.find_all('a', {'href': True}):
            if '.lemonde.fr/' in a['href']:
                a.replace_with(self.tag_to_string(a))
        # clean up header
        for ul in soup.find_all('ul', {'class': 'breadcrumb'}):
            div = soup.new_tag('div')
            category = ''
            for li in ul.find_all('li', {'class': True}):
                category += self.tag_to_string(li).strip().upper() + ' - '
                div.string = category[:-3]
            ul.replace_with(div)
        return soup


calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'

unkn0wn · 01-24-2024, 02:09 AM

I was actually asking for @Villard's recipe, hoping that you had fixed its indentation and so on and tried it, since you thanked him for sharing the code.
The default recipe hasn't been updated to match his recipe.

If you think your attached recipe works, you can just replace def parse_index with the feeds list from Villard's recipe and add oldest_article = 1 to get all sections.
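
As a rough illustration of what an age limit such as oldest_article = 1 is meant to achieve, the sketch below drops feed entries published more than a given number of days ago. It is plain Python and not calibre's actual implementation; `keep_recent` and the (title, published) pairs are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def keep_recent(articles, now, oldest_article_days=1.0):
    """Keep only (title, published) pairs no older than the given number of days."""
    cutoff = now - timedelta(days=oldest_article_days)
    return [(title, published) for title, published in articles
            if published >= cutoff]


# Example with one fresh and one stale entry.
now = datetime(2024, 1, 24, 12, 0, tzinfo=timezone.utc)
items = [
    ('fresh', now - timedelta(hours=3)),
    ('stale', now - timedelta(days=2)),
]
recent = keep_recent(items, now)
```

With a one-day limit only the 'fresh' entry remains in `recent`, which is why an age limit helps keep the daily ebook small.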

Villard · 01-24-2024, 11:03 AM

Hello,
Here is my recipe. It works fine for me every day. You of course need a Le Monde subscription, and you need to pass your account credentials to the ebook-convert.exe command.

I know I still have to share several recipes I'm working on and publish them for a future calibre version. Sorry for not having done it yet.

Code:

#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals

__license__ = 'GPL v3'
__copyright__ = '2012'



from calibre.web.feeds.news import BasicNewsRecipe, classes
from datetime import date
import re


class LeMonde(BasicNewsRecipe):
    title = 'Le Monde'
    __author__ = 'Martin Villard'
    description = 'Les flux RSS du Monde.fr'
    publisher = 'Société Editrice du Monde'
    publication_type = 'newspaper'
    needs_subscription = 'optional'
    language = 'fr'


    oldest_article = 1
    no_stylesheets = False
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}
    reverse_article_order = True


    conversion_options = {
        'publisher': publisher
    }

    masthead_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Le_monde_logo.svg/800px-Le_monde_logo.svg.png'

    feeds = [
('International : Europe ', 'https://www.lemonde.fr/europe/rss_full.xml'),
('International : Amériques ', 'https://www.lemonde.fr/ameriques/rss_full.xml'),
('International : Afrique ', 'https://www.lemonde.fr/afrique/rss_full.xml'),
('International : Asie Pacifique', 'https://www.lemonde.fr/asie-pacifique/rss_full.xml'),
('International : Proche-Orient', 'https://www.lemonde.fr/proche-orient/rss_full.xml'),
('International : Royaume-Uni', 'https://www.lemonde.fr/royaume-uni/rss_full.xml'),
('International : Etats-Unis', 'https://www.lemonde.fr/etats-unis/rss_full.xml'),
('International : La une', 'https://www.lemonde.fr/international/rss_full.xml'),
('France : Politique ', 'https://www.lemonde.fr/politique/rss_full.xml'),
('France : Société ', 'https://www.lemonde.fr/societe/rss_full.xml'),
('France : Les décodeurs', 'https://www.lemonde.fr/les-decodeurs/rss_full.xml'),
('France : Justice ', 'https://www.lemonde.fr/justice/rss_full.xml'),
('France : Police ', 'https://www.lemonde.fr/police/rss_full.xml'),
('France : Campus ', 'https://www.lemonde.fr/campus/rss_full.xml'),
('France : Education', 'https://www.lemonde.fr/education/rss_full.xml'),
('Economie : Entreprises ', 'https://www.lemonde.fr/entreprises/rss_full.xml'),
('Economie : Argent ', 'https://www.lemonde.fr/argent/rss_full.xml'),
('Economie : Économie française', 'https://www.lemonde.fr/economie-francaise/rss_full.xml'),
('Economie : Industrie', 'https://www.lemonde.fr/industrie/rss_full.xml'),
('Economie : Emploi ', 'https://www.lemonde.fr/emploi/rss_full.xml'),
('Economie : Immobilier ', 'https://www.lemonde.fr/immobilier/rss_full.xml'),
('Economie : Médias', 'https://www.lemonde.fr/medias/rss_full.xml'),
('Economie : La une', 'https://www.lemonde.fr/economie/rss_full.xml'),
('Planète: Climat ', 'https://www.lemonde.fr/climat/rss_full.xml'),
('Planète: Agriculture ', 'https://www.lemonde.fr/agriculture/rss_full.xml'),
('Planète: Environnement', 'https://www.lemonde.fr/environnement/rss_full.xml'),
('Planète: La une', 'https://www.lemonde.fr/planete/rss_full.xml'),
('Sciences : Espace ', 'https://www.lemonde.fr/espace/rss_full.xml'),
('Sciences : Biologie ', 'https://www.lemonde.fr/biologie/rss_full.xml'),
('Sciences : Médecine ', 'https://www.lemonde.fr/medecine/rss_full.xml'),
('Sciences : Physique ', 'https://www.lemonde.fr/physique/rss_full.xml'),
('Sciences : Santé', 'https://www.lemonde.fr/sante/rss_full.xml'),
('Sciences : La une', 'https://www.lemonde.fr/sciences/rss_full.xml'),
('Culture : Cinéma ', 'https://www.lemonde.fr/cinema/rss_full.xml'),
('Culture : Musiques ', 'https://www.lemonde.fr/musiques/rss_full.xml'),
('Culture : Télévision et radio', 'https://www.lemonde.fr/televisions-radio/rss_full.xml'),
('Culture : Le Monde des livres', 'https://www.lemonde.fr/livres/rss_full.xml'),
('Culture : Arts ', 'https://www.lemonde.fr/arts/rss_full.xml'),
('Culture : Scènes', 'https://www.lemonde.fr/scenes/rss_full.xml'),
('Culture : La une', 'https://www.lemonde.fr/culture/rss_full.xml'),
('Opinions : La une', 'https://www.lemonde.fr/idees/rss_full.xml'),
('Opinions : éditoriaux', 'https://www.lemonde.fr/editoriaux/rss_full.xml'),
('Opinions : chroniques ', 'https://www.lemonde.fr/chroniques/rss_full.xml'),
('Opinions : tribunes', 'https://www.lemonde.fr/tribunes/rss_full.xml'),
('Pixels : Jeux vidéo', 'https://www.lemonde.fr/jeux-video/rss_full.xml'),
('Pixels : Culture web', 'https://www.lemonde.fr/cultures-web/rss_full.xml'),
('Pixels : La une', 'https://www.lemonde.fr/pixels/rss_full.xml'),
('Sport : Football ', 'https://www.lemonde.fr/football/rss_full.xml'),
('Sport : Rugby ', 'https://www.lemonde.fr/rugby/rss_full.xml'),
('Sport : Tennis ', 'https://www.lemonde.fr/tennis/rss_full.xml'),
('Sport : Cyclisme ', 'https://www.lemonde.fr/cyclisme/rss_full.xml'),
('Sport : Basket', 'https://www.lemonde.fr/basket/rss_full.xml'),
('Sport : La une', 'https://www.lemonde.fr/sport/rss_full.xml'),
('M le mag : L’époque ', 'https://www.lemonde.fr/m-perso/rss_full.xml'),
('M le mag : Styles ', 'https://www.lemonde.fr/m-styles/rss_full.xml'),
('M le mag : Gastronomie ', 'https://www.lemonde.fr/gastronomie/rss_full.xml'),
('M le mag : Recettes du Monde', 'https://www.lemonde.fr/les-recettes-du-monde/rss_full.xml'),
('M le mag : Sexo', 'https://www.lemonde.fr/sexo/rss_full.xml'),
('M le mag : La une', 'https://www.lemonde.fr/m-le-mag/rss_full.xml'),
('Actualités : A la une', 'https://www.lemonde.fr/rss/une.xml'),
('Actualités : En continu', 'https://www.lemonde.fr/rss/en_continu.xml'),
('Actualités : Vidéos ', 'https://www.lemonde.fr/videos/rss_full.xml'),
('Actualités : Portfolios', 'https://www.lemonde.fr/photo/rss_full.xml'),       

    ]

    keep_only_tags = [
        classes('article__header'),
        dict(name='section', attrs={'class': ['article__cover', 'article__content', 'article__heading',
                                              'article__wrapper']})
    ]

    remove_tags = [
        classes('article__status meta__reading-time meta__social multimedia-embed'),
        dict(name=['footer', 'link']),
        dict(name='img', attrs={'class': ['article__author-picture']}),
        dict(name='section', attrs={'class': ['inread js-services-inread', 'catcher catcher--inline', 'inread inread--NL js-services-inread', 'article__reactions', 'author', 'catcher', 'portfolio', 'services-inread']})
    ]

    remove_attributes = [
        'data-sizes', 'height', 'sizes', 'width'
    ]

    preprocess_regexps = [
        # insert space between author name and description
        (re.compile(r'(<span class="[^"]*author__desc[^>]*>)([^<]*</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' ' + match.group(2)),
        # insert " | " between article type and description
        (re.compile(r'(<span class="[^"]*article__kicker[^>]*>[^<]*)(</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' | ' + match.group(2))
    ]

    extra_css = '''
        h2 { font-size: 1em; }
        h3 { font-size: 1em; }
        .article__desc { font-weight: bold; }
        .article__fact { font-weight: bold; text-transform: uppercase; }
        .article__kicker { text-transform: uppercase; }
        .article__legend { font-size: 0.6em; margin-bottom: 1em; }
        .article__title { margin-top: 0em; }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            try:
                br.open('https://secure.lemonde.fr/sfuser/connexion')
                br.select_form(nr=0)
                br['email'] = self.username
                br['password'] = self.password
                br.submit()
            except Exception as e:
                self.log('Login failed with error:', str(e))
        return br

    def get_cover_url(self):
        # today's date is a reasonable guess for the ID of the cover
        cover_id = date.today().strftime('%Y%m%d')
        soup = self.index_to_soup('https://www.lemonde.fr/')
        a = soup.find('a', {'id': 'jelec_link', 'style': True})
        if a and a['style']:
            url = a['style'].split('/')
            if len(url) > 5 and url[3].isdigit():
                # overwrite guess if actual cover ID was found
                cover_id = url[3]
        return 'https://www.lemonde.fr/thumbnail/journal/' + cover_id + '/1000/1490'

    def get_article_url(self, article):
        url = BasicNewsRecipe.get_article_url(self, article)
        # skip articles without relevant content (e.g., videos)
        for el in 'blog chat live podcasts portfolio video visuel'.split():
            if '/' + el + '/' in url:
                self.log('Skipping URL', url)
                self.abort_article()
        return url
    
    
    def preprocess_html(self, soup):
        # when an image is available in multiple sizes, select the smallest one
        for img in soup.find_all('img', {'data-srcset': True}):
            print ("IMGDDPYL0 = ", img)
            data_srcset = img['data-srcset'].split()
            print ("IMGDDPYL1 = ", data_srcset)
            if len(data_srcset) > 1:
                img['src'] = data_srcset[-2]
                print("IMGDDPYL2 = " ,img['src'])                
                del img['data-srcset']
        return soup
        
    def postprocess_html(self, soup, first_fetch):
        # remove local hyperlinks
        for a in soup.find_all('a', {'href': True}):
            if '.lemonde.fr/' in a['href']:
                a.replace_with(self.tag_to_string(a))
        # clean up header
        for ul in soup.find_all('ul', {'class': 'breadcrumb'}):
            div = soup.new_tag('div')
            category = ''
            for li in ul.find_all('li', {'class': True}):
                category += self.tag_to_string(li).strip().upper() + ' - '
                div.string = category[:-3]
            ul.replace_with(div)
        return soup


calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'

unkn0wn · 01-25-2024, 10:12 AM

https://github.com/unkn0w7n/calibre/...cbd59a35fb3793

Teebob · 01-27-2024, 04:08 AM

Hi,
Thanks, Villard.
I've been using your new recipe for a couple of days now. Works fine!

Muller · 01-31-2024, 11:16 PM

Quote:

Originally Posted by Villard

Hello,
Sorry for not replying sooner. I only discovered your post today. Below is the recipe I use, which works well. I really do need to share it!

I have listed all of Le Monde's RSS feeds! You can remove the feeds that don't interest you.

#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2012'

from calibre.web.feeds.news import BasicNewsRecipe, classes
from datetime import date
import re

class LeMonde(BasicNewsRecipe):
title = 'Le Monde'
__author__ = 'Martin Villard'
description = 'Les flux RSS du Monde.fr'
publisher = 'Société Editrice du Monde'
publication_type = 'newspaper'
needs_subscription = 'optional'
language = 'fr'

oldest_article = 1
no_stylesheets = False
remove_empty_feeds = True
ignore_duplicate_articles = {'title', 'url'}
reverse_article_order = True

conversion_options = {
'publisher': publisher
}

masthead_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Le_monde_logo.svg/800px-Le_monde_logo.svg.png'

    feeds = [
        ('International : Europe ', 'https://www.lemonde.fr/europe/rss_full.xml'),
        ('International : Amériques ', 'https://www.lemonde.fr/ameriques/rss_full.xml'),
        ('International : Afrique ', 'https://www.lemonde.fr/afrique/rss_full.xml'),
        ('International : Asie Pacifique', 'https://www.lemonde.fr/asie-pacifique/rss_full.xml'),
        ('International : Proche-Orient', 'https://www.lemonde.fr/proche-orient/rss_full.xml'),
        ('International : Royaume-Uni', 'https://www.lemonde.fr/royaume-uni/rss_full.xml'),
        ('International : Etats-Unis', 'https://www.lemonde.fr/etats-unis/rss_full.xml'),
        ('International : La une', 'https://www.lemonde.fr/international/rss_full.xml'),
        ('France : Politique ', 'https://www.lemonde.fr/politique/rss_full.xml'),
        ('France : Société ', 'https://www.lemonde.fr/societe/rss_full.xml'),
        ('France : Les décodeurs', 'https://www.lemonde.fr/les-decodeurs/rss_full.xml'),
        ('France : Justice ', 'https://www.lemonde.fr/justice/rss_full.xml'),
        ('France : Police ', 'https://www.lemonde.fr/police/rss_full.xml'),
        ('France : Campus ', 'https://www.lemonde.fr/campus/rss_full.xml'),
        ('France : Education', 'https://www.lemonde.fr/education/rss_full.xml'),
        ('Economie : Entreprises ', 'https://www.lemonde.fr/entreprises/rss_full.xml'),
        ('Economie : Argent ', 'https://www.lemonde.fr/argent/rss_full.xml'),
        ('Economie : Économie française', 'https://www.lemonde.fr/economie-francaise/rss_full.xml'),
        ('Economie : Industrie', 'https://www.lemonde.fr/industrie/rss_full.xml'),
        ('Economie : Emploi ', 'https://www.lemonde.fr/emploi/rss_full.xml'),
        ('Economie : Immobilier ', 'https://www.lemonde.fr/immobilier/rss_full.xml'),
        ('Economie : Médias', 'https://www.lemonde.fr/medias/rss_full.xml'),
        ('Economie : La une', 'https://www.lemonde.fr/economie/rss_full.xml'),
        ('Planète: Climat ', 'https://www.lemonde.fr/climat/rss_full.xml'),
        ('Planète: Agriculture ', 'https://www.lemonde.fr/agriculture/rss_full.xml'),
        ('Planète: Environnement', 'https://www.lemonde.fr/environnement/rss_full.xml'),
        ('Planète: La une', 'https://www.lemonde.fr/planete/rss_full.xml'),
        ('Sciences : Espace ', 'https://www.lemonde.fr/espace/rss_full.xml'),
        ('Sciences : Biologie ', 'https://www.lemonde.fr/biologie/rss_full.xml'),
        ('Sciences : Médecine ', 'https://www.lemonde.fr/medecine/rss_full.xml'),
        ('Sciences : Physique ', 'https://www.lemonde.fr/physique/rss_full.xml'),
        ('Sciences : Santé', 'https://www.lemonde.fr/sante/rss_full.xml'),
        ('Sciences : La une', 'https://www.lemonde.fr/sciences/rss_full.xml'),
        ('Culture : Cinéma ', 'https://www.lemonde.fr/cinema/rss_full.xml'),
        ('Culture : Musiques ', 'https://www.lemonde.fr/musiques/rss_full.xml'),
        ('Culture : Télévision et radio', 'https://www.lemonde.fr/televisions-radio/rss_full.xml'),
        ('Culture : Le Monde des livres', 'https://www.lemonde.fr/livres/rss_full.xml'),
        ('Culture : Arts ', 'https://www.lemonde.fr/arts/rss_full.xml'),
        ('Culture : Scènes', 'https://www.lemonde.fr/scenes/rss_full.xml'),
        ('Culture : La une', 'https://www.lemonde.fr/culture/rss_full.xml'),
        ('Opinions : La une', 'https://www.lemonde.fr/idees/rss_full.xml'),
        ('Opinions : éditoriaux', 'https://www.lemonde.fr/editoriaux/rss_full.xml'),
        ('Opinions : chroniques ', 'https://www.lemonde.fr/chroniques/rss_full.xml'),
        ('Opinions : tribunes', 'https://www.lemonde.fr/tribunes/rss_full.xml'),
        ('Pixels : Jeux vidéo', 'https://www.lemonde.fr/jeux-video/rss_full.xml'),
        ('Pixels : Culture web', 'https://www.lemonde.fr/cultures-web/rss_full.xml'),
        ('Pixels : La une', 'https://www.lemonde.fr/pixels/rss_full.xml'),
        ('Sport : Football ', 'https://www.lemonde.fr/football/rss_full.xml'),
        ('Sport : Rugby ', 'https://www.lemonde.fr/rugby/rss_full.xml'),
        ('Sport : Tennis ', 'https://www.lemonde.fr/tennis/rss_full.xml'),
        ('Sport : Cyclisme ', 'https://www.lemonde.fr/cyclisme/rss_full.xml'),
        ('Sport : Basket', 'https://www.lemonde.fr/basket/rss_full.xml'),
        ('Sport : La une', 'https://www.lemonde.fr/sport/rss_full.xml'),
        ('M le mag : L’époque ', 'https://www.lemonde.fr/m-perso/rss_full.xml'),
        ('M le mag : Styles ', 'https://www.lemonde.fr/m-styles/rss_full.xml'),
        ('M le mag : Gastronomie ', 'https://www.lemonde.fr/gastronomie/rss_full.xml'),
        ('M le mag : Recettes du Monde', 'https://www.lemonde.fr/les-recettes-du-monde/rss_full.xml'),
        ('M le mag : Sexo', 'https://www.lemonde.fr/sexo/rss_full.xml'),
        ('M le mag : La une', 'https://www.lemonde.fr/m-le-mag/rss_full.xml'),
        ('Actualités : A la une', 'https://www.lemonde.fr/rss/une.xml'),
        ('Actualités : En continu', 'https://www.lemonde.fr/rss/en_continu.xml'),
        ('Actualités : Vidéos ', 'https://www.lemonde.fr/videos/rss_full.xml'),
        ('Actualités : Portfolios', 'https://www.lemonde.fr/photo/rss_full.xml'),
    ]

    keep_only_tags = [
        classes('article__header'),
        dict(name='section', attrs={'class': ['article__cover', 'article__content', 'article__heading',
                                              'article__wrapper']})
    ]

    remove_tags = [
        classes('article__status meta__reading-time meta__social multimedia-embed'),
        dict(name=['footer', 'link']),
        dict(name='img', attrs={'class': ['article__author-picture']}),
        dict(name='section', attrs={'class': ['inread js-services-inread', 'catcher catcher--inline', 'inread inread--NL js-services-inread', 'article__reactions', 'author', 'catcher', 'portfolio', 'services-inread']})
    ]

    remove_attributes = [
        'data-sizes', 'height', 'sizes', 'width'
    ]

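    preprocess_regexps = [
        # NOTE: the forum software stripped the HTML tags out of both patterns.
        # The <span ...> and </span> parts below are assumed from the stock
        # calibre Le Monde recipe.
        # insert space between author name and description
        (re.compile(r'(<span class="[^"]*author__desc[^>]*>)([^<]*</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' ' + match.group(2)),
        # insert " | " between article type and description
        (re.compile(r'(<span class="[^"]*article__kicker[^>]*>[^<]*)(</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' | ' + match.group(2))
    ]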

    extra_css = '''
        h2 { font-size: 1em; }
        h3 { font-size: 1em; }
        .article__desc { font-weight: bold; }
        .article__fact { font-weight: bold; text-transform: uppercase; }
        .article__kicker { text-transform: uppercase; }
        .article__legend { font-size: 0.6em; margin-bottom: 1em; }
        .article__title { margin-top: 0em; }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            try:
                br.open('https://secure.lemonde.fr/sfuser/connexion')
                br.select_form(nr=0)
                br['email'] = self.username
                br['password'] = self.password
                br.submit()
            except Exception as e:
                self.log('Login failed with error:', str(e))
        return br

    def get_cover_url(self):
        # today's date is a reasonable guess for the ID of the cover
        cover_id = date.today().strftime('%Y%m%d')
        soup = self.index_to_soup('https://www.lemonde.fr/')
        a = soup.find('a', {'id': 'jelec_link', 'style': True})
        if a and a['style']:
            url = a['style'].split('/')
            if len(url) > 5 and url[3].isdigit():
                # overwrite guess if actual cover ID was found
                cover_id = url[3]
        return 'https://www.lemonde.fr/thumbnail/journal/' + cover_id + '/1000/1490'

    def get_article_url(self, article):
        url = BasicNewsRecipe.get_article_url(self, article)
        # skip articles without relevant content (e.g., videos)
        for el in 'blog chat live podcasts portfolio video visuel'.split():
            if '/' + el + '/' in url:
                self.log('Skipping URL', url)
                self.abort_article()
        return url

    def preprocess_html(self, soup):
        # when an image is available in multiple sizes, select the smallest one
        for img in soup.find_all('img', {'data-srcset': True}):
            print("IMGDDPYL0 = ", img)
            data_srcset = img['data-srcset'].split()
            print("IMGDDPYL1 = ", data_srcset)
            if len(data_srcset) > 1:
                img['src'] = data_srcset[-2]
                print("IMGDDPYL2 = ", img['src'])
                del img['data-srcset']
        return soup

    def postprocess_html(self, soup, first_fetch):
        # remove local hyperlinks
        for a in soup.find_all('a', {'href': True}):
            if '.lemonde.fr/' in a['href']:
                a.replace_with(self.tag_to_string(a))
        # clean up header
        for ul in soup.find_all('ul', {'class': 'breadcrumb'}):
            div = soup.new_tag('div')
            category = ''
            for li in ul.find_all('li', {'class': True}):
                category += self.tag_to_string(li).strip().upper() + ' - '
                div.string = category[:-3]
            ul.replace_with(div)
        return soup

    calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
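To see what the image step in preprocess_html does in isolation, here is a small sketch of the same indexing on a plain string. The sample value and the helper name last_srcset_url are made up for illustration; whether the picked candidate is really the smallest image depends on the order in which the site lists them, which is not checked here.

```python
def last_srcset_url(data_srcset):
    """Return the URL of the last candidate in a srcset-style string.

    Mirrors the recipe's data_srcset[-2] lookup: after splitting on
    whitespace, the last token is a width descriptor and the one before
    it is that candidate's URL. Returns None for a single-token value.
    """
    tokens = data_srcset.split()
    if len(tokens) > 1:
        return tokens[-2]
    return None


# Hypothetical attribute value with two candidates.
sample = 'https://example.org/a.jpg 320w, https://example.org/b.jpg 640w'
picked = last_srcset_url(sample)
```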

Hello Villard,

Thank you very much for your reply and for sharing. It works very well for me!
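The recipe quoted above relies on oldest_article = 1 and does not contain the "editionDate" check discussed at the start of the thread. For reference, a minimal sketch of that check on the raw page text could look like the following. It assumes the page still embeds "editionDate":"YYYY-MM-DD" as described in the first post; edition_date and is_current_print_article are hypothetical helper names, and this is not the code Villard ended up using.

```python
import re
from datetime import date, timedelta

# Matches the marker quoted in the first post: "editionDate":"2023-02-11"
EDITION_RE = re.compile(r'"editionDate"\s*:\s*"(\d{4})-(\d{2})-(\d{2})"')


def edition_date(raw_html):
    """Return the print edition date found in the page, or None if absent."""
    m = EDITION_RE.search(raw_html)
    if m is None:
        return None
    year, month, day = (int(g) for g in m.groups())
    return date(year, month, day)


def is_current_print_article(raw_html, today):
    """True when the page carries an edition date of tomorrow or later.

    The print edition comes out in the afternoon dated the following day,
    hence the comparison against today + 1 day.
    """
    found = edition_date(raw_html)
    return found is not None and found >= today + timedelta(days=1)
```

Inside a recipe, following unkn0wn's suggestion, this could be called from preprocess_raw_html(self, raw_html, url) with date.today(), calling self.abort_article('Skipping old article') when it returns False and returning raw_html otherwise.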


