Old 02-13-2023, 03:29 AM   #1
Villard
Connoisseur
Posts: 64
Karma: 10
Join Date: May 2016
Device: Koreader running on Kobo Libra 2
Le Monde recipe: how to keep only the URLs of the printed edition?

Hello,
I'm using the recipe "Le Monde : édition abonnés" created by Sylvain Durand.
The daily ebook is large, around 24 MB, and it also includes some articles that were already in the previous day's ebook.
I would therefore like to keep only the URLs that belong to the printed newspaper.

Each article's HTML page indicates the date of the printed edition: "editionDate":"2023-02-11".
I would like to keep only the URLs whose editionDate is >= tomorrow, because the printed newspaper is published in the afternoon and carries the following day's date.
As this "editionDate" text sits inside a long script block, I think the best approach is to treat it like a comment in the HTML page.


Can you give me some hints on how to do this?
Thanks
Old 02-14-2023, 01:01 AM   #2
unkn0wn
Evangelist
Posts: 445
Karma: 82686
Join Date: May 2021
Device: kindle
You can use

def preprocess_raw_html(self, raw, *a):

and run re.search on raw to check whether the article belongs to the print edition, capture the date in a regex group, and parse it. The imports you need are

from calibre.utils.date import parse_date
from datetime import datetime, timedelta

Then check

if (today - date) > timedelta(1):
    self.abort_article('Skipping old article')

If an article is not from the print edition, or is more than a day old, call self.abort_article to skip it.

There may be other ways to do this, so look for similar code in other recipes.
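
The steps above can be sketched as follows. This is a minimal sketch, not the recipe's actual code: the regex assumes the page embeds the date exactly as "editionDate":"YYYY-MM-DD" (the form quoted in the first post), and the helper names edition_date and is_stale are hypothetical. It uses only the standard library so it can be tried outside calibre; inside a recipe, the same check would presumably sit in preprocess_raw_html and call self.abort_article when is_stale returns True.

```python
import re
from datetime import date, timedelta

# Assumed form of the embedded date, as quoted in the first post.
EDITION_RE = re.compile(r'"editionDate"\s*:\s*"(\d{4})-(\d{2})-(\d{2})"')


def edition_date(raw):
    """Return the print-edition date found in the raw HTML, or None."""
    m = EDITION_RE.search(raw)
    if m is None:
        return None  # no editionDate: treat as not part of the print edition
    year, month, day = (int(g) for g in m.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # digits matched but do not form a valid calendar date


def is_stale(raw, today=None, max_age=timedelta(1)):
    """True if the article should be skipped.

    An article is skipped when it carries no editionDate at all, or when
    its edition date is more than max_age behind today.
    """
    today = today or date.today()
    found = edition_date(raw)
    if found is None:
        return True
    return (today - found) > max_age


html = '<script>{"editionDate":"2023-02-11","other":1}</script>'
print(edition_date(html))                       # 2023-02-11
print(is_stale(html, today=date(2023, 2, 12)))  # one day old -> False
print(is_stale(html, today=date(2023, 2, 14)))  # three days old -> True
```

To follow the rule from the first post instead (keep only articles whose edition date is tomorrow or later), the last line of is_stale could become `return found < today + timedelta(1)`.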
Old 02-14-2023, 03:16 AM   #3
Villard
Thank you for the help! I am going to try your suggestions.
Villard
Old 02-15-2023, 12:49 PM   #4
Villard
Thanks to your suggestions, I was able to do it.
I used def preprocess_html(self, soup).
Thanks a lot.

I'll test the recipe for a while and then share it so it can be integrated into Calibre.

Villard

Last edited by Villard; 02-15-2023 at 01:41 PM.
Old 04-22-2023, 08:11 PM   #5
Muller
Member
Posts: 23
Karma: 10
Join Date: Mar 2018
Device: Kindle oasis
Hello, I'm taking the liberty of joining your thread because I was wondering whether your modification of the "Le Monde : édition abonnés" recipe works. Amazon has announced that it will soon end its newspaper and magazine subscriptions, and I am looking for a replacement.
Thanks in advance.
Old 01-06-2024, 02:58 AM   #6
Villard
Quote:
Originally Posted by Muller
Hello, I'm taking the liberty of joining your thread because I was wondering whether your modification of the "Le Monde : édition abonnés" recipe works. Amazon has announced that it will soon end its newspaper and magazine subscriptions, and I am looking for a replacement.
Thanks in advance.

Hello,
Sorry for not replying. I only discovered your post today. Below is the recipe I use, which works well. I do indeed need to share it!

I have listed all of Le Monde's RSS feeds! You can remove the feeds that don't interest you.
#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2012'

from calibre.web.feeds.news import BasicNewsRecipe, classes
from datetime import date
import re

class LeMonde(BasicNewsRecipe):
title = 'Le Monde'
__author__ = 'Martin Villard'
description = 'Les flux RSS du Monde.fr'
publisher = 'Société Editrice du Monde'
publication_type = 'newspaper'
needs_subscription = 'optional'
language = 'fr'

oldest_article = 1
no_stylesheets = False
remove_empty_feeds = True
ignore_duplicate_articles = {'title', 'url'}
reverse_article_order = True

conversion_options = {
'publisher': publisher
}

masthead_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Le_monde_logo.svg/800px-Le_monde_logo.svg.png'

feeds = [
('International : Europe ', 'https://www.lemonde.fr/europe/rss_full.xml'),
('International : Amériques ', 'https://www.lemonde.fr/ameriques/rss_full.xml'),
('International : Afrique ', 'https://www.lemonde.fr/afrique/rss_full.xml'),
('International : Asie Pacifique', 'https://www.lemonde.fr/asie-pacifique/rss_full.xml'),
('International : Proche-Orient', 'https://www.lemonde.fr/proche-orient/rss_full.xml'),
('International : Royaume-Uni', 'https://www.lemonde.fr/royaume-uni/rss_full.xml'),
('International : Etats-Unis', 'https://www.lemonde.fr/etats-unis/rss_full.xml'),
('International : La une', 'https://www.lemonde.fr/international/rss_full.xml'),
('France : Politique ', 'https://www.lemonde.fr/politique/rss_full.xml'),
('France : Société ', 'https://www.lemonde.fr/societe/rss_full.xml'),
('France : Les décodeurs', 'https://www.lemonde.fr/les-decodeurs/rss_full.xml'),
('France : Justice ', 'https://www.lemonde.fr/justice/rss_full.xml'),
('France : Police ', 'https://www.lemonde.fr/police/rss_full.xml'),
('France : Campus ', 'https://www.lemonde.fr/campus/rss_full.xml'),
('France : Education', 'https://www.lemonde.fr/education/rss_full.xml'),
('Economie : Entreprises ', 'https://www.lemonde.fr/entreprises/rss_full.xml'),
('Economie : Argent ', 'https://www.lemonde.fr/argent/rss_full.xml'),
('Economie : Économie française', 'https://www.lemonde.fr/economie-francaise/rss_full.xml'),
('Economie : Industrie', 'https://www.lemonde.fr/industrie/rss_full.xml'),
('Economie : Emploi ', 'https://www.lemonde.fr/emploi/rss_full.xml'),
('Economie : Immobilier ', 'https://www.lemonde.fr/immobilier/rss_full.xml'),
('Economie : Médias', 'https://www.lemonde.fr/medias/rss_full.xml'),
('Economie : La une', 'https://www.lemonde.fr/economie/rss_full.xml'),
('Planète: Climat ', 'https://www.lemonde.fr/climat/rss_full.xml'),
('Planète: Agriculture ', 'https://www.lemonde.fr/agriculture/rss_full.xml'),
('Planète: Environnement', 'https://www.lemonde.fr/environnement/rss_full.xml'),
('Planète: La une', 'https://www.lemonde.fr/planete/rss_full.xml'),
('Sciences : Espace ', 'https://www.lemonde.fr/espace/rss_full.xml'),
('Sciences : Biologie ', 'https://www.lemonde.fr/biologie/rss_full.xml'),
('Sciences : Médecine ', 'https://www.lemonde.fr/medecine/rss_full.xml'),
('Sciences : Physique ', 'https://www.lemonde.fr/physique/rss_full.xml'),
('Sciences : Santé', 'https://www.lemonde.fr/sante/rss_full.xml'),
('Sciences : La une', 'https://www.lemonde.fr/sciences/rss_full.xml'),
('Culture : Cinéma ', 'https://www.lemonde.fr/cinema/rss_full.xml'),
('Culture : Musiques ', 'https://www.lemonde.fr/musiques/rss_full.xml'),
('Culture : Télévision et radio', 'https://www.lemonde.fr/televisions-radio/rss_full.xml'),
('Culture : Le Monde des livres', 'https://www.lemonde.fr/livres/rss_full.xml'),
('Culture : Arts ', 'https://www.lemonde.fr/arts/rss_full.xml'),
('Culture : Scènes', 'https://www.lemonde.fr/scenes/rss_full.xml'),
('Culture : La une', 'https://www.lemonde.fr/culture/rss_full.xml'),
('Opinions : La une', 'https://www.lemonde.fr/idees/rss_full.xml'),
('Opinions : éditoriaux', 'https://www.lemonde.fr/editoriaux/rss_full.xml'),
('Opinions : chroniques ', 'https://www.lemonde.fr/chroniques/rss_full.xml'),
('Opinions : tribunes', 'https://www.lemonde.fr/tribunes/rss_full.xml'),
('Pixels : Jeux vidéo', 'https://www.lemonde.fr/jeux-video/rss_full.xml'),
('Pixels : Culture web', 'https://www.lemonde.fr/cultures-web/rss_full.xml'),
('Pixels : La une', 'https://www.lemonde.fr/pixels/rss_full.xml'),
('Sport : Football ', 'https://www.lemonde.fr/football/rss_full.xml'),
('Sport : Rugby ', 'https://www.lemonde.fr/rugby/rss_full.xml'),
('Sport : Tennis ', 'https://www.lemonde.fr/tennis/rss_full.xml'),
('Sport : Cyclisme ', 'https://www.lemonde.fr/cyclisme/rss_full.xml'),
('Sport : Basket', 'https://www.lemonde.fr/basket/rss_full.xml'),
('Sport : La une', 'https://www.lemonde.fr/sport/rss_full.xml'),
('M le mag : L’époque ', 'https://www.lemonde.fr/m-perso/rss_full.xml'),
('M le mag : Styles ', 'https://www.lemonde.fr/m-styles/rss_full.xml'),
('M le mag : Gastronomie ', 'https://www.lemonde.fr/gastronomie/rss_full.xml'),
('M le mag : Recettes du Monde', 'https://www.lemonde.fr/les-recettes-du-monde/rss_full.xml'),
('M le mag : Sexo', 'https://www.lemonde.fr/sexo/rss_full.xml'),
('M le mag : La une', 'https://www.lemonde.fr/m-le-mag/rss_full.xml'),
('Actualités : A la une', 'https://www.lemonde.fr/rss/une.xml'),
('Actualités : En continu', 'https://www.lemonde.fr/rss/en_continu.xml'),
('Actualités : Vidéos ', 'https://www.lemonde.fr/videos/rss_full.xml'),
('Actualités : Portfolios', 'https://www.lemonde.fr/photo/rss_full.xml'),
]

keep_only_tags = [
classes('article__header'),
dict(name='section', attrs={'class': ['article__cover', 'article__content', 'article__heading',
'article__wrapper']})
]

remove_tags = [
classes('article__status meta__reading-time meta__social multimedia-embed'),
dict(name=['footer', 'link']),
dict(name='img', attrs={'class': ['article__author-picture']}),
dict(name='section', attrs={'class': ['inread js-services-inread', 'catcher catcher--inline', 'inread inread--NL js-services-inread', 'article__reactions', 'author', 'catcher', 'portfolio', 'services-inread']})
]

remove_attributes = [
'data-sizes', 'height', 'sizes', 'width'
]

preprocess_regexps = [
# insert space between author name and description
(re.compile(r'(<span class="[^"]*author__desc[^>]*>)([^<]*</span>)',
re.IGNORECASE), lambda match: match.group(1) + ' ' + match.group(2)),
# insert " | " between article type and description
(re.compile(r'(<span class="[^"]*article__kicker[^>]*>[^<]*)(</span>)',
re.IGNORECASE), lambda match: match.group(1) + ' | ' + match.group(2))
]

extra_css = '''
h2 { font-size: 1em; }
h3 { font-size: 1em; }
.article__desc { font-weight: bold; }
.article__fact { font-weight: bold; text-transform: uppercase; }
.article__kicker { text-transform: uppercase; }
.article__legend { font-size: 0.6em; margin-bottom: 1em; }
.article__title { margin-top: 0em; }
'''

def get_browser(self):
br = BasicNewsRecipe.get_browser(self)
if self.username is not None and self.password is not None:
try:
br.open('https://secure.lemonde.fr/sfuser/connexion')
br.select_form(nr=0)
br['email'] = self.username
br['password'] = self.password
br.submit()
except Exception as e:
self.log('Login failed with error:', str(e))
return br

def get_cover_url(self):
# today's date is a reasonable guess for the ID of the cover
cover_id = date.today().strftime('%Y%m%d')
soup = self.index_to_soup('https://www.lemonde.fr/')
a = soup.find('a', {'id': 'jelec_link', 'style': True})
if a and a['style']:
url = a['style'].split('/')
if len(url) > 5 and url[3].isdigit():
# overwrite guess if actual cover ID was found
cover_id = url[3]
return 'https://www.lemonde.fr/thumbnail/journal/' + cover_id + '/1000/1490'

def get_article_url(self, article):
url = BasicNewsRecipe.get_article_url(self, article)
# skip articles without relevant content (e.g., videos)
for el in 'blog chat live podcasts portfolio video visuel'.split():
if '/' + el + '/' in url:
self.log('Skipping URL', url)
self.abort_article()
return url

def preprocess_html(self, soup):
# when an image is available in multiple sizes, select the smallest one
for img in soup.find_all('img', {'data-srcset': True}):
print ("IMGDDPYL0 = ", img)
data_srcset = img['data-srcset'].split()
print ("IMGDDPYL1 = ", data_srcset)
if len(data_srcset) > 1:
img['src'] = data_srcset[-2]
print("IMGDDPYL2 = " ,img['src'])
del img['data-srcset']
return soup

def postprocess_html(self, soup, first_fetch):
# remove local hyperlinks
for a in soup.find_all('a', {'href': True}):
if '.lemonde.fr/' in a['href']:
a.replace_with(self.tag_to_string(a))
# clean up header
for ul in soup.find_all('ul', {'class': 'breadcrumb'}):
div = soup.new_tag('div')
category = ''
for li in ul.find_all('li', {'class': True}):
category += self.tag_to_string(li).strip().upper() + ' - '
div.string = category[:-3]
ul.replace_with(div)
return soup


calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'

Last edited by Villard; 01-06-2024 at 03:02 AM.
Old 01-22-2024, 12:08 PM   #7
Teebob
Junior Member
Posts: 3
Karma: 10
Join Date: Jan 2024
Location: France
Device: Kindle Scribe
Hello, and thanks for sharing the code. I have tried many times in many different ways, but the recipe still produces a very large file (47 MB). Perhaps it is still extracting old articles? I cannot even load the file onto my Kindle. I tried to locate the piece of code that removes the old articles but couldn't find it. Could you give me a hint?

The other strange issue is that after I run the recipe, lemonde.fr stops working for me for about an hour: I get an error 406.
Old 01-23-2024, 03:58 AM   #8
unkn0wn
@Villard: the recipe should have been shared in [CODE] tags.
Share your recipe file here and I'll try to fix it.
Old 01-23-2024, 02:33 PM   #9
Teebob
Well, it actually does that when I run the default recipe (the one called "Le Monde: Édition abonnés"; I have pasted a copy below).
I tried again just now and it failed again, triggering that odd error 406 on lemonde.fr.

However, I noticed that the other Le Monde recipe (the basic one for non-subscribers) works. If I enter my password, I get the full articles, but I think it only covers a limited number of feeds.

Code:
#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
from __future__ import absolute_import, division, print_function, unicode_literals

__author__ = 'S. Durand <sylvaindurand@users.noreply.github.com>'
__license__ = 'GPL v3'

'''
lemonde.fr
'''

from calibre.web.feeds.news import BasicNewsRecipe, classes
from datetime import date
import re


class LeMondeNumerique(BasicNewsRecipe):
    title = 'Le Monde: Édition abonnés'
    __author__ = 'Sylvain Durand'
    description = 'La version numérique du quotidien Le Monde'
    publisher = 'Société Editrice du Monde'
    publication_type = 'newspaper'
    needs_subscription = True
    language = 'fr'

    no_stylesheets = True
    ignore_duplicate_articles = {'title', 'url'}

    conversion_options = {
        'publisher': publisher
    }

    masthead_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Le_monde_logo.svg/800px-Le_monde_logo.svg.png'

    lm_sections = [
        'international:International',
        'politique:Politique',
        'societe:Société',
        'economie:Éco',
        'culture:Culture',
        'idees:Idées',
        'planete:Planète',
        'sport:Sport',
        'sciences:Sciences',
        'pixels:Pixels',
        'campus:Campus'
    ]

    keep_only_tags = [
        classes('article__header'),
        dict(name='section', attrs={'class': ['article__content', 'article__heading',
                                              'article__wrapper']})
    ]

    remove_tags = [
        classes('article__status meta__date meta__reading-time meta__social multimedia-embed'),
        dict(name=['footer', 'link']),
        dict(name='img', attrs={'class': ['article__author-picture']}),
        dict(name='section', attrs={'class': ['article__reactions', 'author', 'catcher',
                                              'portfolio', 'services-inread']})
    ]

    remove_attributes = [
        'data-sizes', 'height', 'sizes', 'width'
    ]

    preprocess_regexps = [
        # insert space between author name and description
        (re.compile(r'(<span class="[^"]*author__desc[^>]*>)([^<]*</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' ' + match.group(2)),
        # insert " | " between article type and description
        (re.compile(r'(<span class="[^"]*article__kicker[^>]*>[^<]*)(</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' | ' + match.group(2))
    ]

    extra_css = '''
        h2 { font-size: 1em; }
        h3 { font-size: 1em; }
        .article__desc { font-weight: bold; }
        .article__fact { font-weight: bold; text-transform: uppercase; }
        .article__kicker { text-transform: uppercase; }
        .article__legend { font-size: 0.6em; margin-bottom: 1em; }
        .article__title { margin-top: 0em; }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            try:
                br.open('https://secure.lemonde.fr/sfuser/connexion')
                br.select_form(nr=0)
                br['email'] = self.username
                br['password'] = self.password
                br.submit()
            except Exception as e:
                self.log('Login failed with error:', str(e))
        return br

    def get_cover_url(self):
        # today's date is a reasonable guess for the ID of the cover
        cover_id = date.today().strftime('%Y%m%d')
        soup = self.index_to_soup('https://www.lemonde.fr/')
        a = soup.find('a', {'id': 'jelec_link', 'style': True})
        if a and a['style']:
            url = a['style'].split('/')
            if len(url) > 5 and url[3].isdigit():
                # overwrite guess if actual cover ID was found
                cover_id = url[3]
        return 'https://www.lemonde.fr/thumbnail/journal/' + cover_id + '/1000/1490'

    def parse_index(self):
        ans = []
        for x in self.lm_sections:
            s, section_title = x.partition(':')[::2]
            self.log('Processing section', section_title, '...')
            articles = list(self.parse_section('https://www.lemonde.fr/%s/' % s))
            if articles:
                ans.append((section_title, articles))
        return ans

    def parse_section(self, url):
        soup = self.index_to_soup(url)
        for article in soup.find_all('section', {'class': 'teaser'}):
            # extract URL
            a = article.find('a', {'class': 'teaser__link', 'href': True})
            if a is None:
                continue
            url = a['href']
            # skip articles without relevant content (e.g., videos)
            for el in 'blog chat live newsletters podcasts portfolio video visuel'.split():
                if '/' + el + '/' in url:
                    url = None
                    break
            if url is None:
                continue
            # extract title
            h3 = article.find('h3', {'class': 'teaser__title'})
            if h3 is None:
                continue
            title = self.tag_to_string(h3)
            # extract description
            desc = ''
            p = article.find('p', {'class': 'teaser__desc'})
            if p is not None:
                desc = self.tag_to_string(p)
            self.log('\tFound article', title, 'at', url)
            yield {'title': title, 'url': url, 'description': desc}

    def preprocess_html(self, soup):
        # when an image is available in multiple sizes, select the smallest one
        for img in soup.find_all('img', {'data-srcset': True}):
            data_srcset = img['data-srcset'].split()
            if len(data_srcset) > 1:
                img['src'] = data_srcset[-2]
                del img['data-srcset']
        return soup

    def postprocess_html(self, soup, first_fetch):
        # remove local hyperlinks
        for a in soup.find_all('a', {'href': True}):
            if '.lemonde.fr/' in a['href']:
                a.replace_with(self.tag_to_string(a))
        # clean up header
        for ul in soup.find_all('ul', {'class': 'breadcrumb'}):
            div = soup.new_tag('div')
            category = ''
            for li in ul.find_all('li', {'class': True}):
                category += self.tag_to_string(li).strip().upper() + ' - '
                div.string = category[:-3]
            ul.replace_with(div)
        return soup


calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
Old 01-24-2024, 02:09 AM   #10
unkn0wn
I was actually asking for @Villard's recipe, assuming you had fixed its indentation and tried it, since you thanked him for sharing the code.
The default recipe hasn't been updated to match his recipe.

If you think the recipe you attached works, you can simply replace def parse_index with the feeds list from Villard's recipe and add oldest_article = 1 to get all sections.
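
As a rough illustration of what oldest_article = 1 asks for: the value is in days, so feed entries published more than one day ago are left out. The sketch below mimics that cutoff in plain Python. It is not calibre's implementation, and the function name and the (title, published) tuples are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def recent_entries(entries, now, oldest_article=1.0):
    """Keep (title, published) pairs no older than oldest_article days."""
    cutoff = now - timedelta(days=oldest_article)
    return [(title, published) for title, published in entries
            if published >= cutoff]


now = datetime(2024, 1, 24, 8, 0, tzinfo=timezone.utc)
entries = [
    ('fresh', now - timedelta(hours=3)),
    ('yesterday evening', now - timedelta(hours=14)),
    ('two days ago', now - timedelta(days=2)),
]
for title, _ in recent_entries(entries, now):
    print(title)
```

Only the first two entries pass the one-day cutoff; the third is dropped.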
Old 01-24-2024, 11:03 AM   #11
Villard
Hello,
Here is my recipe. It works fine for me every day. You of course need a Le Monde subscription, and you need to pass your account credentials to the ebook-convert.exe command.
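
For reference, ebook-convert accepts --username and --password options when its input is a recipe file. A small Python sketch that assembles such a command follows; the file names and credentials are placeholders, and build_command is a hypothetical helper, not part of calibre.

```python
import shlex


def build_command(recipe, output, username, password):
    """Return the argv list for running a recipe with ebook-convert."""
    return [
        'ebook-convert', recipe, output,
        '--username', username,
        '--password', password,
    ]


cmd = build_command('lemonde.recipe', 'lemonde.epub',
                    'me@example.com', 'secret')
# Print a shell-quoted form of the command for inspection.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems if the password contains spaces or shell metacharacters.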

I know I have several recipes I'm working on that I should share and publish for a future Calibre version. Sorry not to have done it yet.


Code:
#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals

__license__ = 'GPL v3'
__copyright__ = '2012'



from calibre.web.feeds.news import BasicNewsRecipe, classes
from datetime import date
import re


class LeMonde(BasicNewsRecipe):
    title = 'Le Monde'
    __author__ = 'Martin Villard'
    description = 'Les flux RSS du Monde.fr'
    publisher = 'Société Editrice du Monde'
    publication_type = 'newspaper'
    needs_subscription = 'optional'
    language = 'fr'


    oldest_article = 1
    no_stylesheets = False
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}
    reverse_article_order = True


    conversion_options = {
        'publisher': publisher
    }

    masthead_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Le_monde_logo.svg/800px-Le_monde_logo.svg.png'

    feeds = [
('International : Europe ', 'https://www.lemonde.fr/europe/rss_full.xml'),
('International : Amériques ', 'https://www.lemonde.fr/ameriques/rss_full.xml'),
('International : Afrique ', 'https://www.lemonde.fr/afrique/rss_full.xml'),
('International : Asie Pacifique', 'https://www.lemonde.fr/asie-pacifique/rss_full.xml'),
('International : Proche-Orient', 'https://www.lemonde.fr/proche-orient/rss_full.xml'),
('International : Royaume-Uni', 'https://www.lemonde.fr/royaume-uni/rss_full.xml'),
('International : Etats-Unis', 'https://www.lemonde.fr/etats-unis/rss_full.xml'),
('International : La une', 'https://www.lemonde.fr/international/rss_full.xml'),
('France : Politique ', 'https://www.lemonde.fr/politique/rss_full.xml'),
('France : Société ', 'https://www.lemonde.fr/societe/rss_full.xml'),
('France : Les décodeurs', 'https://www.lemonde.fr/les-decodeurs/rss_full.xml'),
('France : Justice ', 'https://www.lemonde.fr/justice/rss_full.xml'),
('France : Police ', 'https://www.lemonde.fr/police/rss_full.xml'),
('France : Campus ', 'https://www.lemonde.fr/campus/rss_full.xml'),
('France : Education', 'https://www.lemonde.fr/education/rss_full.xml'),
('Economie : Entreprises ', 'https://www.lemonde.fr/entreprises/rss_full.xml'),
('Economie : Argent ', 'https://www.lemonde.fr/argent/rss_full.xml'),
('Economie : Économie française', 'https://www.lemonde.fr/economie-francaise/rss_full.xml'),
('Economie : Industrie', 'https://www.lemonde.fr/industrie/rss_full.xml'),
('Economie : Emploi ', 'https://www.lemonde.fr/emploi/rss_full.xml'),
('Economie : Immobilier ', 'https://www.lemonde.fr/immobilier/rss_full.xml'),
('Economie : Médias', 'https://www.lemonde.fr/medias/rss_full.xml'),
('Economie : La une', 'https://www.lemonde.fr/economie/rss_full.xml'),
('Planète: Climat ', 'https://www.lemonde.fr/climat/rss_full.xml'),
('Planète: Agriculture ', 'https://www.lemonde.fr/agriculture/rss_full.xml'),
('Planète: Environnement', 'https://www.lemonde.fr/environnement/rss_full.xml'),
('Planète: La une', 'https://www.lemonde.fr/planete/rss_full.xml'),
('Sciences : Espace ', 'https://www.lemonde.fr/espace/rss_full.xml'),
('Sciences : Biologie ', 'https://www.lemonde.fr/biologie/rss_full.xml'),
('Sciences : Médecine ', 'https://www.lemonde.fr/medecine/rss_full.xml'),
('Sciences : Physique ', 'https://www.lemonde.fr/physique/rss_full.xml'),
('Sciences : Santé', 'https://www.lemonde.fr/sante/rss_full.xml'),
('Sciences : La une', 'https://www.lemonde.fr/sciences/rss_full.xml'),
('Culture : Cinéma ', 'https://www.lemonde.fr/cinema/rss_full.xml'),
('Culture : Musiques ', 'https://www.lemonde.fr/musiques/rss_full.xml'),
('Culture : Télévision et radio', 'https://www.lemonde.fr/televisions-radio/rss_full.xml'),
('Culture : Le Monde des livres', 'https://www.lemonde.fr/livres/rss_full.xml'),
('Culture : Arts ', 'https://www.lemonde.fr/arts/rss_full.xml'),
('Culture : Scènes', 'https://www.lemonde.fr/scenes/rss_full.xml'),
('Culture : La une', 'https://www.lemonde.fr/culture/rss_full.xml'),
('Opinions : La une', 'https://www.lemonde.fr/idees/rss_full.xml'),
('Opinions : éditoriaux', 'https://www.lemonde.fr/editoriaux/rss_full.xml'),
('Opinions : chroniques ', 'https://www.lemonde.fr/chroniques/rss_full.xml'),
('Opinions : tribunes', 'https://www.lemonde.fr/tribunes/rss_full.xml'),
('Pixels : Jeux vidéo', 'https://www.lemonde.fr/jeux-video/rss_full.xml'),
('Pixels : Culture web', 'https://www.lemonde.fr/cultures-web/rss_full.xml'),
('Pixels : La une', 'https://www.lemonde.fr/pixels/rss_full.xml'),
('Sport : Football ', 'https://www.lemonde.fr/football/rss_full.xml'),
('Sport : Rugby ', 'https://www.lemonde.fr/rugby/rss_full.xml'),
('Sport : Tennis ', 'https://www.lemonde.fr/tennis/rss_full.xml'),
('Sport : Cyclisme ', 'https://www.lemonde.fr/cyclisme/rss_full.xml'),
('Sport : Basket', 'https://www.lemonde.fr/basket/rss_full.xml'),
('Sport : La une', 'https://www.lemonde.fr/sport/rss_full.xml'),
('M le mag : L’époque ', 'https://www.lemonde.fr/m-perso/rss_full.xml'),
('M le mag : Styles ', 'https://www.lemonde.fr/m-styles/rss_full.xml'),
('M le mag : Gastronomie ', 'https://www.lemonde.fr/gastronomie/rss_full.xml'),
('M le mag : Recettes du Monde', 'https://www.lemonde.fr/les-recettes-du-monde/rss_full.xml'),
('M le mag : Sexo', 'https://www.lemonde.fr/sexo/rss_full.xml'),
('M le mag : La une', 'https://www.lemonde.fr/m-le-mag/rss_full.xml'),
('Actualités : A la une', 'https://www.lemonde.fr/rss/une.xml'),
('Actualités : En continu', 'https://www.lemonde.fr/rss/en_continu.xml'),
('Actualités : Vidéos ', 'https://www.lemonde.fr/videos/rss_full.xml'),
('Actualités : Portfolios', 'https://www.lemonde.fr/photo/rss_full.xml'),       

    ]

    keep_only_tags = [
        classes('article__header'),
        dict(name='section', attrs={'class': ['article__cover', 'article__content', 'article__heading',
                                              'article__wrapper']})
    ]

    remove_tags = [
        classes('article__status meta__reading-time meta__social multimedia-embed'),
        dict(name=['footer', 'link']),
        dict(name='img', attrs={'class': ['article__author-picture']}),
        dict(name='section', attrs={'class': ['inread js-services-inread', 'catcher catcher--inline', 'inread inread--NL js-services-inread', 'article__reactions', 'author', 'catcher', 'portfolio', 'services-inread']})
    ]

    remove_attributes = [
        'data-sizes', 'height', 'sizes', 'width'
    ]

    preprocess_regexps = [
        # insert space between author name and description
        (re.compile(r'(<span class="[^"]*author__desc[^>]*>)([^<]*</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' ' + match.group(2)),
        # insert " | " between article type and description
        (re.compile(r'(<span class="[^"]*article__kicker[^>]*>[^<]*)(</span>)',
                    re.IGNORECASE), lambda match: match.group(1) + ' | ' + match.group(2))
    ]

    extra_css = '''
        h2 { font-size: 1em; }
        h3 { font-size: 1em; }
        .article__desc { font-weight: bold; }
        .article__fact { font-weight: bold; text-transform: uppercase; }
        .article__kicker { text-transform: uppercase; }
        .article__legend { font-size: 0.6em; margin-bottom: 1em; }
        .article__title { margin-top: 0em; }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            try:
                br.open('https://secure.lemonde.fr/sfuser/connexion')
                br.select_form(nr=0)
                br['email'] = self.username
                br['password'] = self.password
                br.submit()
            except Exception as e:
                self.log('Login failed with error:', str(e))
        return br

    def get_cover_url(self):
        # today's date is a reasonable guess for the ID of the cover
        cover_id = date.today().strftime('%Y%m%d')
        soup = self.index_to_soup('https://www.lemonde.fr/')
        a = soup.find('a', {'id': 'jelec_link', 'style': True})
        if a and a['style']:
            url = a['style'].split('/')
            if len(url) > 5 and url[3].isdigit():
                # overwrite guess if actual cover ID was found
                cover_id = url[3]
        return 'https://www.lemonde.fr/thumbnail/journal/' + cover_id + '/1000/1490'

    def get_article_url(self, article):
        url = BasicNewsRecipe.get_article_url(self, article)
        # skip articles without relevant content (e.g., videos)
        for el in 'blog chat live podcasts portfolio video visuel'.split():
            if '/' + el + '/' in url:
                self.log('Skipping URL', url)
                self.abort_article()
        return url
    
    
    def preprocess_html(self, soup):
        # when an image is available in multiple sizes, select the smallest one
        for img in soup.find_all('img', {'data-srcset': True}):
            data_srcset = img['data-srcset'].split()
            if len(data_srcset) > 1:
                img['src'] = data_srcset[-2]
                del img['data-srcset']
        return soup
        
    def postprocess_html(self, soup, first_fetch):
        # remove local hyperlinks
        for a in soup.find_all('a', {'href': True}):
            if '.lemonde.fr/' in a['href']:
                a.replace_with(self.tag_to_string(a))
        # clean up header
        for ul in soup.find_all('ul', {'class': 'breadcrumb'}):
            div = soup.new_tag('div')
            category = ''
            for li in ul.find_all('li', {'class': True}):
                category += self.tag_to_string(li).strip().upper() + ' - '
                div.string = category[:-3]
            ul.replace_with(div)
        return soup


calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
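
For the original question in this thread (keeping only the articles of the printed edition), here is a minimal sketch of the date check. It assumes the article page still embeds a string like "editionDate":"2023-02-11", as described in the first post. The helper name is_in_upcoming_print_edition and the regular expression are assumptions for illustration, not part of the recipe above.

```python
import re
from datetime import date, timedelta

# Assumed pattern, based on the "editionDate":"2023-02-11" string quoted in post #1.
EDITION_DATE_RE = re.compile(r'"editionDate"\s*:\s*"(\d{4}-\d{2}-\d{2})"')


def is_in_upcoming_print_edition(raw_html, today=None):
    """Return True if the page's editionDate is tomorrow or later.

    The printed paper comes out in the afternoon and carries the date of
    the following day, so anything dated today or earlier is treated as old.
    """
    today = today or date.today()
    match = EDITION_DATE_RE.search(raw_html)
    if match is None:
        # No edition date found: treat the page as not part of the print edition.
        return False
    try:
        edition = date.fromisoformat(match.group(1))
    except ValueError:
        # Matched the pattern but is not a real calendar date.
        return False
    return edition >= today + timedelta(days=1)
```

Inside the recipe, this could be called from preprocess_raw_html(self, raw_html, url), calling self.abort_article('Skipping old article') when it returns False and returning raw_html otherwise. I have not tested that against the live site.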

Last edited by Villard; 01-24-2024 at 11:17 AM.
Old 01-25-2024, 10:12 AM   #12
unkn0wn
https://github.com/unkn0w7n/calibre/...cbd59a35fb3793
Old 01-27-2024, 04:08 AM   #13
Teebob
Hi
Thanks Villard.
I have been using your new recipe for a couple of days now. It works fine!
Old 01-31-2024, 11:16 PM   #14
Muller
Quote:
Originally Posted by Villard
Hello
Sorry for not replying sooner. I only discovered your post today. Below is the recipe I use, which works well. I really should share it!

I have listed all of Le Monde's RSS feeds! You can remove the feeds that don't interest you.
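
As a sketch of trimming the feed list, the snippet below keeps only the feeds whose title starts with a wanted section name. The helper select_feeds and the shortened ALL_FEEDS list are hypothetical and only for illustration. In the recipe itself you can equally just delete the lines you do not want from feeds.

```python
# Shortened sample of the (title, url) pairs used in the recipe's feeds list.
ALL_FEEDS = [
    ('International : Europe ', 'https://www.lemonde.fr/europe/rss_full.xml'),
    ('France : Politique ', 'https://www.lemonde.fr/politique/rss_full.xml'),
    ('Planète: Climat ', 'https://www.lemonde.fr/climat/rss_full.xml'),
    ('Sport : Football ', 'https://www.lemonde.fr/football/rss_full.xml'),
]


def select_feeds(feeds, wanted_sections):
    """Keep feeds whose section (the text before the first ':') is wanted."""
    wanted = {section.strip().lower() for section in wanted_sections}
    return [
        (title, url)
        for title, url in feeds
        if title.split(':', 1)[0].strip().lower() in wanted
    ]


# Example: keep only the International and Planète sections.
feeds = select_feeds(ALL_FEEDS, ['International', 'Planète'])
```

Matching on the text before the colon tolerates the inconsistent spacing in the titles ('Planète: Climat ' versus 'France : Politique ').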
Hello dear Villard,

Thank you very much for your reply and for sharing. It works very well for me!
All times are GMT -4. The time now is 03:52 PM.