View Single Post
Old 04-15-2019, 02:35 AM   #7
AlessandroD92
Junior Member
AlessandroD92 began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Apr 2019
Device: Kindle PaperWhite
code update

Thank you very much for your code. However, I have noticed that there is no limitation on the oldest articles that can be downloaded. As far as I am concerned, I would like to download articles that are no older than one day. I did some research online and modified the code.
I would like to emphasize that I have no experience at all in coding Calibre recipes. I just googled some snippets of code, and everything seems to be working perfectly. In short, I have modified the filter applied to the article URL. For example, if you are looking for articles from today, 15th April 2019, the URL of the article should start with https://www.ilpost.it/2019/04/15. Therefore, the code below downloads the articles of the current day and the day before by looking at their URLs.
Once again, I am not an expert and I don't know the meaning of every line of the code; I just used my intuition to update it. If you find a better approach for downloading articles no older than a specific number of days, please let me know.

Code:
#!/usr/bin/env python2
##
# Title:        Il Post recipe for calibre
# Author:       Marco Scirea, based on a recipe by frafra
# Contact:      marco.prolog at gmail.com
##
# License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
# Copyright:    Copyright 2019 Marco Scirea
##

from __future__ import absolute_import, division, print_function, unicode_literals
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.utils.magick import Image
from datetime import datetime, timedelta
# Cut-off date: articles whose URL date is older than this are skipped.
# NOTE: evaluated once at module load time, which is when calibre runs the recipe.
yesterday = datetime.now() - timedelta(days=1)
# ----------- CUSTOMIZATION OPTIONS START -----------

# Comment (add # in front) to disable the sections you are not interested in
# Commenta (aggiungi # davanti alla riga) per disabilitare le sezioni che non vuoi scaricare
# Each entry is a (feed title, section index URL) pair consumed by parse_index().
sections = [
    ("Prima Pagina", "https://www.ilpost.it/"),
    ("Italia", "https://www.ilpost.it/italia/"),
    ("Mondo", "https://www.ilpost.it/mondo/"),
    ("Politica", "https://www.ilpost.it/politica/"),
    ("Tecnologia", "https://www.ilpost.it/tecnologia/"),
    ("Internet", "https://www.ilpost.it/internet/"),
    ("Scienza", "https://www.ilpost.it/scienza/"),
    ("Cultura", "https://www.ilpost.it/cultura/"),
    ("Economia", "https://www.ilpost.it/economia/"),
    ("Sport", "https://www.ilpost.it/sport/"),
    ("Media", "https://www.ilpost.it/media/"),
    ("Moda", "https://www.ilpost.it/moda/"),
    ("Libri", "https://www.ilpost.it/libri/"),
    ("Auto", "https://www.ilpost.it/auto/"),
    ("Konrad", "https://www.ilpost.it/europa/"),
]

# Change this to True if you want grey images
# (saves space on e-ink devices; set to False to keep colour images)
convert_to_grayscale = True

# ----------- CUSTOMIZATION OPTIONS OVER -----------

# Link-title boilerplate stripped from the front of article titles in parse_page().
prefixes = {"Permalink to", "Commenta", "Link all'articolo"}


class IlPost(BasicNewsRecipe):
    """Calibre recipe for ilpost.it.

    Downloads the sections listed in the module-level ``sections`` table,
    keeping only articles whose URL carries today's or yesterday's date
    (Il Post article URLs start with ``https://www.ilpost.it/YYYY/MM/DD``).
    """

    __author__ = 'Marco Scirea'
    __license__ = 'GPL v3'
    __copyright__ = '2019, Marco Scirea <marco.prolog at gmail.com>'

    title = "Il Post"
    language = "it"
    description = (
        'Puoi decidere quali sezioni scaricare modificando la ricetta.'
        ' Di default le immagini sono convertite in scala di grigio per risparmiare spazio,'
        ' la ricetta puo\' essere configurata per tenerle a colori'
    )
    tags = "news"
    cover_url = "https://www.ilpost.it/wp-content/themes/ilpost/images/ilpost.svg"
    ignore_duplicate_articles = {"title", "url"}
    no_stylesheets = True
    keep_only_tags = [dict(id=["expanding", "singleBody"])]

    def parse_page(self, name, url):
        """Return ``(name, entries)`` for one section index page.

        ``entries`` is a list of ``{"url": ..., "title": ...}`` dicts for
        every article link dated today or yesterday.
        """
        self.log.debug(url)
        soup = self.index_to_soup(url)
        entries = []
        # Accepted URL prefixes: today's and yesterday's date paths.
        # BUG FIX: the original called time.strftime() without importing
        # `time` (NameError) and combined the two tests with bitwise `|`;
        # str.startswith accepts a tuple of prefixes instead.
        allowed = tuple(
            "https://www.ilpost.it" + day.strftime('/%Y/%m/%d')
            for day in (datetime.now(), yesterday)
        )
        for article in soup.findAll('article'):
            for link in article.findAll('a', href=True, title=True):
                if not link["href"].startswith(allowed):
                    continue
                title = link["title"]
                for prefix in prefixes:
                    if title.startswith(prefix):
                        # BUG FIX: str.lstrip(prefix) strips a *character
                        # set*, not the prefix string, so it could also eat
                        # leading letters of the real title. Slice instead
                        # (str.removeprefix is not available on Python 2).
                        title = title[len(prefix):]
                        break
                title = title.strip()
                entries.append({
                    "url": link["href"],
                    "title": title,
                })
        return (name, entries)

    def populate_article_metadata(self, article, soup, first):
        """Copy the page's <meta name="description"> into the article summary."""
        description = soup.find(attrs={"name": "description"})
        summary = description["content"] if description else "No meta description given"
        article.summary = summary
        article.text_summary = summary

    def parse_index(self):
        """Build the feed list by scraping every configured section page."""
        feeds = []
        for name, url in sections:
            feeds.append(self.parse_page(name, url))
        return feeds

    def postprocess_html(self, soup, first):
        """Optionally convert every downloaded image to grayscale in place."""
        if convert_to_grayscale:
            # Process all the images (tag.has_key is the BeautifulSoup 3 /
            # Python 2 API used by calibre's bundled parser).
            for tag in soup.findAll(lambda tag: tag.name.lower() == 'img' and tag.has_key('src')):
                iurl = tag['src']
                img = Image()
                img.open(iurl)
                # calibre magick convention: a falsy/negative image object
                # signals an allocation failure.
                if img < 0:
                    raise RuntimeError('Out of memory')
                img.type = "GrayscaleType"
                img.save(iurl)
        return soup
AlessandroD92 is offline   Reply With Quote