MobileRead Forums - View Single Post

marco.prolog · 01-01-2019, 08:02 AM

This is an updated recipe for "Il Post" (Italian newspaper) based on Frafra's recipe.
Improvements:

Downloads articles divided in sections, by scraping the different URLs of the website
Allows for customization of what sections to download (by editing the recipe)
Properly populates the article's descriptions
Converts all images to gray-scale (there is an option to disable this, in case you're using a color e-reader)
Ignore "bits" (very short articles) and most photo galleries
Fixed downloading of duplicate articles
Fixed occasional article title starting with "Link to..."

Spoiler:

Code:

#!/usr/bin/env python2
##
# Title:        Il Post recipe for calibre
# Author:       Marco Scirea, based on a recipe by frafra
# Contact:      marco.prolog at gmail.com
##
# License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
# Copyright:    Copyright 2019 Marco Scirea
##

from __future__ import absolute_import, division, print_function, unicode_literals
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.utils.magick import Image

#----------- CUSTOMIZATION OPTIONS START -----------

# Comment (add # in front) to disable the sections you are not interested in
# Commenta (aggiungi # davanti alla riga) per disabilitare le sezioni che non vuoi scaricare
sections = [
    ("Prima Pagina", "https://www.ilpost.it/"),
    ("Mondo", "https://www.ilpost.it/mondo/"),
    ("Politica", "https://www.ilpost.it/politica/"),
    ("Tecnologia", "https://www.ilpost.it/tecnologia/"),
    ("Internet", "https://www.ilpost.it/internet/")
    ("Scienza", "https://www.ilpost.it/scienza/"),
    ("Cultura", "https://www.ilpost.it/cultura/"),
    ("Economia", "https://www.ilpost.it/economia/"),
    ("Sport", "https://www.ilpost.it/sport/"),
    ("Media", "https://www.ilpost.it/media/"),
    ("Moda", "https://www.ilpost.it/moda/"),
    ("Libri", "https://www.ilpost.it/libri/"),
    ("Auto", "https://www.ilpost.it/auto/"),
    ("Konrad", "https://www.ilpost.it/europa/"),
]

# Change this to False if you want color images (e.g. if you're reading on a Kindle Fire)
convert_to_grayscale = True

#----------- CUSTOMIZATION OPTIONS OVER -----------

prefixes = {"Permalink to", "Commenta", "Link all'articolo"}

class IlPost(BasicNewsRecipe):
    __author__ = 'Marco Scirea, based on a recipe by frafra'
    __license__ = 'GPL v3'
    __copyright__ = '2019, Marco Scirea <marco.prolog at gmail.com>'
    
    title = "Il Post"
    language = "it"
    description = 'Puoi decidere quali sezioni scaricare modificando la ricetta. Di default le immagini sono convertite in scala di grigio per risparmiare spazio, la ricetta puo\' essere configurata per tenerle a colori'
    tags = "news"
    cover_url = "https://www.ilpost.it/wp-content/themes/ilpost/images/ilpost.svg"
    ignore_duplicate_articles = {"title","url"}
    no_stylesheets = True
    keep_only_tags = [dict(id=["expanding", "singleBody"])]
            
    def parse_page(self, name, url):
        self.log.debug(url)
        soup = self.index_to_soup(url)
        entries = []
        for article in soup.findAll('article'):
            for link in article.findAll('a', href=True, title=True):
                if not link["href"].startswith("https://www.ilpost.it/20"):
                    continue
                title = link["title"]
                for prefix in prefixes:
                    if title.startswith(prefix):
                        title = title.lstrip(prefix)
                        break
                title = title.strip()
                entries.append({
                    "url": link["href"],
                    "title": title,
                })
        return (name, entries)
    
    def populate_article_metadata(self, article, soup, first):
        description = soup.find(attrs={"name":"description"})
        article.summary = description["content"] if description else "No meta description given"
        article.text_summary = description["content"] if description else "No meta description given"

    def parse_index(self):
        feeds = []
        #feeds.append(self.parse_page("Front Page", "https://www.ilpost.it/"))
        for section in sections:
            feeds.append(self.parse_page(section[0], section[1]))
        return feeds
    
    # Image conversion to greyscale by Starson17
    # https://www.mobileread.com/forums/showpost.php?p=1814815&postcount=15
    def postprocess_html(self, soup, first):
        if convert_to_grayscale:
            #process all the images
            for tag in soup.findAll(lambda tag: tag.name.lower()=='img' and tag.has_key('src')):
                iurl = tag['src']
                img = Image()
                img.open(iurl)
                if img < 0:
                    raise RuntimeError('Out of memory')
                img.type = "GrayscaleType"
                img.save(iurl)
        return soup

I couldn't find any mention to how to add a custom icon to the recipe (I'm guessing it can only be done for built-in recipes); anyway I attach an icon that could be used if it is accepted in the built-in recipes, and the script in file format for easy downloading

Happy new year!