Old 01-01-2019, 07:02 AM   #1
marco.prolog
Junior Member
Posts: 3
Karma: 10
Join Date: Jan 2019
Device: Kindle Oasis
New recipe for "Il Post"

This is an updated recipe for "Il Post" (Italian newspaper), based on Frafra's recipe.
Improvements:
  • Downloads articles divided into sections, by scraping the different URLs of the website
  • Allows customization of which sections to download (by editing the recipe)
  • Properly populates the articles' descriptions
  • Converts all images to grayscale (there is an option to disable this, in case you're using a color e-reader)
  • Ignores "bits" (very short articles) and most photo galleries
  • Fixed downloading of duplicate articles
  • Fixed occasional article titles starting with "Link to..."
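A side note on the "Link to..." fix above: Python's str.lstrip strips a *character set*, not a literal prefix, so prefix removal is safer done with slicing. A minimal sketch (the prefixes mirror the ones in the recipe; strip_prefix is an illustrative helper, not part of calibre):

```python
prefixes = {"Permalink to", "Commenta", "Link all'articolo"}

def strip_prefix(title):
    """Remove a known boilerplate prefix from an article title, if present."""
    for prefix in prefixes:
        if title.startswith(prefix):
            # Slice off the literal prefix; lstrip(prefix) would instead keep
            # removing any character that appears anywhere in the prefix string.
            return title[len(prefix):].strip()
    return title.strip()

print(strip_prefix("Permalink to il nuovo governo"))  # -> il nuovo governo
# The lstrip pitfall: it also eats the "il n" of the real title,
# because those characters all occur in "Permalink to".
print("Permalink to il nuovo governo".lstrip("Permalink to"))  # -> uovo governo
```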

Spoiler:
Code:
#!/usr/bin/env python2
##
# Title:        Il Post recipe for calibre
# Author:       Marco Scirea, based on a recipe by frafra
# Contact:      marco.prolog at gmail.com
##
# License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
# Copyright:    Copyright 2019 Marco Scirea
##

from __future__ import absolute_import, division, print_function, unicode_literals
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.utils.magick import Image

#----------- CUSTOMIZATION OPTIONS START -----------

# Comment (add # in front) to disable the sections you are not interested in
# Commenta (aggiungi # davanti alla riga) per disabilitare le sezioni che non vuoi scaricare
sections = [
    ("Prima Pagina", "https://www.ilpost.it/"),
    ("Mondo", "https://www.ilpost.it/mondo/"),
    ("Politica", "https://www.ilpost.it/politica/"),
    ("Tecnologia", "https://www.ilpost.it/tecnologia/"),
    ("Internet", "https://www.ilpost.it/internet/"),
    ("Scienza", "https://www.ilpost.it/scienza/"),
    ("Cultura", "https://www.ilpost.it/cultura/"),
    ("Economia", "https://www.ilpost.it/economia/"),
    ("Sport", "https://www.ilpost.it/sport/"),
    ("Media", "https://www.ilpost.it/media/"),
    ("Moda", "https://www.ilpost.it/moda/"),
    ("Libri", "https://www.ilpost.it/libri/"),
    ("Auto", "https://www.ilpost.it/auto/"),
    ("Konrad", "https://www.ilpost.it/europa/"),
]

# Change this to False if you want color images (e.g. if you're reading on a Kindle Fire)
convert_to_grayscale = True

#----------- CUSTOMIZATION OPTIONS OVER -----------

prefixes = {"Permalink to", "Commenta", "Link all'articolo"}

class IlPost(BasicNewsRecipe):
    __author__ = 'Marco Scirea, based on a recipe by frafra'
    __license__ = 'GPL v3'
    __copyright__ = '2019, Marco Scirea <marco.prolog at gmail.com>'
    
    title = "Il Post"
    language = "it"
    description = 'Puoi decidere quali sezioni scaricare modificando la ricetta. Di default le immagini sono convertite in scala di grigio per risparmiare spazio, la ricetta puo\' essere configurata per tenerle a colori'
    tags = "news"
    cover_url = "https://www.ilpost.it/wp-content/themes/ilpost/images/ilpost.svg"
    ignore_duplicate_articles = {"title","url"}
    no_stylesheets = True
    keep_only_tags = [dict(id=["expanding", "singleBody"])]
            
    def parse_page(self, name, url):
        self.log.debug(url)
        soup = self.index_to_soup(url)
        entries = []
        for article in soup.findAll('article'):
            for link in article.findAll('a', href=True, title=True):
                if not link["href"].startswith("https://www.ilpost.it/20"):
                    continue
                title = link["title"]
                for prefix in prefixes:
                    if title.startswith(prefix):
                        # slice off the literal prefix (lstrip strips a character set)
                        title = title[len(prefix):]
                        break
                title = title.strip()
                entries.append({
                    "url": link["href"],
                    "title": title,
                })
        return (name, entries)
    
    def populate_article_metadata(self, article, soup, first):
        description = soup.find(attrs={"name":"description"})
        article.summary = description["content"] if description else "No meta description given"
        article.text_summary = description["content"] if description else "No meta description given"

    def parse_index(self):
        feeds = []
        #feeds.append(self.parse_page("Front Page", "https://www.ilpost.it/"))
        for section in sections:
            feeds.append(self.parse_page(section[0], section[1]))
        return feeds
    
    # Image conversion to greyscale by Starson17
    # https://www.mobileread.com/forums/showpost.php?p=1814815&postcount=15
    def postprocess_html(self, soup, first):
        if convert_to_grayscale:
            #process all the images
            for tag in soup.findAll('img', src=True):
                iurl = tag['src']
                img = Image()
                img.open(iurl)
                if img < 0:
                    raise RuntimeError('Out of memory')
                img.type = "GrayscaleType"
                img.save(iurl)
        return soup


I couldn't find any mention of how to add a custom icon to a recipe (I'm guessing it can only be done for built-in recipes). In any case, I've attached an icon that could be used if the recipe is accepted into the built-in set, along with the recipe file for easy downloading.

Happy new year!
Attached Files
File Type: recipe Il Post.recipe (4.2 KB, 267 views)

Last edited by marco.prolog; 01-01-2019 at 07:04 AM.
Old 01-02-2019, 12:12 AM   #2
kovidgoyal
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Thanks, https://github.com/kovidgoyal/calibr...f8da0db3710285
Old 01-19-2019, 04:39 AM   #3
marco.prolog
Hi Kovid,

I added a check to remove the few photo galleries that were still being scraped.
I also noticed that in the version in the calibre repository the lines are:
Code:
...
if convert_to_grayscale:
    def postprocess_html(self, soup, first):
...
while they should be
Code:
...
def postprocess_html(self, soup, first):
    if convert_to_grayscale:
...
Spoiler:
Code:
#!/usr/bin/env python2
##
# Title:        Il Post recipe for calibre
# Author:       Marco Scirea, based on a recipe by frafra
# Contact:      marco.prolog at gmail.com
##
# License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
# Copyright:    Copyright 2019 Marco Scirea
##

from __future__ import absolute_import, division, print_function, unicode_literals
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.utils.magick import Image

# ----------- CUSTOMIZATION OPTIONS START -----------

# Comment (add # in front) to disable the sections you are not interested in
# Commenta (aggiungi # davanti alla riga) per disabilitare le sezioni che non vuoi scaricare
sections = [
    ("Prima Pagina", "https://www.ilpost.it/"),
    ("Mondo", "https://www.ilpost.it/mondo/"),
    ("Politica", "https://www.ilpost.it/politica/"),
    ("Tecnologia", "https://www.ilpost.it/tecnologia/"),
    ("Internet", "https://www.ilpost.it/internet/"),
    ("Scienza", "https://www.ilpost.it/scienza/"),
    ("Cultura", "https://www.ilpost.it/cultura/"),
    ("Economia", "https://www.ilpost.it/economia/"),
    ("Sport", "https://www.ilpost.it/sport/"),
    ("Media", "https://www.ilpost.it/media/"),
    ("Moda", "https://www.ilpost.it/moda/"),
    ("Libri", "https://www.ilpost.it/libri/"),
    ("Auto", "https://www.ilpost.it/auto/"),
    ("Konrad", "https://www.ilpost.it/europa/"),
]

# Change this to False if you want color images
convert_to_grayscale = True

# ----------- CUSTOMIZATION OPTIONS OVER -----------

prefixes = {"Permalink to", "Commenta", "Link all'articolo"}


class IlPost(BasicNewsRecipe):
    __author__ = 'Marco Scirea'
    __license__ = 'GPL v3'
    __copyright__ = '2019, Marco Scirea <marco.prolog at gmail.com>'

    title = "Il Post"
    language = "it"
    description = ('Puoi decidere quali sezioni scaricare modificando la ricetta.'
                   ' Di default le immagini sono convertite in scala di grigio per risparmiare spazio,'
                   ' la ricetta puo\' essere configurata per tenerle a colori')
    tags = "news"
    cover_url = "https://www.ilpost.it/wp-content/themes/ilpost/images/ilpost.svg"
    ignore_duplicate_articles = {"title", "url"}
    no_stylesheets = True
    keep_only_tags = [dict(id=["expanding", "singleBody"])]

    def parse_page(self, name, url):
        self.log.debug(url)
        soup = self.index_to_soup(url)
        entries = []
        for article in soup.findAll('article'):
            for link in article.findAll('a', href=True, title=True):
                if not link["href"].startswith("https://www.ilpost.it/20"):
                    continue
                title = link["title"]
                for prefix in prefixes:
                    if title.startswith(prefix):
                        # slice off the literal prefix (lstrip strips a character set)
                        title = title[len(prefix):]
                        break
                title = title.strip()
                entries.append({
                    "url": link["href"],
                    "title": title,
                })
        return (name, entries)

    def populate_article_metadata(self, article, soup, first):
        description = soup.find(attrs={"name": "description"})
        article.summary = description["content"] if description else "No meta description given"
        article.text_summary = description["content"] if description else "No meta description given"

    def parse_index(self):
        feeds = []
        for section in sections:
            feeds.append(self.parse_page(section[0], section[1]))
        return feeds

    # Image conversion to greyscale by Starson17
    # https://www.mobileread.com/forums/sh...5&postcount=15
    def postprocess_html(self, soup, first):
        if convert_to_grayscale:
            # process all the images
            for tag in soup.findAll('img', src=True):
                iurl = tag['src']
                img = Image()
                img.open(iurl)
                img.type = "GrayscaleType"
                img.save(iurl)
        return soup

    def preprocess_html(self, soup):
        gallery_items = soup.findAll("figure", {"class": "gallery-item"})
        if gallery_items:
            self.abort_article()
        return soup
Attached Files
File Type: recipe Il Post.recipe (4.3 KB, 151 views)
Old 01-20-2019, 12:09 AM   #4
kovidgoyal
The change to the if statement is deliberate; it has the same effect, but looks nicer to me.
Old 01-20-2019, 06:50 AM   #5
marco.prolog
OK, sorry about that then, Kovid.
I didn't know that you could put function definitions inside if statements in Python :P
Old 01-20-2019, 07:14 AM   #6
kovidgoyal
Yeah, classes are all created dynamically at runtime, and class scope is basically like a function scope.
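For anyone curious, a minimal sketch of what this means in practice: a class body executes top to bottom when the class is created, so a method can be defined conditionally. ENABLE_EXTRA and Demo are illustrative names, analogous to convert_to_grayscale in the recipe:

```python
ENABLE_EXTRA = True  # hypothetical flag, analogous to the recipe's convert_to_grayscale

class Demo(object):
    def always(self):
        return "always here"

    # The class body executes top to bottom at class-creation time,
    # so this def only runs if the flag is set.
    if ENABLE_EXTRA:
        def extra(self):
            return "conditionally defined"

d = Demo()
print(d.always())           # -> always here
print(hasattr(d, "extra"))  # -> True, only because ENABLE_EXTRA was True
```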
Old 04-15-2019, 02:35 AM   #7
AlessandroD92
Junior Member
Posts: 3
Karma: 10
Join Date: Apr 2019
Device: Kindle PaperWhite
code update

Thank you very much for your code. However, I noticed that there is no limit on how old the downloaded articles can be. In my case, I would like to download articles that are no older than one day. I did some research online and modified the code.
I should emphasize that I have no experience at all with coding calibre recipes. I just googled some snippets of code, and everything seems to work. In short, I modified the filter on the article URL. For example, articles published today, 15th April 2019, have URLs starting with https://www.ilpost.it/2019/04/15. The code below therefore downloads the articles of the current day and the day before by looking at their URLs.
Once again, I am not an expert and I don't know the meaning of every line of the code; I just used my intuition to update it. If you find a better approach to downloading articles no older than a specific number of days, please let me know.
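The URL-based date filter can be sketched in isolation. This is only an illustration of the idea, not part of the recipe; recent_prefixes and is_recent are made-up helper names:

```python
from datetime import datetime, timedelta

def recent_prefixes(base="https://www.ilpost.it", days=2, now=None):
    """Build the URL prefixes matching articles from the last `days` days.

    Il Post article URLs embed the publication date, e.g.
    https://www.ilpost.it/2019/04/15/some-article/
    """
    now = now or datetime.now()
    return [base + (now - timedelta(days=d)).strftime('/%Y/%m/%d') for d in range(days)]

def is_recent(url, prefixes):
    """True if the URL's embedded date falls within the allowed window."""
    return any(url.startswith(p) for p in prefixes)

prefixes = recent_prefixes(now=datetime(2019, 4, 15))
print(prefixes)
# -> ['https://www.ilpost.it/2019/04/15', 'https://www.ilpost.it/2019/04/14']
print(is_recent("https://www.ilpost.it/2019/04/14/example/", prefixes))  # -> True
print(is_recent("https://www.ilpost.it/2019/03/01/old/", prefixes))      # -> False
```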

Code:
#!/usr/bin/env python2
##
# Title:        Il Post recipe for calibre
# Author:       Marco Scirea, based on a recipe by frafra
# Contact:      marco.prolog at gmail.com
##
# License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
# Copyright:    Copyright 2019 Marco Scirea
##

from __future__ import absolute_import, division, print_function, unicode_literals
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.utils.magick import Image
from datetime import datetime, timedelta
yesterday = datetime.now() - timedelta(days=1)
# ----------- CUSTOMIZATION OPTIONS START -----------

# Comment (add # in front) to disable the sections you are not interested in
# Commenta (aggiungi # davanti alla riga) per disabilitare le sezioni che non vuoi scaricare
sections = [
    ("Prima Pagina", "https://www.ilpost.it/"),
    ("Italia", "https://www.ilpost.it/italia/"),
    ("Mondo", "https://www.ilpost.it/mondo/"),
    ("Politica", "https://www.ilpost.it/politica/"),
    ("Tecnologia", "https://www.ilpost.it/tecnologia/"),
    ("Internet", "https://www.ilpost.it/internet/"),
    ("Scienza", "https://www.ilpost.it/scienza/"),
    ("Cultura", "https://www.ilpost.it/cultura/"),
    ("Economia", "https://www.ilpost.it/economia/"),
    ("Sport", "https://www.ilpost.it/sport/"),
    ("Media", "https://www.ilpost.it/media/"),
    ("Moda", "https://www.ilpost.it/moda/"),
    ("Libri", "https://www.ilpost.it/libri/"),
    ("Auto", "https://www.ilpost.it/auto/"),
    ("Konrad", "https://www.ilpost.it/europa/"),
]

# Change this to False if you want color images
convert_to_grayscale = True

# ----------- CUSTOMIZATION OPTIONS OVER -----------

prefixes = {"Permalink to", "Commenta", "Link all'articolo"}


class IlPost(BasicNewsRecipe):
    __author__ = 'Marco Scirea'
    __license__ = 'GPL v3'
    __copyright__ = '2019, Marco Scirea <marco.prolog at gmail.com>'

    title = "Il Post"
    language = "it"
    description = (
        'Puoi decidere quali sezioni scaricare modificando la ricetta.'
        ' Di default le immagini sono convertite in scala di grigio per risparmiare spazio,'
        ' la ricetta puo\' essere configurata per tenerle a colori'
    )
    tags = "news"
    cover_url = "https://www.ilpost.it/wp-content/themes/ilpost/images/ilpost.svg"
    ignore_duplicate_articles = {"title", "url"}
    no_stylesheets = True
    keep_only_tags = [dict(id=["expanding", "singleBody"])]

    def parse_page(self, name, url):
        self.log.debug(url)
        soup = self.index_to_soup(url)
        entries = []
        for article in soup.findAll('article'):
            for link in article.findAll('a', href=True, title=True):
                today_prefix = "https://www.ilpost.it" + datetime.now().strftime('/%Y/%m/%d')
                yesterday_prefix = "https://www.ilpost.it" + yesterday.strftime('/%Y/%m/%d')
                # keep only articles whose URL is dated today or yesterday
                if not (link["href"].startswith(today_prefix) or link["href"].startswith(yesterday_prefix)):
                    continue
                title = link["title"]
                for prefix in prefixes:
                    if title.startswith(prefix):
                        # slice off the literal prefix (lstrip strips a character set)
                        title = title[len(prefix):]
                        break
                title = title.strip()
                entries.append({
                    "url": link["href"],
                    "title": title,
                })
        return (name, entries)

    def populate_article_metadata(self, article, soup, first):
        description = soup.find(attrs={"name": "description"})
        article.summary = description["content"] if description else "No meta description given"
        article.text_summary = description["content"] if description else "No meta description given"

    def parse_index(self):
        feeds = []
        for section in sections:
            feeds.append(self.parse_page(section[0], section[1]))
        return feeds

    def postprocess_html(self, soup, first):
        if convert_to_grayscale:
            #process all the images
            for tag in soup.findAll('img', src=True):
                iurl = tag['src']
                img = Image()
                img.open(iurl)
                if img < 0:
                    raise RuntimeError('Out of memory')
                img.type = "GrayscaleType"
                img.save(iurl)
        return soup
Old 04-15-2019, 03:39 AM   #8
AlessandroD92
Moreover, in order to have the articles perfectly organized in sections, I would recommend modifying the first line of sections.

The line to be modified is
Code:
("Prima Pagina", "https://www.ilpost.it/"),
The modification to be applied is
Code:
("Prima Pagina", "https://www.ilpost.it/prime-pagine"),
The former refers to the homepage, the latter to the section called "Prime Pagine". If we don't modify it, some articles will belong to two different sections at the same time, one of which is (wrongly) the homepage. All the newly published articles appear on the homepage, so when we run the recipe, all the new articles end up under "Prima Pagina", which is wrong in my opinion. In "Prima Pagina" we should only see the articles belonging to this section: https://www.ilpost.it/prime-pagine.

This adjustment is especially useful when you only want to download articles no older than one day. If you don't modify the section, almost all the downloaded articles will be placed in the "Prima Pagina" section.
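The overlap described above interacts with the recipe's ignore_duplicate_articles = {"title", "url"} setting: whichever section is parsed first claims the article. A rough first-wins sketch of that behaviour (illustrative only, not calibre's actual implementation; dedupe_feeds is a made-up name):

```python
def dedupe_feeds(feeds):
    """Keep each article only in the first section that lists it,
    matching on URL (a rough stand-in for ignore_duplicate_articles)."""
    seen = set()
    result = []
    for name, entries in feeds:
        kept = []
        for entry in entries:
            if entry["url"] in seen:
                continue  # already claimed by an earlier section
            seen.add(entry["url"])
            kept.append(entry)
        result.append((name, kept))
    return result

feeds = [
    ("Prima Pagina", [{"url": "https://www.ilpost.it/2019/04/15/a/", "title": "A"}]),
    ("Mondo", [{"url": "https://www.ilpost.it/2019/04/15/a/", "title": "A"},
               {"url": "https://www.ilpost.it/2019/04/15/b/", "title": "B"}]),
]
print([(name, [e["title"] for e in entries]) for name, entries in dedupe_feeds(feeds)])
# -> [('Prima Pagina', ['A']), ('Mondo', ['B'])]
```

Because the homepage is parsed first, it claims every fresh article; pointing the first entry at /prime-pagine avoids that.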
AlessandroD92 is offline   Reply With Quote