10-15-2020, 12:24 AM   #1
gourav
Creating a recipe for theprint.in

I was creating a recipe for theprint.in, a relatively new but high-quality news website from India. I started with a fully automated recipe and then began customizing it.

What I mainly needed was the remove_tags and auto_cleanup_keep functionality. remove_tags worked right away, but I can't get auto_cleanup_keep to work.

I'm trying to keep the author's name, the publication time, and the post's subtitle, all of which the auto cleanup algorithm removes by default. Can anyone help me make this work?

I'm using Calibre version 5.2.0. Here's the recipe:
Code:
#!/usr/bin/env python3
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
import time  # required by time.strftime()/time.localtime() in the title below
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1600702839(BasicNewsRecipe):
    title          = f'The Print - {time.strftime("%d %b, %Y", time.localtime())}'
    description    = "News from The Print, an independent, digital only media outlet"
    publication_type = 'newspaper'
    language       = 'en'
    oldest_article = 2
    max_articles_per_feed = 15
    auto_cleanup_keep = '//div[@class="td-module-meta-info"]|'\
                        '//h2[@class="td-post-sub-title"]|'\
                        '//a[@class="author url fn"]|'\
                        '//span[@class="update_date"]'
    auto_cleanup   = True
    ignore_duplicate_articles = {'url'}
    
    remove_tags    = [dict(name='div', attrs={'class':['post_contribute', 'code-block code-block-11']}),
                     dict(attrs={'class': 'fontsize_Btn'}),
                     dict(name='p', attrs={'class': 'postBtm'}),
                     dict(name='em'), dict(name='hr'), dict(name='button')]

    feeds          = [
        ('Politics', 'https://theprint.in/category/politics/feed/'),
        ('Governance', 'https://theprint.in/category/india/governance/feed/'),
        ('Economy', 'https://theprint.in/category/economy/feed'),
        ('India', 'https://theprint.in/category/india/feed'),
        ('Opinion', 'https://theprint.in/category/opinion/feed'),
        ('Defence', 'https://theprint.in/category/defence/feed'),
        ('Science', 'https://theprint.in/category/science/feed/'),
        ('Tech', 'https://theprint.in/category/tech/feed/'),
        ('Education', 'https://theprint.in/category/india/education/feed/'),
        ('National Interest', 'https://theprint.in/category/national-interest/feed/'),
        ('50-word Edit', 'https://theprint.in/category/50-word-edit/feed/'),
        ('Ilanomics', 'https://theprint.in/ilanomics/feed/'),
        ('Diplomacy', 'https://theprint.in/category/diplomacy/feed/'),
        ('Features', 'https://theprint.in/category/features/feed/')
    ]
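
One thing that may be worth checking, offered only as a sketch: in XPath, a predicate like [@class="td-post-sub-title"] compares against the whole class attribute, so it will not match an element that carries additional classes. Whether that is what is happening on theprint.in is not verified here. The markup in SAMPLE is made up for illustration, class_token_xpath is a hypothetical helper, and the standard library's ElementTree is used as a stand-in, not calibre's own parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, NOT copied from theprint.in: the <h2> carries an
# extra class next to the one used in the recipe's XPath expression.
SAMPLE = """
<html><body>
  <h2 class="td-post-sub-title extra-class">Subtitle</h2>
  <span class="update_date">15 October, 2020</span>
</body></html>
""".strip()


def class_token_xpath(tag, token):
    """Build an XPath 1.0 expression that matches `tag` elements whose
    space-separated class list contains `token` (hypothetical helper)."""
    return ('//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]'
            % (tag, token))


root = ET.fromstring(SAMPLE)

# Exact comparison: the attribute value must equal the string as a whole.
exact_h2 = root.findall('.//h2[@class="td-post-sub-title"]')
exact_span = root.findall('.//span[@class="update_date"]')

# Token comparison done in plain Python, because ElementTree's limited
# XPath subset does not implement contains().
token_h2 = [el for el in root.iter('h2')
            if 'td-post-sub-title' in el.get('class', '').split()]

print(len(exact_h2), len(exact_span), len(token_h2))
print(class_token_xpath('h2', 'td-post-sub-title'))
```

With this sample, the exact-match query finds no h2, while the span (whose class attribute is exactly update_date) and the token-based check each find one element. If the real page does use multi-class attributes, the expression printed on the last line could be tried in auto_cleanup_keep in place of the exact-match form.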