MobileRead Forums - View Single Post - Calibre news recipe fetching articles from knappily.

PatStapleton · 08-20-2020, 07:00 AM

OK, this is working, although there are a couple of minor bugs that I haven't addressed, as I've spent as much time on it as I'd like to for now:
- Literal "\n" characters appear in the article text, and I haven't been able to remove them.
- Articles can be duplicated, as they sometimes appear under more than one feed, e.g. "Latest Knapps" and "Technology" (if you prefer, just remove the "Latest Knapps" feed by commenting it out).
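
Both issues could probably be handled in a few lines. The sketch below is untested against the live feeds and rests on two assumptions: that the stray "\n" is a literal backslash followed by "n" (left over from escaping in the raw data) rather than a real newline, and that duplicate articles share the same URL. The helper names strip_literal_newlines and drop_duplicates are made up for illustration, and the feeds are modelled as plain tuples rather than calibre objects. Calibre also has a built-in ignore_duplicate_articles recipe attribute (e.g. ignore_duplicate_articles = {'url'}), which may make the second helper unnecessary.

```python
def strip_literal_newlines(text):
    """Replace literal backslash-n sequences (two characters) with a space."""
    # Assumption: the unwanted "\n" is the two-character escape sequence
    # from the raw data, not a real newline character.
    return text.replace('\\n', ' ')


def drop_duplicates(feeds):
    """Keep only the first occurrence of each URL across all feeds.

    `feeds` is a list of (feed_title, [article_url, ...]) tuples, a plain
    stand-in for calibre's feed objects.
    """
    seen = set()
    result = []
    for feed_title, urls in feeds:
        unique = []
        for url in urls:
            if url not in seen:
                seen.add(url)
                unique.append(url)
        result.append((feed_title, unique))
    return result
```

In this sketch feed order matters: with "Latest Knapps" listed first, a shared article stays there and is dropped from the topic feed. Moving "Latest Knapps" to the end of the list would do the opposite.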

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
__license__ = 'GPL v3'
__copyright__ = '2020, Pat Stapleton <pat.stapleton at gmail.com>'
'''
Recipe for Knappily
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds import Feed
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class Knappily(BasicNewsRecipe):
    title          = 'Knappily'
    language       = 'en'
    __author__     = 'Pat Stapleton'
    description = 'One-stop solution for all the major issues ranging from politics, economy, business, sports to technology and law to make people “a subject matter expert in 2 minutes”.'
    oldest_article = 7  # days
    max_articles_per_feed = 100
    publication_type = 'digital magazine'

    # If set to True, calibre assumes the whole article is embedded in the feed
    # and won't try to fetch any more data.
    use_embedded_content = False

    feeds          = [
        ('Latest Knapps', 'https://feeds.feedburner.com/knappily-latestknapps'),
        ('Sports', 'https://feeds.feedburner.com/knappily-sports'),
        ('Politics', 'https://feeds.feedburner.com/knappily-politics'),
        ('World', 'https://feeds.feedburner.com/knappily-world'),
        ('Society', 'https://feeds.feedburner.com/knappily-society'),
        ('Environment', 'https://feeds.feedburner.com/knappily-environment'),
        ('Business', 'https://feeds.feedburner.com/knappily-business'),
        ('Technology', 'https://feeds.feedburner.com/knappily-technology'),
        ('Budget', 'https://feeds.feedburner.com/knappily-budget'),
        ('On This Day', 'https://feeds.feedburner.com/knappily-onthisday'),
        ('Ethics', 'https://feeds.feedburner.com/knappily-ethics'),
        ('This!', 'https://feeds.feedburner.com/knappily-this'),
    ]

    # The site loads the article data with JavaScript from another URL
    # (apparently to prevent scraping/parsing), so point calibre at that URL.
    def get_article_url(self, article):
        url = article['link']
        article_id = url.rsplit('/', 1)[-1]  # everything after the last "/"
        return "https://services.knappily.com/article?id=" + article_id

    def preprocess_html(self, soup):
        # Run through the sections and clean up the raw data
        body = soup.body
        if len(body.contents) <= 1:
            self.abort_article()  # skip the strange, empty single-image articles

        # Drop unwanted items. Iterate over a copy, because removing nodes
        # while looping over body.contents would skip the following node.
        for node in list(body.contents):
            if "_id" in node:
                node.extract()

        # Clean up and add the intro to the beginning of the article.
        # The last item is discarded, but it also contains the title, which we can grab.
        article_title = body.contents[-1].extract().rsplit('"title":')[-1]
        article_title = article_title[:-2]  # drop the last 2 characters (closing curly braces)
        # The intro is now the last item in the list; move it to the front
        intro_section = body.contents[-1].extract()
        body.insert(0, intro_section)
        # Now add the title in front of the intro
        heading_tag = soup.new_tag("h2")
        heading_tag.string = article_title
        body.insert(0, heading_tag)
        return soup
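
To check the URL rewriting in get_article_url without running calibre, the same logic can be pulled out into a plain function. This is only a sketch: raw_data_url is a hypothetical helper name and the example link is made up. Only the services.knappily.com endpoint comes from the recipe above.

```python
def raw_data_url(link):
    """Map an article page link to the URL that serves its raw data."""
    # The article ID is whatever follows the last "/" in the link.
    article_id = link.rsplit('/', 1)[-1]
    return 'https://services.knappily.com/article?id=' + article_id


# Hypothetical link, for illustration only.
print(raw_data_url('https://example.com/some-section/12345'))
# prints https://services.knappily.com/article?id=12345
```

Note that a link ending in "/" would give an empty ID, so a real feed entry of that shape would need extra handling.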