Old 06-08-2020, 02:58 AM   #1
patoliadixit
Junior Member
patoliadixit began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Jun 2020
Location: India
Device: kindle Basic 8th generation
Calibre news recipe for fetching articles from Knappily

I tried to fetch news from this RSS feed: http://feeds.feedburner.com/knappily-latestknapps. But the fetched articles were empty. When I looked at the page source of the articles, I found that it was not normal HTML; it was mainly <script> elements. I don't know much about website development or programming, so please help me fetch news articles from this site: https://knappily.com/rss
Old 06-09-2020, 07:17 PM   #2
duluoz
Newsbeamer dev
duluoz can extract oil from cheese
 
Posts: 123
Karma: 1000
Join Date: Dec 2011
Device: Kindle Voyage
[DELETED]

Last edited by duluoz; 06-09-2020 at 07:21 PM.
Old 06-12-2020, 08:47 PM   #3
PatStapleton
Member
PatStapleton began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Nov 2011
Location: Australia
Device: Kindle 4
Hi,

I've made an attempt at this, but the HTML is tricky to parse correctly: everything gets wrapped in multiple <div> tags for the "what", "how", "why", "when" and "where" segments.
I'll have another go after August, as I have a big exam coming up that I need to prepare for.
Old 06-12-2020, 09:22 PM   #4
patoliadixit
Thank you very much for helping me. I can't express how happy I am. If possible, please share the recipe you have made so far. Once again, thank you very much, and best of luck with your exams.
Old 08-19-2020, 03:55 AM   #5
PatStapleton
Hi,

I haven't forgotten; my exams are done for the moment.

Taking another look, it seems each section ("how", "what", "why", "when", "where", "who") is populated using JavaScript calls.

I'm trying to wrap my head around how to handle that in a recipe. Does anybody else here have any experience or ideas with that?
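
A common way to handle pages that are filled in by JavaScript is to fetch the data the script itself requests and build a static page from it. A minimal sketch of that idea follows, assuming the data arrives as JSON. The payload shape, the key names (`title`, `sections`) and the helper `build_article_html` are all hypothetical here, not Knappily's actual API, so the real response would need to be inspected first.

```python
import json
from html import escape

# Section names mentioned in this thread; the order used here is an assumption.
SECTION_ORDER = ["what", "why", "when", "where", "who", "how"]


def build_article_html(raw_json):
    """Turn a JSON payload into one static HTML page.

    The payload shape ({"title": ..., "sections": {...}}) is hypothetical;
    check the real response before relying on any key name.
    """
    data = json.loads(raw_json)
    parts = ["<h2>{}</h2>".format(escape(data.get("title", "")))]
    sections = data.get("sections", {})
    for name in SECTION_ORDER:
        body = sections.get(name)
        if body:  # skip sections the payload does not provide
            parts.append("<h3>{}</h3>".format(name.capitalize()))
            parts.append(body)  # assumed to already be an HTML fragment
    return "<html><body>{}</body></html>".format("".join(parts))


# Made-up sample payload, only to show the call.
sample = json.dumps({
    "title": "Example & title",
    "sections": {
        "what": "<p>What happened.</p>",
        "why": "<p>Why it matters.</p>",
    },
})
page = build_article_html(sample)
```

In a recipe, the same idea would mean pointing calibre at the data URL rather than the visible page, then tidying the result before it is converted.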
Old 08-20-2020, 06:00 AM   #6
PatStapleton

OK, this is working, although there are a couple of minor bugs that I haven't addressed, as I've spent as much time on this as I'd like to for now:
- "\n" characters appear in the output, and I haven't been able to remove them
- there can be duplicate articles, as they sometimes appear under more than one feed, e.g. "Latest" and "Technology" (you could just remove the "Latest" feed by commenting it out if you prefer)

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
__license__ = 'GPL v3'
__copyright__ = '2020, Pat Stapleton <pat.stapleton at gmail.com>'
'''
Recipe for Knappily
'''
from calibre.web.feeds.news import BasicNewsRecipe

class Knappily(BasicNewsRecipe):
    title          = 'Knappily'
    language       = 'en'
    __author__     = 'Pat Stapleton'
    description = 'One-stop solution for all the major issues ranging from politics, economy, business, sports to technology and law  to make people “a subject matter expert in 2 minutes”.'
    oldest_article = 7 #days
    max_articles_per_feed = 100
    publication_type = 'digital magazine'

    use_embedded_content = False # if set to true will assume that all the article content is within the feed (i.e. won't try to fetch more data)

    feeds          = [
        ('Latest Knapps', 'https://feeds.feedburner.com/knappily-latestknapps'),
        ('Sports', 'https://feeds.feedburner.com/knappily-sports'),
        ('Politics', 'https://feeds.feedburner.com/knappily-politics'),
        ('World', 'https://feeds.feedburner.com/knappily-world'),
        ('Society', 'https://feeds.feedburner.com/knappily-society'),
        ('Environment', 'https://feeds.feedburner.com/knappily-environment'),
        ('Business', 'https://feeds.feedburner.com/knappily-business'),
        ('Technology', 'https://feeds.feedburner.com/knappily-technology'),
        ('Budget', 'https://feeds.feedburner.com/knappily-budget'),
        ('On This Day', 'https://feeds.feedburner.com/knappily-onthisday'),
        ('Ethics', 'https://feeds.feedburner.com/knappily-ethics'),
        ('This!', 'https://feeds.feedburner.com/knappily-this'),
    ]

    # The site's JavaScript loads the article data from a separate URL,
    # so point calibre at that URL instead of the page linked in the feed.
    def get_article_url(self, article):
        url = article['link']
        article_id = url.rsplit('/', 1)[-1]  # last path segment is the article id
        return 'https://services.knappily.com/article?id=' + article_id

    def preprocess_html(self, soup):
        # The raw data arrives as a flat list of nodes inside <body>.
        article_list = soup.body.contents
        if len(article_list) <= 1:
            self.abort_article()  # skip the strange empty single-image articles

        # Drop unwanted "_id" items. Slice assignment filters the list in place,
        # which avoids skipping entries by deleting while iterating.
        article_list[:] = [node for node in article_list if '_id' not in node]

        # The last item is discarded, but it also contains the title, which we grab.
        article_title = article_list.pop().rsplit('"title":')[-1]
        article_title = article_title[:-2]  # drop the two closing curly braces
        # The intro is now last in the list; move it to the front.
        intro_section = article_list.pop()
        article_list.insert(0, intro_section)
        heading_tag = soup.new_tag('h2')  # now add the title in front of the intro
        heading_tag.string = article_title
        article_list.insert(0, heading_tag)
        return soup
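
The two known issues listed before the code might be worked around along these lines. This is only a sketch: it assumes the stray "\n" are literal backslash-n sequences left over from the raw data rather than real line breaks, and `strip_literal_newlines` and `drop_duplicate_articles` are hypothetical helpers that work on plain Python data, not part of calibre or of the recipe. For duplicates specifically, calibre's BasicNewsRecipe also has an `ignore_duplicate_articles` attribute (for example `{'title', 'url'}`), which may be the simpler route.

```python
def strip_literal_newlines(text):
    """Replace literal backslash-n sequences (two characters) with a space."""
    # Assumption: the unwanted "\n" are escape sequences from the raw data,
    # not real line breaks, so real newlines are left untouched.
    return text.replace("\\n", " ")


def drop_duplicate_articles(feeds):
    """Keep only the first occurrence of each article URL across all feeds.

    `feeds` is a list of (feed_title, articles) pairs, where each article is
    a dict with at least a "url" key. This plain-data shape is an assumption
    made for illustration; calibre's own Feed objects differ.
    """
    seen_urls = set()
    deduplicated = []
    for feed_title, articles in feeds:
        kept = []
        for article in articles:
            if article["url"] not in seen_urls:
                seen_urls.add(article["url"])
                kept.append(article)
        deduplicated.append((feed_title, kept))
    return deduplicated


# Made-up example data: the same article appears under two feeds.
feeds = [
    ("Latest Knapps", [{"title": "A", "url": "https://example.com/a"}]),
    ("Technology", [
        {"title": "A", "url": "https://example.com/a"},
        {"title": "B", "url": "https://example.com/b"},
    ]),
]
cleaned = drop_duplicate_articles(feeds)
```

Because feeds are processed in order, an article is kept under the first feed that lists it, so with "Latest Knapps" first the copy under "Technology" would be the one dropped.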
Old 09-18-2020, 10:34 PM   #7
patoliadixit
Thank you very much!!! I spent a lot of time on this recipe but was unable to get it working. But you really did it!!!! Thanks again....
Old 09-23-2020, 06:10 AM   #8
PatStapleton
You're welcome! Hope you enjoy it.