#1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2020
Location: India
Device: kindle Basic 8th generation
|
Calibre news recipe for fetching articles from Knappily.
#2 |
Newsbeamer dev
Posts: 123
Karma: 1000
Join Date: Dec 2011
Device: Kindle Voyage
|
[DELETED]
Last edited by duluoz; 06-09-2020 at 07:21 PM. |
#3 |
Member
Posts: 22
Karma: 10
Join Date: Nov 2011
Location: Australia
Device: Kindle 4
|
Hi,
I've made an attempt at this, but the HTML is tricky to parse correctly: everything gets wrapped in multiple <div> tags for the "what", "how", "why", "when" and "where" segments. I'll have another go after August, as I have a big exam coming up that I need to prepare for. |
#4 | |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2020
Location: India
Device: kindle Basic 8th generation
|
Quote:
|
#5 |
Member
Posts: 22
Karma: 10
Join Date: Nov 2011
Location: Australia
Device: Kindle 4
|
Hi,
I haven't forgotten; my exams are done for the moment. Taking another look, it seems each section ("how", "what", "why", "when", "where", "who") is populated using JavaScript calls. I'm trying to wrap my head around how to handle that in a recipe. Does anybody else here have any experience or ideas with that? |
#6 |
Member
Posts: 22
Karma: 10
Join Date: Nov 2011
Location: Australia
Device: Kindle 4
|
OK, this is working, although there are a couple of minor bugs that I haven't addressed, as I've spent as much time on it as I'd like to for now:
- Literal "\n" characters appear in the output and I haven't been able to remove them.
- There can be duplicate articles, as they sometimes appear under more than one feed, e.g. "Latest" and "Technology" (you could just remove the "Latest" feed by commenting it out if you prefer).
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

__license__ = 'GPL v3'
__copyright__ = '2020, Pat Stapleton <pat.stapleton at gmail.com>'
'''
Recipe for Knappily
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds import Feed
from calibre.ebooks.BeautifulSoup import BeautifulSoup


class Knappily(BasicNewsRecipe):
    title = 'Knappily'
    language = 'en'
    __author__ = 'Pat Stapleton'
    description = 'One-stop solution for all the major issues ranging from politics, economy, business, sports to technology and law to make people “a subject matter expert in 2 minutes”.'
    oldest_article = 7  # days
    max_articles_per_feed = 100
    publication_type = 'digital magazine'
    # if set to true will assume that all the article content is within the feed
    # (i.e. won't try to fetch more data)
    use_embedded_content = False

    feeds = [
        ('Latest Knapps', 'https://feeds.feedburner.com/knappily-latestknapps'),
        ('Sports', 'https://feeds.feedburner.com/knappily-sports'),
        ('Politics', 'https://feeds.feedburner.com/knappily-politics'),
        ('World', 'https://feeds.feedburner.com/knappily-world'),
        ('Society', 'https://feeds.feedburner.com/knappily-society'),
        ('Environment', 'https://feeds.feedburner.com/knappily-environment'),
        ('Business', 'https://feeds.feedburner.com/knappily-business'),
        ('Technology', 'https://feeds.feedburner.com/knappily-technology'),
        ('Budget', 'https://feeds.feedburner.com/knappily-budget'),
        ('On This Day', 'https://feeds.feedburner.com/knappily-onthisday'),
        ('Ethics', 'https://feeds.feedburner.com/knappily-ethics'),
        ('This!', 'https://feeds.feedburner.com/knappily-this'),
    ]

    # javascript loads the article data from another url to prevent scraping/parsing
    def get_article_url(self, article):
        url = article['link']
        article_id = url[url.rindex("/")+1:len(url)]
        raw_data_url = "https://services.knappily.com/article?id=" + article_id
        return raw_data_url

    def preprocess_html(self, soup):
        # run through sections and cleanup raw data
        article_list = soup.body.contents
        if(len(article_list) <= 1):
            self.abort_article()  # skip the strange empty single image articles
        idx = 0
        for article in article_list:
            if("_id" in article):
                del article_list[idx]  # unwanted item
            idx = idx + 1
        # cleanup and add intro to beginning of article
        # discard last item, but it also contains the title which we can grab
        article_title = article_list.pop().rsplit('"title":')[-1]
        # drop last 2 characters as they are closing curly braces
        article_title = article_title[0:len(article_title)-2]
        intro_section = article_list.pop()  # intro is 2nd last in the list, move it to the front
        article_list.insert(0, intro_section)
        heading_tag = soup.new_tag("h2")  # now add title to front
        heading_tag.string = article_title
        article_list.insert(0, heading_tag)
        return soup
|
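For anyone who wants to tackle the two minor bugs mentioned in this post, here is a rough sketch of one possible approach. It is not part of the posted recipe and has not been run against Knappily. The helper names `drop_duplicate_articles` and `strip_literal_newlines` are made up for illustration. The sketch assumes that each feed object exposes an `articles` list whose items have a `url` attribute, and that the stray "\n" text is a literal backslash followed by the letter n rather than a real line break.

```python
def drop_duplicate_articles(feeds):
    """Keep only the first occurrence of each article URL across all feeds.

    Assumption: every feed has an `articles` list and every article has a
    `url` attribute. Feeds are processed in order, so an article that shows
    up in both the first and a later feed survives only in the first one.
    """
    seen_urls = set()
    for feed in feeds:
        unique = []
        for article in feed.articles:
            if article.url in seen_urls:
                continue  # already kept under an earlier feed
            seen_urls.add(article.url)
            unique.append(article)
        feed.articles[:] = unique  # update the list in place
    return feeds


def strip_literal_newlines(text):
    """Replace the two-character sequence backslash + 'n' with a space.

    Assumption: the "\\n" seen in the output is an escape sequence that was
    never decoded, not an actual newline character.
    """
    return text.replace('\\n', ' ')


# Hypothetical hook inside the recipe class (untested sketch):
#
#     def parse_feeds(self):
#         feeds = BasicNewsRecipe.parse_feeds(self)
#         return drop_duplicate_articles(feeds)
```

Because the first feed wins, duplicates would stay under "Latest Knapps" and disappear from the topic feeds. Reversing the order of the `feeds` list (or iterating it in reverse in the helper) would give the opposite preference.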
#7 | |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2020
Location: India
Device: kindle Basic 8th generation
|
Quote:
|
#8 |
Member
Posts: 22
Karma: 10
Join Date: Nov 2011
Location: Australia
Device: Kindle 4
|
You're welcome! Hope you enjoy it.
|