#1 |
Guru
Posts: 615
Karma: 85520
Join Date: May 2021
Device: kindle
Help: outlook magazine India
The present recipe doesn't work anymore.
New recipe: Code:

    import json
    from calibre.web.feeds.news import BasicNewsRecipe, classes


    class outlook(BasicNewsRecipe):
        title = 'Outlook Magazine'
        __author__ = 'unkn0wn'
        description = ''
        language = 'en_IN'
        use_embedded_content = False
        no_stylesheets = True
        remove_javascript = True
        remove_attributes = ['height', 'width', 'style']
        ignore_duplicate_articles = {'url'}

        def parse_index(self):
            soup = self.index_to_soup('https://www.outlookindia.com/magazine/archive')
            issue = soup.find(**classes('issue_listing'))
            a = issue.find('a', href=lambda x: x and x.startswith('/magazine/issue/'))
            url = a['href']
            self.log('Downloading issue:', url)
            self.cover_url = a.find('img', attrs={'src': True})['src']
            soup = self.index_to_soup('https://www.outlookindia.com' + url)
            ans = []

            for h3 in soup.findAll(['h3', 'h4'], attrs={'class': 'tk-kepler-std-condensed-subhead'}):
                a = h3.find('a', href=lambda x: x)
                url = a['href']
                title = self.tag_to_string(a)
                desc = h3.find_next_sibling('p')
                desc = self.tag_to_string(desc)
                self.log('\t\tFound article:', title)
                self.log('\t\t\t', url)
                self.log('\t\t\t\t', desc)
                ans.append({'title': title, 'url': url, 'description': desc})
            return [('Articles', ans)]

    # Separate fragment: locating the JSON-LD script in an article page
    soup = self.index_to_soup(raw)
    script = soup.find('script', type="application/ld+json")

Example JSON from Outlook (save as .json).
I think it's really simple JSON, but I don't know how to extract it and convert it to HTML.
Last edited by unkn0wn; 04-30-2022 at 04:04 AM.
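A minimal sketch of the extract-and-convert step asked about above, using only the Python standard library. It assumes the JSON-LD block follows the schema.org article shape with `headline` and `articleBody` keys; the actual Outlook JSON is not shown in this thread, so those key names, the `LdJsonExtractor` class, and the `ld_json_to_html` function are all illustrative assumptions rather than the recipe's real code.

```python
import json
from html import escape
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self._in_ld_json = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld_json:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._in_ld_json:
            self.blocks.append(''.join(self._buf))
            self._in_ld_json = False


def ld_json_to_html(raw):
    """Build a bare HTML page from the first JSON-LD block that has an articleBody."""
    parser = LdJsonExtractor()
    parser.feed(raw)
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue  # not valid JSON, try the next block
        # Assumed keys: 'headline' and 'articleBody' (schema.org article shape)
        if isinstance(data, dict) and data.get('articleBody'):
            title = escape(data.get('headline', ''))
            paras = ''.join(
                '<p>{}</p>'.format(escape(p.strip()))
                for p in data['articleBody'].split('\n') if p.strip()
            )
            return '<html><body><h1>{}</h1>{}</body></html>'.format(title, paras)
    return raw  # nothing usable found: leave the page untouched


# Made-up sample page, only to exercise the function
sample = '''<html><head><script type="application/ld+json">
{"@type": "NewsArticle", "headline": "A & B", "articleBody": "First para.\\nSecond para."}
</script></head><body>paywalled</body></html>'''
print(ld_json_to_html(sample))
```

In a recipe, a function like this could plausibly be called from `preprocess_raw_html(self, raw, url)`, which receives the page source before calibre parses it. Whether the body text is split on newlines, as assumed here, depends on the real JSON.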
#2 |
creator of calibre
Posts: 45,330
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
#3 |
Guru
#4 |
Guru
Financial Times has feeds, and the JSON is very similar to the one above. Code:

    import json, re
    from calibre.web.feeds.news import BasicNewsRecipe


    class ft(BasicNewsRecipe):
        title = 'Financial Times'
        language = 'en'
        __author__ = "Kovid Goyal"
        description = 'The Financial Times is one of the world’s leading news organisations, recognised internationally for its authority, integrity and accuracy.'
        oldest_article = 1.5
        max_articles_per_feed = 50
        no_stylesheets = True
        remove_javascript = True
        ignore_duplicate_articles = {'url'}
        remove_attributes = ['style', 'width', 'height']

        def get_cover_url(self):
            soup = self.index_to_soup('https://www.todayspapers.co.uk/the-financial-times-front-page-today/')
            tag = soup.find('div', attrs={'class': 'elementor-image'})
            if tag:
                self.cover_url = tag.find('img')['src']
            return getattr(self, 'cover_url', self.cover_url)

        feeds = [
            ('World', 'https://www.ft.com/world?format=rss'),
            ('US', 'https://www.ft.com/world?format=rss'),
            ('Companies', 'https://www.ft.com/companies?format=rss'),
            ('Tech', 'https://www.ft.com/technology?format=rss'),
            ('Markets', 'https://www.ft.com/companies?format=rss'),
            ('Climate', 'https://www.ft.com/climate-capital?format=rss'),
            ('Opinion', 'https://www.ft.com/opinion?format=rss'),
            ('Life & Arts', 'https://www.ft.com/life-arts?format=rss'),
            ('how to spend it', 'https://www.ft.com/htsi?format=rss'),
        ]

        calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
#5 |
creator of calibre
#6 |
Guru
Thanks. I'm sorry, but I made an error in the feed links.
The US feed is supposed to be https://www.ft.com/us?format=rss instead of 'world', and the Markets feed should be https://www.ft.com/markets?format=rss. I didn't notice this before as I was just trying to make the JSON-to-HTML conversion work.
Last edited by unkn0wn; 05-01-2022 at 12:01 PM.
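With those two corrections applied, the `feeds` list from the FT recipe would read as below. The `duplicate_feed_urls` helper is hypothetical (it is not part of calibre); it is only a quick way to catch the kind of copy-paste slip described here, where two sections point at the same URL.

```python
from collections import Counter

# Feed list from the FT recipe with the US and Markets URLs corrected
feeds = [
    ('World', 'https://www.ft.com/world?format=rss'),
    ('US', 'https://www.ft.com/us?format=rss'),
    ('Companies', 'https://www.ft.com/companies?format=rss'),
    ('Tech', 'https://www.ft.com/technology?format=rss'),
    ('Markets', 'https://www.ft.com/markets?format=rss'),
    ('Climate', 'https://www.ft.com/climate-capital?format=rss'),
    ('Opinion', 'https://www.ft.com/opinion?format=rss'),
    ('Life & Arts', 'https://www.ft.com/life-arts?format=rss'),
    ('how to spend it', 'https://www.ft.com/htsi?format=rss'),
]


def duplicate_feed_urls(feed_list):
    """Return, sorted, every URL that appears under more than one feed title."""
    counts = Counter(url for _title, url in feed_list)
    return sorted(url for url, n in counts.items() if n > 1)


print(duplicate_feed_urls(feeds))
```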
#7 |
Guru
It wouldn't matter if you don't make these changes.
In the Outlook recipe, adding a description:

    desc = h3.find_next_sibling('p')
    desc = self.tag_to_string(desc)
    ans.append({'title': title, 'url': url, 'description': desc})

In the FT recipe:

    masthead_url = 'https://im.ft-static.com/m/img/masthead_main.jpg'

And maybe put the Opinion feed before the World feed. Why remove embedded images?
Last edited by unkn0wn; 05-02-2022 at 03:22 AM.
#8 |
Guru
FT Print Edition
https://www.ft.com/todaysnewspaper/
There's also a UK edition; the edition might load automatically based on region. I just changed the feeds part of the recipe to parse the articles from the print page instead. I never thought to look for this page before. The print edition has far fewer articles than the feeds. The cover_url is for the UK edition, and the UK edition has more sections and more articles, like FT Big Read, which is missing in the international edition. Maybe change the soup link to the UK edition (it has all the international articles), and change the NoArticles text to 'The Financial Times Newspaper is not published on Sundays.'
Last edited by unkn0wn; 05-03-2022 at 07:59 AM.
#9 |
Guru
after small changes..
#10 |
Guru
I found that the Outlook issue listed on the archive page isn't the latest (it's a week older).
I changed the recipe to find the latest issue: https://github.com/kovidgoyal/calibr...k_india.recipe (changes to lines 16-24). Code:

    def parse_index(self):
        soup = self.index_to_soup('https://www.outlookindia.com/')
        a = soup.find('a', href=lambda x: x and x.startswith('/magazine/issue/'))
        url = a['href']
        self.log('Downloading issue:', url)
        soup = self.index_to_soup('https://www.outlookindia.com' + url)
        cover = soup.find(**classes('listingPage_lead_story'))
        self.cover_url = cover.find('img', attrs={'src': True})['src']
        ans = []
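The lookup in that snippet (take the first link on the home page whose href starts with `/magazine/issue/`, then prepend the site root) can be sketched outside calibre with the standard library. The `FirstIssueLink` class, the `latest_issue_url` function, and the sample markup with issue id `123` are made up for illustration. The explicit error is an added guard: in the recipe, `a['href']` would fail with a less helpful `TypeError` if no such link were found.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = 'https://www.outlookindia.com/'


class FirstIssueLink(HTMLParser):
    """Remember the first <a> whose href starts with the given prefix."""

    def __init__(self, prefix='/magazine/issue/'):
        super().__init__()
        self.prefix = prefix
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and self.href is None:
            href = dict(attrs).get('href')
            if href and href.startswith(self.prefix):
                self.href = href


def latest_issue_url(home_html):
    """Return the absolute URL of the first issue link found in the home page HTML."""
    finder = FirstIssueLink()
    finder.feed(home_html)
    if finder.href is None:
        raise ValueError('no /magazine/issue/ link found on the home page')
    return urljoin(BASE, finder.href)


# Made-up markup, only to exercise the function
sample_home = '<a href="/about">About</a><a href="/magazine/issue/123">Latest issue</a>'
print(latest_issue_url(sample_home))
```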
#11 |
Guru
Outlook
It turns out the latest edition loads all articles without the need to extract them from JSON, while the previous editions from the archive page needed a subscription. (There are no image links in the JSON, while the normal page loads images.)
So I just commented out the JSON code for future use and changed other stuff. https://github.com/kovidgoyal/calibr...k_india.recipe
There's another monthly magazine, Outlook Business, which requires the exact same code with different links, and another recipe for Business Today Magazine (somewhat similar to India Today).