#1 |
Connoisseur
Posts: 69
Karma: 10
Join Date: Oct 2012
Device: Kindle 3
|
Guardian recipe update for new site
The Guardian has revamped its website to match the new tabloid print version. The attached recipe files (one for daily, one for Saturday) seem to work well on the Kindle. I have not worked on the Observer yet.
Paddy, January 2018 |
#2 |
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I'm a little confused. Shouldn't there be just one recipe that changes according to whether it is a weekend or not? At least, that's the way I think the current recipe works.
|
#3 |
Connoisseur
Posts: 69
Karma: 10
Join Date: Oct 2012
Device: Kindle 3
|
"Weekend" is a supplement that appears only on Saturday, but it can be, and would be, downloaded every day. So my lazy way is to have a separate script! Ideally the single script would pull down the supplement only on Saturday, in the same way that it goes to the Observer site on Sunday.
Any improvements you can suggest would be most welcome from this amateur! |
#4 |
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Something like:
Code:
    if date.today().weekday() in (5, 6):
        feeds += self.parse_section('https://www.theguardian.com/theguardian/weekend', 'Weekend - ')
|
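For reference, `date.weekday()` numbers the days from Monday as 0, so 5 is Saturday and 6 is Sunday. Below is a minimal sketch of the Saturday-only selection described in post #3, written as plain Python so it runs without calibre. The helper name `supplements_for` and the `SATURDAY_SUPPLEMENTS` list are made up for illustration; only the Weekend URL and prefix come from this thread.

```python
from datetime import date

# (url, title prefix) pairs; the single entry is the one quoted in this thread.
SATURDAY_SUPPLEMENTS = [
    ('https://www.theguardian.com/theguardian/weekend', 'Weekend - '),
]


def supplements_for(day):
    """Return the supplement sections to fetch on the given date.

    Hypothetical helper: date.weekday() counts from Monday == 0,
    so Saturday is 5 and Sunday is 6.
    """
    if day.weekday() == 5:
        return list(SATURDAY_SUPPLEMENTS)
    return []
```

Inside a recipe's `parse_index`, each returned pair could then be passed to `self.parse_section(url, prefix)`. To fetch the supplement on Sunday as well, as the snippet in post #4 does, the test would become `day.weekday() in (5, 6)`.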
#5 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2018
Device: Linx 8 Windows tablet
|
I have an additional, though more basic, question about the Guardian feeds! Since the launch of the redesigned Guardian earlier this week, my Guardian feeds have been displaying several blank pages in each article after an introductory sentence and photo.
I have a few simple Guardian feeds created in epub format for my Windows 10 tablet running Calibre. They cover various themes such as 'Guardian opinion' or 'Guardian Football'. I also try the default Guardian feed in Calibre, but find that these more specific feeds are quicker to create. For the last few days, however, the articles in my custom Guardian feeds still appear and load, but they contain many blank pages. Is there a simple setting in Calibre to help with this? Many thanks for any help here! |
#6 |
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
There is no simple setting. You have to add code to the recipe, in advanced mode, to clean up the downloaded HTML.
|
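Cleanup in a recipe is usually expressed as tag filters, such as the `keep_only_tags` and `remove_tags` lists in the Guardian recipe posted later in this thread, where an attribute value can be matched by a callable. Below is a rough sketch of how such class matchers behave, in plain Python and independent of calibre. It assumes the matcher receives the whole class attribute as one string (or `None`), as the recipe's own lambdas do; the helper names `has_class` and `class_contains` are invented here.

```python
def has_class(name):
    """Match when `name` is one of the space-separated classes on a tag."""
    def matcher(value):
        # value is the raw class attribute, or None when the tag has no class
        return bool(value) and name in value.split()
    return matcher


def class_contains(fragment):
    """Looser match: `fragment` appears anywhere inside the class attribute."""
    def matcher(value):
        return bool(value) and fragment in value
    return matcher


# The sample class strings below are invented for illustration.
keep = has_class('content__main-column')
print(keep('content__main-column content__main-column--article'))  # exact token present
print(keep('content__main-column--article'))  # only a longer class name
print(keep(None))  # tag without a class attribute
```

The token form avoids accidentally matching longer class names that merely start with the same text, while the substring form is handy for suffix-style names such as `--twitter`.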
#7 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2018
Device: Linx 8 Windows tablet
|
|
#8 |
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
#9 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2017
Device: Kindle Paper White
|
I haven't looked at the submitted recipe (I will do tomorrow) but here is mine.
It autoswitches for Sunday's edition (The Observer).
Code:
    #!/usr/bin/env python2
    __license__ = 'GPL v3'
    __copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
    __docformat__ = 'restructuredtext en'

    '''
    www.guardian.co.uk
    '''
    from calibre.web.feeds.news import BasicNewsRecipe
    from datetime import date


    class Guardian(BasicNewsRecipe):

        title = u'The Guardian'
        if date.today().weekday() == 6:
            title = u'The Observer'
            base_url = "http://www.guardian.co.uk/theobserver"
            cover_url = 'https://i.guim.co.uk/img/media/ec57a66a548b748cd586a8b63927aad0b167b80a/0_0_642_798/master/642.jpg?w=300&q=55&auto=format&usm=12&fit=max&'
            masthead_url = 'http://static.guim.co.uk/sys-images/Guardian/Pix/site_furniture/2010/10/19/1287478087992/The-Observer-001.gif'
        else:
            base_url = "http://www.guardian.co.uk/theguardian"
            # cover_pic = 'Guardian digital edition'
            # masthead_url = 'http://static.guim.co.uk/static/f76b43f9dcfd761f0ecf7099a127b603b2922118/common/images/logos/the-guardian/titlepiece.gif'
            cover_url = 'https://i.guim.co.uk/img/media/0dcdddf037927063ea4f420e8d5baecece39d5a4/0_0_1128_1403/master/1128.png?w=700&q=55&auto=format&usm=12&fit=max&'
            # masthead_url = 'https://assets.guim.co.uk/images/eada8aa27c12fe2d5afa3a89d3fbae0d/fallback-logo.png'
            masthead_url = 'http://www.logo-designer.co/wp-content/uploads/2018/01/2018-The-Guardian-logo-design.png'

        __author__ = 'Kovid Goyal'
        language = 'en_GB'

        oldest_article = 1
        max_articles_per_feed = 300
        remove_javascript = True
        encoding = 'utf-8'
        remove_empty_feeds = True
        no_stylesheets = True
        remove_attributes = ['style']
        ignore_duplicate_articles = {'title', 'url'}

        timefmt = ' [%a, %d %b %Y]'

        keep_only_tags = [
            dict(attrs={'class': lambda x: x and 'content__main-column' in x.split()}),
        ]
        remove_tags = [
            dict(attrs={'class': lambda x: x and '--twitter' in x}),
            dict(attrs={'class': lambda x: x and 'submeta' in x.split()}),
            dict(attrs={'data-component': ['share', 'social']}),
            dict(attrs={'data-link-name': 'block share'}),
            dict(attrs={'class': lambda x: x and 'inline-expand-image' in x}),
            dict(attrs={'class': lambda x: x and 'modern-visible' in x.split()}),
            dict(name=['link', 'meta', 'style']),
        ]
        remove_tags_after = [
            dict(attrs={'class': lambda x: x and 'content__article-body' in x.split()}),
        ]

        def preprocess_raw_html(self, raw, url):
            import html5lib
            from lxml import html
            return html.tostring(html5lib.parse(raw, namespaceHTMLElements=False, treebuilder='lxml'), encoding=unicode)

        def preprocess_html(self, soup):
            for img in soup.findAll('img', srcset=True):
                img['src'] = img['srcset'].partition(' ')[0]
                img['srcset'] = ''
            return soup

        def parse_section(self, url, title_prefix=''):
            feeds = []
            soup = self.index_to_soup(url)
            for section in soup.findAll('section'):
                title = title_prefix + self.tag_to_string(section.find(
                    attrs={'class': 'fc-container__header__title'})).strip().capitalize()
                self.log('\nFound section:', title)
                feeds.append((title, []))
                for li in section.findAll('li'):
                    for a in li.findAll('a', attrs={'data-link-name': 'article'}, href=True):
                        title = self.tag_to_string(a).strip()
                        url = a['href']
                        self.log(' ', title, url)
                        feeds[-1][1].append({'title': title, 'url': url})
                        break
            return feeds

        def parse_index(self):
            feeds = self.parse_section(self.base_url)
            if date.today().weekday() == 5:
                feeds += self.parse_section(
                    'https://www.theguardian.com/theguardian/family', 'Family - ')
                feeds += self.parse_section(
                    'https://www.theguardian.com/theguardian/guardianreview', 'Guardian Review - ')
                feeds += self.parse_section(
                    'https://www.theguardian.com/theguardian/weekend', 'Weekend Magazine - ')
                feeds += self.parse_section(
                    'https://www.theguardian.com/theguardian/theguide', 'The Guide - ')
            else:
                if date.today().weekday() == 6:
                    feeds += self.parse_section(
                        'https://www.theguardian.com/theobserver/new-review', 'New Review ')
                    feeds += self.parse_section(
                        'https://www.theguardian.com/theobserver/news/comment', 'Comment ')
                    feeds += self.parse_section(
                        'https://www.theguardian.com/theobserver/magazine', 'Observer Magazine ')
                else:
                    feeds += self.parse_section(
                        'https://www.theguardian.com/tone/obituaries/all', 'Obituaries - ')
                    feeds += self.parse_section(
                        'https://www.theguardian.com/uk/commentisfree', 'Editorial - ')
                    feeds += self.parse_section(
                        'https://www.theguardian.com/theguardian/g2', 'G2 - ')
                    feeds += self.parse_section(
                        'https://www.theguardian.com/uk/sport', 'Sport - ')
            return feeds

Last edited by PeterT; 02-11-2018 at 04:50 PM. Reason: added [code] / [/code] wrapper |
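The `preprocess_html` step in the recipe copies the first candidate from each image's `srcset` into `src`. Below is a small sketch of just that string handling, in plain Python without calibre. The function name `first_srcset_url` is hypothetical and the example.com addresses are placeholders, not real Guardian image URLs.

```python
def first_srcset_url(srcset):
    """Return the first candidate URL from an img srcset value."""
    # Same expression as the recipe: everything before the first space
    return srcset.partition(' ')[0]


# example.com URLs are placeholders, not real Guardian image addresses.
sample = 'https://example.com/small.jpg 300w, https://example.com/large.jpg 620w'
print(first_srcset_url(sample))
```

Taking the first candidate is simple, but whether that entry is the smallest or the largest image depends on how the site orders its `srcset`, so it may be worth checking the downloaded pages if image quality matters.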