Guardian recipe update for new site

paddyrm · 01-16-2018, 06:17 PM

The Guardian has revamped its website to map to the new tabloid print version. The attached recipe files (one for daily, one for Saturday) seem to work well on the Kindle. I have not worked on the Observer yet.
Paddy, January 2018

kovidgoyal · 01-16-2018, 11:48 PM

I'm a little confused. Shouldn't there be just one recipe that changes according to whether it is a weekend or not? At least, that's the way I think the current recipe works.

paddyrm · 01-17-2018, 08:30 AM

"weekend" is a supplement that only appears on Saturday but can and would be downloaded every day. So my lazy way is to have a separate script! Ideally the single script would only pull down the supplement on Saturday in the same way that it goes to the Observer site on Sunday.
Any improvements you can suggest would be most welcome from this amateur!

kovidgoyal · 01-17-2018, 08:38 AM

Something like:

Code:

if date.today().weekday() in (5, 6):
   feeds += self.parse_section('https://www.theguardian.com/theguardian/weekend', 'Weekend - ')

should do the trick
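As a rough, standalone illustration of that weekday check: `date.weekday()` counts Monday as 0, so Saturday is 5 and Sunday is 6. The helper name `extra_sections` is hypothetical and not part of calibre; this variant tests for Saturday only, matching the description of the supplement above, and the test can be widened to `(5, 6)` if Sunday should be included.

```python
from datetime import date

SATURDAY, SUNDAY = 5, 6  # date.weekday() counts Monday as 0


def extra_sections(today):
    """Return (url, title_prefix) pairs to fetch on top of the daily sections.

    Hypothetical helper: inside a recipe, each pair would be passed to
    self.parse_section(url, title_prefix).
    """
    if today.weekday() == SATURDAY:
        return [('https://www.theguardian.com/theguardian/weekend', 'Weekend - ')]
    return []


if __name__ == '__main__':
    for url, prefix in extra_sections(date.today()):
        print(prefix, url)
```

Keeping the date test in a small function like this makes it easy to check a given day without waiting for the weekend.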

Del542 · 01-18-2018, 06:23 AM

I have an additional, though more basic, question about the Guardian feeds! Since the launch of the redesigned Guardian earlier this week, my Guardian feeds have been displaying several blank pages in each article after an introductory sentence and photo.

I have a few simple Guardian feeds created in epub format for my Windows 10 tablet running Calibre. They include various themes such as 'Guardian opinion' or 'Guardian Football'. I also try the default Guardian feed on Calibre but find that these more specific feeds are quicker to create.

However, for the last few days, although the articles in my custom Guardian feeds still appear and load, they contain many blank pages. Is there a simple setting in Calibre to help out with this?

Many thanks for any help here!

kovidgoyal · 01-18-2018, 06:24 AM

There is no simple setting. You have to add code to the recipe, in advanced mode, that cleans up the downloaded HTML.

Del542 · 01-18-2018, 01:22 PM

Quote:

Originally Posted by kovidgoyal

There is no simple setting. You have to add code to the recipe, in advanced mode, that cleans up the downloaded HTML.

Thanks for your quick help with this.

Have you got any examples of the sort of code that could help clean up the downloaded HTML?

kovidgoyal · 01-18-2018, 11:36 PM

https://manual.calibre-ebook.com/news.html
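For a flavour of what such cleanup code can look like, here is a minimal sketch of the kind of `keep_only_tags` / `remove_tags` rules a recipe can declare. The class names are assumptions about the Guardian's markup, and `has_class` is a hypothetical helper, not a calibre function. The block runs on its own so the matcher can be tried outside calibre; in a real recipe the two lists would be class attributes of the recipe.

```python
def has_class(name):
    """Build a matcher that is true when `name` is one of the space-separated classes."""
    return lambda value: bool(value) and name in value.split()


# Hypothetical cleanup rules, in the shape a recipe's keep_only_tags / remove_tags use.
# The class names below are assumptions about the page markup.
keep_only_tags = [
    dict(attrs={'class': has_class('content__main-column')}),
]
remove_tags = [
    dict(attrs={'class': has_class('submeta')}),
    dict(name=['link', 'meta', 'style']),
]

# Try the matcher by hand on a class attribute value and on a missing attribute
matcher = keep_only_tags[0]['attrs']['class']
print(matcher('content__main-column content__main-column--article'))
print(matcher(None))
```

Matching on whole class names (via `split()`) rather than substrings avoids accidentally keeping or removing elements whose class merely contains the same text.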

Omniscient1 · 02-11-2018, 05:03 PM

I haven't looked at the submitted recipe (I will do tomorrow) but here is mine.

It switches automatically to The Observer for Sunday's edition.

Code:

#!/usr/bin/env  python2
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'

'''
www.guardian.co.uk
'''
from calibre.web.feeds.news import BasicNewsRecipe
from datetime import date


class Guardian(BasicNewsRecipe):

    title = u'The Guardian'
    if date.today().weekday() == 6:
        title = u'The Observer'
        base_url = "http://www.guardian.co.uk/theobserver"
        cover_url = 'https://i.guim.co.uk/img/media/ec57a66a548b748cd586a8b63927aad0b167b80a/0_0_642_798/master/642.jpg?w=300&q=55&auto=format&usm=12&fit=max&'
        masthead_url = 'http://static.guim.co.uk/sys-images/Guardian/Pix/site_furniture/2010/10/19/1287478087992/The-Observer-001.gif'
    else:
        base_url = "http://www.guardian.co.uk/theguardian"
        # cover_pic = 'Guardian digital edition'
        # masthead_url = 'http://static.guim.co.uk/static/f76b43f9dcfd761f0ecf7099a127b603b2922118/common/images/logos/the-guardian/titlepiece.gif'
        cover_url = 'https://i.guim.co.uk/img/media/0dcdddf037927063ea4f420e8d5baecece39d5a4/0_0_1128_1403/master/1128.png?w=700&q=55&auto=format&usm=12&fit=max&'
        # masthead_url = 'https://assets.guim.co.uk/images/eada8aa27c12fe2d5afa3a89d3fbae0d/fallback-logo.png'
        masthead_url = 'http://www.logo-designer.co/wp-content/uploads/2018/01/2018-The-Guardian-logo-design.png'
    __author__ = 'Kovid Goyal'
    language = 'en_GB'

    oldest_article = 1
    max_articles_per_feed = 300
    remove_javascript = True
    encoding = 'utf-8'
    remove_empty_feeds = True
    no_stylesheets = True
    remove_attributes = ['style']
    ignore_duplicate_articles = {'title', 'url'}

    timefmt = ' [%a, %d %b %Y]'

    keep_only_tags = [
        dict(attrs={'class': lambda x: x and 'content__main-column' in x.split()}),
    ]
    remove_tags = [
        dict(attrs={'class': lambda x: x and '--twitter' in x}),
        dict(attrs={'class': lambda x: x and 'submeta' in x.split()}),
        dict(attrs={'data-component': ['share', 'social']}),
        dict(attrs={'data-link-name': 'block share'}),
        dict(attrs={'class': lambda x: x and 'inline-expand-image' in x}),
        dict(attrs={'class': lambda x: x and 'modern-visible' in x.split()}),
        dict(name=['link', 'meta', 'style']),
    ]
    remove_tags_after = [
        dict(attrs={'class': lambda x: x and 'content__article-body' in x.split()}),
    ]

    def preprocess_raw_html(self, raw, url):
        import html5lib
        from lxml import html
        return html.tostring(html5lib.parse(raw, namespaceHTMLElements=False, treebuilder='lxml'), encoding=unicode)

    def preprocess_html(self, soup):
        for img in soup.findAll('img', srcset=True):
            img['src'] = img['srcset'].partition(' ')[0]
            img['srcset'] = ''
        return soup

    def parse_section(self, url, title_prefix=''):
        feeds = []
        soup = self.index_to_soup(url)
        for section in soup.findAll('section'):
            title = title_prefix + self.tag_to_string(section.find(
                attrs={'class': 'fc-container__header__title'})).strip().capitalize()
            self.log('\nFound section:', title)
            feeds.append((title, []))
            for li in section.findAll('li'):
                for a in li.findAll('a', attrs={'data-link-name': 'article'}, href=True):
                    title = self.tag_to_string(a).strip()
                    url = a['href']
                    self.log(' ', title, url)
                    feeds[-1][1].append({'title': title, 'url': url})
                    break
        return feeds

    def parse_index(self):
        feeds = self.parse_section(self.base_url)
        weekday = date.today().weekday()  # Monday is 0, Saturday is 5, Sunday is 6
        if weekday == 5:
            # Saturday: Guardian weekend supplements
            feeds += self.parse_section(
                'https://www.theguardian.com/theguardian/family', 'Family - ')
            feeds += self.parse_section(
                'https://www.theguardian.com/theguardian/guardianreview', 'Guardian Review - ')
            feeds += self.parse_section(
                'https://www.theguardian.com/theguardian/weekend', 'Weekend Magazine - ')
            feeds += self.parse_section(
                'https://www.theguardian.com/theguardian/theguide', 'The Guide - ')
        elif weekday == 6:
            # Sunday: Observer sections
            feeds += self.parse_section(
                'https://www.theguardian.com/theobserver/new-review', 'New Review ')
            feeds += self.parse_section(
                'https://www.theguardian.com/theobserver/news/comment', 'Comment ')
            feeds += self.parse_section(
                'https://www.theguardian.com/theobserver/magazine', 'Observer Magazine ')
        else:
            # Monday to Friday: weekday-only sections
            feeds += self.parse_section(
                'https://www.theguardian.com/tone/obituaries/all', 'Obituaries - ')
            feeds += self.parse_section(
                'https://www.theguardian.com/uk/commentisfree', 'Editorial - ')
            feeds += self.parse_section(
                'https://www.theguardian.com/theguardian/g2', 'G2 - ')

        # Sport is fetched every day
        feeds += self.parse_section(
            'https://www.theguardian.com/uk/sport', 'Sport - ')
        return feeds
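
The preprocess_html step in the recipe keeps only the first candidate URL from each image's srcset attribute. A standalone sketch of that string handling follows; `first_srcset_url` is a hypothetical name, the example URLs are placeholders, and the `strip()` call is an addition that is not in the recipe.

```python
def first_srcset_url(srcset):
    """Return the URL of the first candidate in an <img srcset="..."> value."""
    # Candidates look like "URL descriptor, URL descriptor, ...";
    # everything before the first space is the first candidate's URL.
    return srcset.strip().partition(' ')[0]


example = 'https://example.com/a.jpg 300w, https://example.com/b.jpg 700w'
print(first_srcset_url(example))
```

In the recipe, the result is written to the image's src attribute and srcset is blanked.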
