Old 01-08-2019, 08:47 AM   #1
amj
Junior Member
amj began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Jan 2019
Device: Kindle Paperwhite
News Fetch from Scientific American failing

This turned out to be a known issue; I did not know that at the time of reporting.

Sorry for the inconvenience. Please delete this thread.

A subscription to Scientific American is required to fetch it.

I guess the recipe's description, which says the subscription is optional, should be corrected.
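
For reference, this is controlled by the recipe's needs_subscription attribute. A minimal sketch of what the correction might look like (assuming the rest of the recipe stays unchanged):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class ScientificAmerican(BasicNewsRecipe):
    title = u'Scientific American'
    # 'optional' (the current value) shows the login dialog but allows
    # it to be left empty; True makes calibre require a username and
    # password before downloading.
    needs_subscription = True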

Thanks!

Last edited by amj; 01-08-2019 at 09:08 AM. Reason: Known issue
Old 01-08-2019, 07:14 PM   #2
lui1
Enthusiast
lui1 began at the beginning.
 
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
Update for the Scientific American recipe

Actually, you can still download the first few sentences of each article, so you can see what they're about, and some articles do download completely. This code works for me if you want to try it.
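
If you want to test a recipe change from the command line first, something like this should work; the file name and credentials are just examples, and --test downloads only a couple of articles per feed:

Code:
ebook-convert scientific_american.recipe test.epub --test -vv --username you@example.com --password secret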

Update to the Scientific American Recipe
Code:
#!/usr/bin/env python2
__license__ = 'GPL v3'

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.utils.date import now
from css_selectors import Select


def absurl(url):
    if url.startswith('/'):
        url = 'http://www.scientificamerican.com' + url
    return url

keep_classes = {'article-header', 'article-content',
                'article-media', 'article-author', 'article-text'}
remove_classes = {'aside-banner', 'moreToExplore', 'article-footer'}


class ScientificAmerican(BasicNewsRecipe):
    title = u'Scientific American'
    description = u'Popular Science. Monthly magazine. Should be downloaded around the middle of each month.'
    category = 'science'
    __author__ = 'Kovid Goyal'
    no_stylesheets = True
    language = 'en'
    publisher = 'Nature Publishing Group'
    remove_empty_feeds = True
    remove_javascript = True
    timefmt = ' [%B %Y]'

    needs_subscription = 'optional'

    keep_only_tags = [
        dict(attrs={'class': lambda x: x and bool(
            set(x.split()).intersection(keep_classes))}),
    ]
    remove_tags = [
        dict(attrs={'class': lambda x: x and bool(
            set(x.split()).intersection(remove_classes))}),
        dict(id=['seeAlsoLinks']),
    ]

    def get_browser(self, *args):
        br = BasicNewsRecipe.get_browser(self)
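        # Log in only when credentials were supplied; without them the
        # download still works, but some articles are previews only.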
        if self.username and self.password:
            br.open('https://www.scientificamerican.com/my-account/login/')
            br.select_form(predicate=lambda f: f.attrs.get('id') == 'login')
            br['emailAddress'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        # Get the cover, date and issue URL
        root = self.index_to_soup(
            'http://www.scientificamerican.com/sciammag/', as_tree=True)
        select = Select(root)
        url = [x.get('content', '') for x in select('html > head meta')
               if x.get('property', None) == 'og:url'][0]
        self.cover_url = [x.get('src', '')
                          for x in select('main .product-detail__image picture img')][0]

        # Now parse the actual issue to get the list of articles
        select = Select(self.index_to_soup(url, as_tree=True))
        feeds = []
        for i, section in enumerate(select('#sa_body .toc-articles')):
            if i == 0:
                feeds.append(
                    ('Features', list(self.parse_sciam_features(select, section))))
            else:
                feeds.extend(self.parse_sciam_departments(select, section))

        return feeds

    def parse_sciam_features(self, select, section):
        for article in select('article[data-article-title]', section):
            title = article.get('data-article-title')
            for a in select('a[href]', article):
                url = absurl(a.get('href'))
                break
            desc = ''
            for p in select('p.t_body', article):
                desc = self.tag_to_string(p)
                break
            self.log('Found feature article: %s at %s' % (title, url))
            self.log('\t' + desc)
            yield {'title': title, 'url': url, 'description': desc}

    def parse_sciam_departments(self, select, section):
        section_title, articles = 'Unknown', []
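        # Each department-title span marks the start of a new section;
        # flush the articles collected for the previous section first.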
        for li in select('li[data-article-title]', section):
            for span in select('span.department-title', li):
                if articles:
                    yield section_title, articles
                section_title, articles = self.tag_to_string(span), []
                self.log('\nFound section: %s' % section_title)
                break
            for a in select('h2 a[href]', li):
                title = self.tag_to_string(a)
                url = absurl(a.get('href'))
                articles.append(
                    {'title': title, 'url': url, 'description': ''})
                self.log('\tFound article: %s at %s' % (title, url))
        if articles:
            yield section_title, articles
Old 01-09-2019, 04:07 AM   #3
amj
Junior Member
amj began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Jan 2019
Device: Kindle Paperwhite
Thanks a ton! Your code snippet works for me. Even the out-of-the-box version is working now; I can't understand why this happened.
Old 02-28-2019, 10:33 PM   #4
lui1
Enthusiast
lui1 began at the beginning.
 
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
Update for Scientific American

It seems they've made changes to the website again; the selectors that parse_index uses to find the issue URL and cover image no longer match. Here is an update that fixes the problem.
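
If the download breaks again, a quick way to see whether the issue page still matches the recipe's selectors is a small script run under calibre-debug (so calibre's bundled css_selectors module is importable). This is just a debugging sketch; the file name is an example, and the URL and selectors are the ones used in the recipe below:

Code:
#!/usr/bin/env python2
# Debugging sketch, not part of the recipe. Run it with:
#   calibre-debug -e check_sciam.py
import urllib2
from lxml import html
from css_selectors import Select

raw = urllib2.urlopen('http://www.scientificamerican.com/sciammag/').read()
select = Select(html.fromstring(raw))
# Count matches for the selectors parse_index relies on; a count of
# zero means the site markup has changed again.
for sel in ('main .store-listing__img a', 'main .store-listing__img img'):
    print('%s -> %d' % (sel, len(list(select(sel)))))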

Update for Scientific American:
Code:
#!/usr/bin/env python2
__license__ = 'GPL v3'

from calibre.web.feeds.news import BasicNewsRecipe
from css_selectors import Select


def absurl(url):
    if url.startswith('/'):
        url = 'http://www.scientificamerican.com' + url
    return url


keep_classes = {'article-header', 'article-content',
                'article-media', 'article-author', 'article-text'}
remove_classes = {'aside-banner', 'moreToExplore', 'article-footer'}


class ScientificAmerican(BasicNewsRecipe):
    title = u'Scientific American'
    description = u'Popular Science. Monthly magazine. Should be downloaded around the middle of each month.'
    category = 'science'
    __author__ = 'Kovid Goyal'
    no_stylesheets = True
    language = 'en'
    publisher = 'Nature Publishing Group'
    remove_empty_feeds = True
    remove_javascript = True
    timefmt = ' [%B %Y]'

    needs_subscription = 'optional'

    keep_only_tags = [
        dict(attrs={'class': lambda x: x and bool(
            set(x.split()).intersection(keep_classes))}),
    ]
    remove_tags = [
        dict(attrs={'class': lambda x: x and bool(
            set(x.split()).intersection(remove_classes))}),
        dict(id=['seeAlsoLinks']),
    ]

    def get_browser(self, *args):
        br = BasicNewsRecipe.get_browser(self)
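        # Log in only when credentials were supplied; without them the
        # download still works, but some articles are previews only.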
        if self.username and self.password:
            br.open('https://www.scientificamerican.com/my-account/login/')
            br.select_form(predicate=lambda f: f.attrs.get('id') == 'login')
            br['emailAddress'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        # Get the cover, date and issue URL
        root = self.index_to_soup(
            'http://www.scientificamerican.com/sciammag/', as_tree=True)
        select = Select(root)
        url = absurl([x.get('href', '')
                      for x in select('main .store-listing__img a')][0])
        self.cover_url = [x.get('src', '')
                          for x in select('main .store-listing__img img')][0]

        # Now parse the actual issue to get the list of articles
        select = Select(self.index_to_soup(url, as_tree=True))
        feeds = []
        for i, section in enumerate(select('#sa_body .toc-articles')):
            if i == 0:
                feeds.append(
                    ('Features', list(self.parse_sciam_features(select, section))))
            else:
                feeds.extend(self.parse_sciam_departments(select, section))

        return feeds

    def parse_sciam_features(self, select, section):
        for article in select('article[data-article-title]', section):
            title = article.get('data-article-title')
            for a in select('a[href]', article):
                url = absurl(a.get('href'))
                break
            desc = ''
            for p in select('p.t_body', article):
                desc = self.tag_to_string(p)
                break
            self.log('Found feature article: %s at %s' % (title, url))
            self.log('\t' + desc)
            yield {'title': title, 'url': url, 'description': desc}

    def parse_sciam_departments(self, select, section):
        section_title, articles = 'Unknown', []
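        # Each department-title span marks the start of a new section;
        # flush the articles collected for the previous section first.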
        for li in select('li[data-article-title]', section):
            for span in select('span.department-title', li):
                if articles:
                    yield section_title, articles
                section_title, articles = self.tag_to_string(span), []
                self.log('\nFound section: %s' % section_title)
                break
            for a in select('h2 a[href]', li):
                title = self.tag_to_string(a)
                url = absurl(a.get('href'))
                articles.append(
                    {'title': title, 'url': url, 'description': ''})
                self.log('\tFound article: %s at %s' % (title, url))
        if articles:
            yield section_title, articles

Tags
recipe, scientific american

