01-08-2019, 08:47 AM | #1
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2019
Device: Kindle Paperwhite
News Fetch from Scientific American failing
This was a known issue. I did not know that at the time of reporting.
Sorry for the inconvenience; please delete this thread. A subscription to Scientific American is required to fetch it, so the recipe's note that a subscription is optional should probably be corrected. Thanks!

Last edited by amj; 01-08-2019 at 09:08 AM. Reason: Known issue
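For reference, the behavior being discussed is controlled by the recipe's `needs_subscription` attribute from the calibre `BasicNewsRecipe` API; a minimal fragment (not runnable outside calibre). Code:

```python
from calibre.web.feeds.news import BasicNewsRecipe


class ScientificAmerican(BasicNewsRecipe):
    # 'optional' lets the fetch run without credentials (articles may come
    # back truncated); True would require a username/password up front.
    needs_subscription = 'optional'
```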
01-08-2019, 07:14 PM | #2
Enthusiast
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
Update for Scientific American
Actually, you can still download the first few sentences of each article, so you can see what they cover, and some articles do download completely. This code works for me if you want to try it.
Updated Scientific American recipe. Code:
#!/usr/bin/env python2
__license__ = 'GPL v3'

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.utils.date import now
from css_selectors import Select


def absurl(url):
    if url.startswith('/'):
        url = 'http://www.scientificamerican.com' + url
    return url


keep_classes = {'article-header', 'article-content', 'article-media',
                'article-author', 'article-text'}
remove_classes = {'aside-banner', 'moreToExplore', 'article-footer'}


class ScientificAmerican(BasicNewsRecipe):
    title = u'Scientific American'
    description = u'Popular Science. Monthly magazine. Should be downloaded around the middle of each month.'
    category = 'science'
    __author__ = 'Kovid Goyal'
    no_stylesheets = True
    language = 'en'
    publisher = 'Nature Publishing Group'
    remove_empty_feeds = True
    remove_javascript = True
    timefmt = ' [%B %Y]'
    needs_subscription = 'optional'

    keep_only_tags = [
        dict(attrs={'class': lambda x: x and bool(
            set(x.split()).intersection(keep_classes))}),
    ]
    remove_tags = [
        dict(attrs={'class': lambda x: x and bool(
            set(x.split()).intersection(remove_classes))}),
        dict(id=['seeAlsoLinks']),
    ]

    def get_browser(self, *args):
        br = BasicNewsRecipe.get_browser(self)
        if self.username and self.password:
            br.open('https://www.scientificamerican.com/my-account/login/')
            br.select_form(predicate=lambda f: f.attrs.get('id') == 'login')
            br['emailAddress'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        # Get the cover, date and issue URL
        root = self.index_to_soup(
            'http://www.scientificamerican.com/sciammag/', as_tree=True)
        select = Select(root)
        url = [x.get('content', '') for x in select('html > head meta')
               if x.get('property', None) == "og:url"][0]
        self.cover_url = [x.get('src', '') for x in select(
            'main .product-detail__image picture img')][0]
        # Now parse the actual issue to get the list of articles
        select = Select(self.index_to_soup(url, as_tree=True))
        feeds = []
        for i, section in enumerate(select('#sa_body .toc-articles')):
            if i == 0:
                feeds.append(
                    ('Features', list(self.parse_sciam_features(select, section))))
            else:
                feeds.extend(self.parse_sciam_departments(select, section))
        return feeds

    def parse_sciam_features(self, select, section):
        for article in select('article[data-article-title]', section):
            title = article.get('data-article-title')
            for a in select('a[href]', article):
                url = absurl(a.get('href'))
                break
            desc = ''
            for p in select('p.t_body', article):
                desc = self.tag_to_string(p)
                break
            self.log('Found feature article: %s at %s' % (title, url))
            self.log('\t' + desc)
            yield {'title': title, 'url': url, 'description': desc}

    def parse_sciam_departments(self, select, section):
        section_title, articles = 'Unknown', []
        for li in select('li[data-article-title]', section):
            for span in select('span.department-title', li):
                if articles:
                    yield section_title, articles
                section_title, articles = self.tag_to_string(span), []
                self.log('\nFound section: %s' % section_title)
                break
            for a in select('h2 a[href]', li):
                title = self.tag_to_string(a)
                url = absurl(a.get('href'))
                articles.append(
                    {'title': title, 'url': url, 'description': ''})
                self.log('\tFound article: %s at %s' % (title, url))
        if articles:
            yield section_title, articles
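The `keep_only_tags`/`remove_tags` entries in the recipe match on partial class lists rather than exact class attributes. A minimal standalone sketch of that predicate (the `matches` helper name is mine, not part of the recipe). Code:

```python
# Sketch of the class-matching predicate used by keep_only_tags /
# remove_tags: an element matches when any of its space-separated
# CSS classes appears in the target set.
keep_classes = {'article-header', 'article-content', 'article-media',
                'article-author', 'article-text'}


def matches(class_attr, targets):
    # class_attr is the raw HTML class attribute value; may be empty or None
    return bool(class_attr and set(class_attr.split()).intersection(targets))


print(matches('article-text mura-region', keep_classes))  # True
print(matches('aside-banner', keep_classes))              # False
```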
01-09-2019, 04:07 AM | #3
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2019
Device: Kindle Paperwhite
Thanks a ton! Your code snippet works for me. Even the out-of-the-box version is working again now; I can't understand why this happened.
02-28-2019, 10:33 PM | #4
Enthusiast
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
Update for Scientific American
It seems they've made changes to the website again. Here is an update that fixes the problem.
Updated Scientific American recipe. Code:
#!/usr/bin/env python2
__license__ = 'GPL v3'

from calibre.web.feeds.news import BasicNewsRecipe
from css_selectors import Select


def absurl(url):
    if url.startswith('/'):
        url = 'http://www.scientificamerican.com' + url
    return url


keep_classes = {'article-header', 'article-content', 'article-media',
                'article-author', 'article-text'}
remove_classes = {'aside-banner', 'moreToExplore', 'article-footer'}


class ScientificAmerican(BasicNewsRecipe):
    title = u'Scientific American'
    description = u'Popular Science. Monthly magazine. Should be downloaded around the middle of each month.'
    category = 'science'
    __author__ = 'Kovid Goyal'
    no_stylesheets = True
    language = 'en'
    publisher = 'Nature Publishing Group'
    remove_empty_feeds = True
    remove_javascript = True
    timefmt = ' [%B %Y]'
    needs_subscription = 'optional'

    keep_only_tags = [
        dict(attrs={'class': lambda x: x and bool(
            set(x.split()).intersection(keep_classes))}),
    ]
    remove_tags = [
        dict(attrs={'class': lambda x: x and bool(
            set(x.split()).intersection(remove_classes))}),
        dict(id=['seeAlsoLinks']),
    ]

    def get_browser(self, *args):
        br = BasicNewsRecipe.get_browser(self)
        if self.username and self.password:
            br.open('https://www.scientificamerican.com/my-account/login/')
            br.select_form(predicate=lambda f: f.attrs.get('id') == 'login')
            br['emailAddress'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        # Get the cover, date and issue URL
        root = self.index_to_soup(
            'http://www.scientificamerican.com/sciammag/', as_tree=True)
        select = Select(root)
        url = [x.get('href', '') for x in select('main .store-listing__img a')][0]
        url = absurl(url)
        self.cover_url = [x.get('src', '') for x in select(
            'main .store-listing__img img')][0]
        # Now parse the actual issue to get the list of articles
        select = Select(self.index_to_soup(url, as_tree=True))
        feeds = []
        for i, section in enumerate(select('#sa_body .toc-articles')):
            if i == 0:
                feeds.append(
                    ('Features', list(self.parse_sciam_features(select, section))))
            else:
                feeds.extend(self.parse_sciam_departments(select, section))
        return feeds

    def parse_sciam_features(self, select, section):
        for article in select('article[data-article-title]', section):
            title = article.get('data-article-title')
            for a in select('a[href]', article):
                url = absurl(a.get('href'))
                break
            desc = ''
            for p in select('p.t_body', article):
                desc = self.tag_to_string(p)
                break
            self.log('Found feature article: %s at %s' % (title, url))
            self.log('\t' + desc)
            yield {'title': title, 'url': url, 'description': desc}

    def parse_sciam_departments(self, select, section):
        section_title, articles = 'Unknown', []
        for li in select('li[data-article-title]', section):
            for span in select('span.department-title', li):
                if articles:
                    yield section_title, articles
                section_title, articles = self.tag_to_string(span), []
                self.log('\nFound section: %s' % section_title)
                break
            for a in select('h2 a[href]', li):
                title = self.tag_to_string(a)
                url = absurl(a.get('href'))
                articles.append(
                    {'title': title, 'url': url, 'description': ''})
                self.log('\tFound article: %s at %s' % (title, url))
        if articles:
            yield section_title, articles
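Both versions of the recipe share the same small `absurl()` helper: the TOC links on the site are site-relative, so they are prefixed with the magazine's host before being handed to the downloader. Copied out on its own for illustration. Code:

```python
# The recipe's absurl() helper: prefix site-relative links with the
# Scientific American host; leave already-absolute URLs untouched.
def absurl(url):
    if url.startswith('/'):
        url = 'http://www.scientificamerican.com' + url
    return url


print(absurl('/article/some-story/'))
# http://www.scientificamerican.com/article/some-story/
print(absurl('http://www.scientificamerican.com/sciammag/'))  # unchanged
```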
Tags
recipe, scientific american