04-23-2019, 08:49 PM | #1 |
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2019
Device: kindle
|
Help with the Global Times recipe
calibre is so powerful, but I recently found that the Global Times recipe no longer works. Could someone update it? The site address is
http://www.globaltimes.cn/ and the version of calibre I'm using is 3.41. Thanks a lot. |
04-24-2019, 12:59 AM | #2 |
Enthusiast
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
|
Global Times Update
It looks like the links were broken and the structure of the markup changed. So it needed to be redone.
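For anyone curious how the article links get picked out of a section page, here is a minimal sketch, assuming article URLs look like http://www.globaltimes.cn/content/<digits>.shtml (the pattern the recipe below relies on). The helper name `is_article_link` is made up for illustration and is not part of calibre, and unlike the recipe's pattern the dots in the host name are escaped here.

```python
import re

# Pattern modelled on the one used in the recipe; the host-name dots are
# escaped here, which is an assumption added for strictness.
ARTICLE_URL = re.compile(
    r'https?://www\.globaltimes\.cn/content/[0-9]{4,10}[.]shtml'
)


def is_article_link(href):
    """Return True if href looks like a Global Times article URL.

    Hypothetical helper for illustration only.
    """
    if not href:
        return False
    return ARTICLE_URL.search(href.strip()) is not None


if __name__ == '__main__':
    samples = [
        'http://www.globaltimes.cn/content/5555555.shtml',
        'http://www.globaltimes.cn/china/politics/',
        'http://www.globaltimes.cn/content/123.shtml',
    ]
    for href in samples:
        print(href, is_article_link(href))
```

Section pages and links with fewer than four digits are rejected, so only article pages would end up in the feed.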
Update for Global Times: Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re


def check_words(words):
    return lambda x: x and frozenset(words.split()).intersection(x.split())


class GlobalTimes(BasicNewsRecipe):
    title = u'Global Times'
    __author__ = 'Jose Ortiz'  # lui1 at mobileread.com
    language = 'en_CN'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    keep_only_tags = [
        {
            'class': check_words('article-title article-source row-content')
        },
    ]

    extra_css = '''
        .article-title {
            font-family:Arial,Helvetica,sans-serif;
            font-weight:bold;font-size:large;
        }

        .article-source, .row-content {
            font-family:Arial,Helvetica,sans-serif;
            font-size:small;
        }
    '''

    def parse_index(self):
        catnames = {}
        catnames["http://www.globaltimes.cn/china/politics/"] = "China Politics"
        catnames["http://www.globaltimes.cn/china/diplomacy/"] = "China Diplomacy"
        catnames["http://www.globaltimes.cn/china/military/"] = "China Military"
        catnames["http://www.globaltimes.cn/business/economy/"] = "China Economy"
        catnames["http://www.globaltimes.cn/world/asia-pacific/"] = "Asia Pacific"
        feeds = []

        for cat in catnames.keys():
            articles = []
            soup = self.index_to_soup(cat)
            for a in soup.findAll('a', attrs={'href': re.compile(r'https?://www.globaltimes.cn/content/[0-9]{4,10}[.]shtml')}):
                # Typical url http://www.globaltimes.cn/content/5555555.shtml
                url = a['href'].strip()
                title = self.tag_to_string(a).strip()
                if not title:
                    continue
                myarticle = ({'title': title,
                              'url': url,
                              'description': '',
                              'date': ''})
                self.log("found '%s'" % title)
                articles.append(myarticle)
                self.log("Adding URL %s\n" % url)
            if articles:
                feeds.append((catnames[cat], articles))
        return feeds
|
04-24-2019, 05:45 AM | #3 |
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2019
Device: kindle
|
Your help is greatly appreciated! I just tested it and the recipe works perfectly, but there is too much blank space between paragraphs in the articles, which hurts the reading experience. It would be perfect if this could be fixed! Thanks again.
|
04-24-2019, 03:20 PM | #4 |
Enthusiast
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
|
Update for the Global Times
You're welcome. The problem is that they are using br line breaks instead of p elements to separate their paragraphs. This update should improve the layout.
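To show the idea in isolation, here is a minimal sketch that collapses runs of br tags into a single p tag using plain `re`, outside of calibre. The pattern follows the one in the recipe below; the `&nbsp;` alternative is an assumption (that spot shows up as a blank in the post), and the function name `collapse_breaks` is made up for illustration.

```python
import re

# Two to nine consecutive <br>, <br/>, <br /> or </br> tags, each optionally
# followed by whitespace, a non-breaking space character, or the &nbsp;
# entity (the entity form is an assumption).
BR_RUN = re.compile(
    r'(?:<(?:br(?:\s*/)?|/br\s*)>(?:\s|\xA0|&nbsp;)*){2,9}',
    re.U | re.I,
)


def collapse_breaks(html):
    """Replace each run of repeated line breaks with one <p> tag.

    Hypothetical helper for illustration only.
    """
    return BR_RUN.sub('<p>', html)


if __name__ == '__main__':
    raw = 'First paragraph.<br><br>&nbsp;Second paragraph.<br />\n<br />Third.'
    print(collapse_breaks(raw))
    # A single <br> is left alone because the run must contain at least two.
    print(collapse_breaks('line one<br>line two'))
```

In the recipe the same substitution is done through `preprocess_regexps`, which calibre applies to the downloaded HTML before parsing it.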
Update for Global Times: Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re


def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(
        attrs={'class': lambda x: x and frozenset(x.split()).intersection(q)}
    )


class GlobalTimes(BasicNewsRecipe):
    title = u'Global Times'
    __author__ = 'Jose Ortiz'  # lui1 at mobileread.com
    language = 'en_CN'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    keep_only_tags = [classes('article-title article-source row-content')]

    # Collapse runs of 2 to 9 br tags (with optional whitespace or
    # non-breaking spaces between them) into a single <p>.
    preprocess_regexps = [
        (re.compile(r'(?:<(?:br(?:\s*/)?|/br\s*)>(?:\s|'
                    '\xA0'
                    r'|&nbsp;)*){2,9}',
                    re.U | re.I),
         lambda match: '<p>')
    ]

    extra_css = '''
        :root {
            font-family: Arial, Helvetica, sans-serif;
        }

        .article-title {
            font-weight: bold;
            font-size: large;
        }

        .article-source, .row-content {
            font-size:small;
        }
    '''

    def parse_index(self):
        catnames = {}
        catnames["http://www.globaltimes.cn/china/politics/"] = "China Politics"
        catnames["http://www.globaltimes.cn/china/diplomacy/"] = "China Diplomacy"
        catnames["http://www.globaltimes.cn/china/military/"] = "China Military"
        catnames["http://www.globaltimes.cn/business/economy/"] = "China Economy"
        catnames["http://www.globaltimes.cn/world/asia-pacific/"] = "Asia Pacific"
        feeds = []

        for cat in catnames.keys():
            articles = []
            soup = self.index_to_soup(cat)
            for a in soup.findAll(
                'a',
                attrs={
                    'href':
                    re.compile(
                        r'https?://www.globaltimes.cn/content/[0-9]{4,10}[.]shtml'
                    )
                }
            ):
                # Typical url http://www.globaltimes.cn/content/5555555.shtml
                url = a['href'].strip()
                title = self.tag_to_string(a).strip()
                if not title:
                    continue
                myarticle = ({
                    'title': title,
                    'url': url,
                    'description': '',
                    'date': ''
                })
                self.log("found '%s'" % title)
                articles.append(myarticle)
                self.log("Adding URL %s\n" % url)
            if articles:
                feeds.append((catnames[cat], articles))
        return feeds

    def postprocess_html(self, soup, first_fetch):
        # Remove empty p elements left over after the br replacement.
        for p in [p for p in soup('p') if len(p) == 0]:
            p.extract()
        return soup

Last edited by lui1; 04-24-2019 at 07:27 PM. Reason: made a few more improvements to the recipe
|
04-24-2019, 09:48 PM | #5 |
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2019
Device: kindle
|
Thanks so much, and thanks for responding so quickly.
|