11-15-2010, 01:00 PM | #1 |
Junior Member
Posts: 1
Karma: 10
Join Date: Nov 2010
Device: Kobo
|
Globe and Mail Recipe Rewrite..
Here is a rewrite of the Globe & Mail recipe.
Code:
class AdvancedUserRecipe1287083651(BasicNewsRecipe): title = u'Globe & Mail' __license__ = 'GPL v3' __copyright__ = '2010, Szing' oldest_article = 2 no_stylesheets = True max_articles_per_feed = 100 encoding = 'utf8' publisher = 'Globe & Mail' category = 'news, Canada, world' language = 'en_CA' extra_css = 'p.meta {font-size:75%}\n .redtext {color: red;}\n .byline {font-size: 70%}' feeds = [ (u'Top National Stories', u'http://www.theglobeandmail.com/news/national/?service=rss'), (u'Business', u'http://www.theglobeandmail.com/report-on-business/?service=rss'), (u'Commentary', u'http://www.theglobeandmail.com/report-on-business/commentary/?service=rss'), (u'Blogs', u'http://www.theglobeandmail.com/blogs/?service=rss'), (u'Facts & Arguments', u'http://www.theglobeandmail.com/life/facts-and-arguments/?service=rss'), (u'Technology', u'http://www.theglobeandmail.com/news/technology/?service=rss'), (u'Investing', u'http://www.theglobeandmail.com/globe-investor/?service=rss'), (u'Top Polical Stories', u'http://www.theglobeandmail.com/news/politics/?service=rss'), (u'Arts', u'http://www.theglobeandmail.com/news/arts/?service=rss'), (u'Life', u'http://www.theglobeandmail.com/life/?service=rss'), (u'Real Estate', u'http://www.theglobeandmail.com/real-estate/?service=rss'), (u'Auto', u'http://www.theglobeandmail.com/sports/?service=rss'), (u'Sports', u'http://www.theglobeandmail.com/auto/?service=rss') ] keep_only_tags = [ dict(name='h1'), dict(name='h2', attrs={'id':'articletitle'}), dict(name='p', attrs={'class':['leadText', 'meta', 'leadImage', 'redtext byline', 'bodyText']}), dict(name='div', attrs={'class':['news','articlemeta','articlecopy']}), dict(name='id', attrs={'class':'article'}), dict(name='table', attrs={'class':'todays-market'}), dict(name='header', attrs={'id':'leadheader'}) ] remove_tags = [ dict(name='div', attrs={'id':['tabInside', 'ShareArticles', 'topStories']}) ] #this has to be here or the text in the article appears twice. remove_tags_after = [dict(id='article')] #Use the mobile version rather than the web version def print_version(self, url): return url + '&service=mobile' |
01-12-2011, 02:57 PM | #2 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Further refinement to Globe&Mail recipe
Thanks man, it's a great start! I tried to refine it somewhat further.
Key changes:
Outstanding issues:
Thoughts? Comments? - guterm Code:
#!/usr/bin/env python __license__ = 'GPL v3' __copyright__ = '2011, Szing, guterm' __docformat__ = 'restructuredtext en' ''' globeandmail.com ''' from calibre.web.feeds.news import BasicNewsRecipe class TheGlobeAndMailAdvancedRecipe(BasicNewsRecipe): title = u'The Globe And Mail' __license__ = 'GPL v3' __author__ = 'Szing, guterm' oldest_article = 2 no_stylesheets = True max_articles_per_feed = 100 encoding = 'utf8' publisher = 'Globe & Mail' language = 'en_CA' extra_css = 'p.meta {font-size:75%}\n .redtext {color: red;}\n .byline {font-size: 70%}' feeds = [ (u'Top National Stories', u'http://www.theglobeandmail.com/news/national/?service=rss'), (u'Business', u'http://www.theglobeandmail.com/report-on-business/?service=rss'), (u'Investing', u'http://www.theglobeandmail.com/globe-investor/?service=rss'), (u'Politics', u'http://www.theglobeandmail.com/news/politics/?service=rss'), (u'Commentary', u'http://www.theglobeandmail.com/news/opinions/?service=rss'), (u'Toronto', u'http://www.theglobeandmail.com/news/national/toronto/?service=rss'), (u'Facts & Arguments', u'http://www.theglobeandmail.com/life/facts-and-arguments/?service=rss'), (u'Technology', u'http://www.theglobeandmail.com/news/technology/?service=rss'), (u'Arts', u'http://www.theglobeandmail.com/news/arts/?service=rss'), (u'Life', u'http://www.theglobeandmail.com/life/?service=rss'), (u'Real Estate', u'http://www.theglobeandmail.com/real-estate/?service=rss'), (u'Auto', u'http://www.theglobeandmail.com/auto/?service=rss'), (u'Sports', u'http://www.theglobeandmail.com/sports/?service=rss') ] keep_only_tags = [ dict(name='h1'), dict(name='h2', attrs={'id':'articletitle'}), dict(name='p', attrs={'class':['leadText', 'meta', 'leadImage', 'redtext byline', 'bodyText']}), dict(name='div', attrs={'class':['news','articlemeta','articlecopy']}), dict(name='id', attrs={'class':'article'}), dict(name='table', attrs={'class':'todays-market'}), dict(name='header', attrs={'id':'leadheader'}) ] remove_tags = [ dict(name='ul', attrs={'class':['pillboxcontainer arttoolsbpbx']}), dict(name='div', attrs={'class':['relcont', 'articleTools', 'ypad fontsmall', 'pagination']}), dict(name='a', attrs={'href':['javascript:void(0);', 'http://m.yp.ca?tracking=globeandmail']}), dict(name='div', attrs={'id':['ShareArticles', 'topStories']}) ] def postprocess_html(self, soup, first_fetch): # Find and preserve single page article layout, can be first or last allArts = soup.findAll(True, {'id':'article'}) if len(allArts)==2: if(len(allArts[0].contents)>len(allArts[1].contents)): allArts[1].extract() else: allArts[0].extract() return soup def parse_feeds(self, *args, **kwargs): parsed_feeds = BasicNewsRecipe.parse_feeds(self, *args, **kwargs) # Eliminate the duplicates urlSet = set() for feed in parsed_feeds: newArticles = [] for article in feed: if article.url in urlSet: feed.articles.remove( article ) else: urlSet.add(article.url) newArticles.append(article) feed.articles = newArticles return parsed_feeds # cover_url = 'http://www.freewarepocketpc.net/wp7/img/the-globe-and-mail.png' #Use the mobile version rather than the web version def print_version(self, url): return url.replace('cmpid=rss1','service=mobile') |
Advert | |
|
01-15-2011, 10:43 PM | #3 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
And here is even more polished recipe.
Failing downloads are fixed, sending right away to the mobile site avoiding redirect, other minor tweaks. /guterm Code:
#!/usr/bin/env python __license__ = 'GPL v3' __copyright__ = '2011, Szing, guterm' __docformat__ = 'restructuredtext en' ''' globeandmail.com ''' from calibre.web.feeds.news import BasicNewsRecipe class TheGlobeAndMailAdvancedRecipe(BasicNewsRecipe): title = u'The Globe And Mail' __license__ = 'GPL v3' __author__ = 'Szing, guterm' oldest_article = 2 no_stylesheets = True max_articles_per_feed = 100 encoding = 'utf8' publisher = 'Globe & Mail' language = 'en_CA' extra_css = 'p.meta {font-size:75%}\n .redtext {color: red;}\n .byline {font-size: 70%}' feeds = [ (u'Top National Stories', u'http://www.theglobeandmail.com/news/national/?service=rss'), (u'Business', u'http://www.theglobeandmail.com/report-on-business/?service=rss'), (u'Investing', u'http://www.theglobeandmail.com/globe-investor/?service=rss'), (u'Politics', u'http://www.theglobeandmail.com/news/politics/?service=rss'), (u'Commentary', u'http://www.theglobeandmail.com/news/opinions/?service=rss'), (u'Toronto', u'http://www.theglobeandmail.com/news/national/toronto/?service=rss'), (u'Facts & Arguments', u'http://www.theglobeandmail.com/life/facts-and-arguments/?service=rss'), (u'Technology', u'http://www.theglobeandmail.com/news/technology/?service=rss'), (u'Arts', u'http://www.theglobeandmail.com/news/arts/?service=rss'), (u'Life', u'http://www.theglobeandmail.com/life/?service=rss'), (u'Real Estate', u'http://www.theglobeandmail.com/real-estate/?service=rss'), (u'Auto', u'http://www.theglobeandmail.com/auto/?service=rss'), (u'Sports', u'http://www.theglobeandmail.com/sports/?service=rss') ] keep_only_tags = [ dict(name='h1'), dict(name='h2', attrs={'id':'articletitle'}), dict(name='p', attrs={'class':['leadText', 'meta', 'leadImage', 'redtext byline', 'bodyText']}), dict(name='div', attrs={'class':['news','articlemeta','articlecopy','columnist', 'blog']}), dict(name='id', attrs={'class':'article'}), dict(name='table', attrs={'class':'todays-market'}), dict(name='header', attrs={'id':'leadheader'}) ] remove_tags = [ dict(name='ul', attrs={'class':['pillboxcontainer arttoolsbpbx']}), dict(name='div', attrs={'class':['relcont', 'articleTools', 'ypad fontsmall', 'pagination']}), dict(name='a', attrs={'href':['javascript:void(0);', 'http://m.yp.ca?tracking=globeandmail']}), dict(name='div', attrs={'id':['ShareArticles', 'topStories', 'seealsobottom']}) ] def postprocess_html(self, soup, first_fetch): # Find and preserve single page article layout, can be first or last allArts = soup.findAll(True, {'id':'article'}) if len(allArts)==2: if(len(allArts[0].contents)>len(allArts[1].contents)): allArts[1].extract() else: allArts[0].extract() return soup def parse_feeds(self, *args, **kwargs): parsed_feeds = BasicNewsRecipe.parse_feeds(self, *args, **kwargs) # Eliminate the duplicates urlSet = set() for feed in parsed_feeds: newArticles = [] for article in feed: if article.url in urlSet: feed.articles.remove( article ) else: urlSet.add(article.url) newArticles.append(article) feed.articles = newArticles return parsed_feeds # cover_url = 'http://www.freewarepocketpc.net/wp7/img/the-globe-and-mail.png' #Use the mobile version rather than the web version def print_version(self, url): return (url.replace('cmpid=rss1','service=mobile')).replace('http://www.','http://m.') |
01-16-2011, 12:20 AM | #4 |
Connoisseur
Posts: 99
Karma: 170
Join Date: Nov 2010
Location: Airdrie Alberta
Device: Sony 650
|
More articles deleted
"Added code to find and eliminate duplicating articles across multiple feeds (i.e., the same article showing up in Business and Investing)"
Spoiler:
If you run the recipe without this and with it you will notice that when it deletes an article it also deletes the article below it that does not occur anywhere else so you actually lose articles. Also Spoiler:
All this did was get rid of the links to the rest of the pages but did not add the rest of the article from the other pages |
01-16-2011, 01:19 AM | #5 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Guys, when you settle on a final recipe, let me know so I can update the builtin one in calibre.
|
Advert | |
|
01-16-2011, 01:49 PM | #6 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
mufc, are you trying to cut and paste pieces into old recipe? It will not work.
It feels like you still have this in your recipe: Code:
#this has to be here or the text in the article appears twice. remove_tags_after = [dict(id='article')] |
01-16-2011, 03:02 PM | #7 |
Connoisseur
Posts: 99
Karma: 170
Join Date: Nov 2010
Location: Airdrie Alberta
Device: Sony 650
|
U R RIGHT
I was testing that code in another globe recipe and having problems. I tried yours this morning and it works fine. My apologies.
Let me ask you a question. The code for removing articles. Is it carried out prior to downloading. Why I ask this is my recipe does not include all sections of the G + M. The extra articles I lose could be because they show up first in a section that I do not download. Is that possible ? Could not get the single page layout to work at all. Maybe I should just modify your recipe to suit my Sony 650. Your recipe seems to have given me more questions than answers. How much of this is generic and could they be adapted to other recipes Spoiler:
Also this is the first instance I have seen of using the mobile instead of web version. What is your reasoning and could it be adapted for other recipes ? |
01-16-2011, 04:39 PM | #8 |
Connoisseur
Posts: 99
Karma: 170
Join Date: Nov 2010
Location: Airdrie Alberta
Device: Sony 650
|
I use you recipe now with a couple of tweaks for my Sony 650. Your recipe is quite brilliant actually. I only wish I could learn how to adapt your ideas into a couple of my own recipes
|
01-21-2011, 01:05 PM | #9 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Thanks man, really appreciate good word!
Would you mind sharing back your "couple of tweaks" for all us Globe fans? |
01-21-2011, 09:06 PM | #10 |
Connoisseur
Posts: 99
Karma: 170
Join Date: Nov 2010
Location: Airdrie Alberta
Device: Sony 650
|
This is made for my Sony 650. Gets rid of images and hyperlinks etc.
Spoiler:
|
Tags |
canada, globe & mail, news |
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
WSJ: E-Books Rewrite Bookselling | markbot | News | 21 | 05-25-2010 01:02 PM |
NYT: Textbooks That Professors Can Rewrite Digitally | ekaser | News | 7 | 03-01-2010 01:27 PM |
Calibre Globe and Mail recipe and Sony PRS-600 | elvenic | Calibre | 13 | 01-15-2010 12:06 PM |
Help with Boston Globe RSS recipe | horsegoalie | Calibre | 14 | 12-17-2009 06:56 PM |
Greetings and offering to help on the wiki rewrite | orchidpop | Introduce Yourself | 10 | 05-11-2008 11:12 PM |