MobileRead Forums - View Single Post

Szing · 11-15-2010, 01:00 PM

Here is a rewrite of the Globe & Mail recipe.

It has no ads
It has pictures related to the article
News download size is around 1.5-2 MB (epub)
It may have a problem with some multi-page articles
It does not have author pictures

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1287083651(BasicNewsRecipe):
    title          = u'Globe & Mail'
    __license__   = 'GPL v3'
    __copyright__ = '2010, Szing'
    oldest_article = 2
    no_stylesheets = True
    max_articles_per_feed = 100
    encoding               = 'utf8'
    publisher              = 'Globe & Mail'
    category               = 'news, Canada, world'
    language               = 'en_CA'
    extra_css = 'p.meta {font-size:75%}\n .redtext {color: red;}\n .byline {font-size: 70%}'

    feeds          = [
      (u'Top National Stories', u'http://www.theglobeandmail.com/news/national/?service=rss'),
      (u'Business', u'http://www.theglobeandmail.com/report-on-business/?service=rss'),
      (u'Commentary', u'http://www.theglobeandmail.com/report-on-business/commentary/?service=rss'),
      (u'Blogs', u'http://www.theglobeandmail.com/blogs/?service=rss'),
      (u'Facts & Arguments', u'http://www.theglobeandmail.com/life/facts-and-arguments/?service=rss'),
      (u'Technology', u'http://www.theglobeandmail.com/news/technology/?service=rss'),
      (u'Investing', u'http://www.theglobeandmail.com/globe-investor/?service=rss'),
      (u'Top Political Stories', u'http://www.theglobeandmail.com/news/politics/?service=rss'),
      (u'Arts', u'http://www.theglobeandmail.com/news/arts/?service=rss'),
      (u'Life', u'http://www.theglobeandmail.com/life/?service=rss'),
      (u'Real Estate', u'http://www.theglobeandmail.com/real-estate/?service=rss'),
      (u'Auto', u'http://www.theglobeandmail.com/auto/?service=rss'),
      (u'Sports', u'http://www.theglobeandmail.com/sports/?service=rss')
    ]

    keep_only_tags = [
      dict(name='h1'),
      dict(name='h2', attrs={'id':'articletitle'}),
      dict(name='p', attrs={'class':['leadText', 'meta', 'leadImage', 'redtext byline', 'bodyText']}),
      dict(name='div', attrs={'class':['news','articlemeta','articlecopy']}),
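      # NOTE: 'id' is not an HTML tag name, so this entry probably matches nothing;
      # the intent may have been dict(attrs={'id':'article'}).
      dict(name='id', attrs={'class':'article'}),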
      dict(name='table', attrs={'class':'todays-market'}),
      dict(name='header', attrs={'id':'leadheader'})
    ]

    remove_tags = [
      dict(name='div', attrs={'id':['tabInside', 'ShareArticles', 'topStories']})
    ]

    # This has to be here, or the text in the article appears twice.
    remove_tags_after = [dict(id='article')]

    # Use the mobile version rather than the web version.
    def print_version(self, url):
        return url + '&service=mobile'
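
The print_version method above appends '&service=mobile', which assumes every article URL from the feed already carries a query string. If that ever turns out not to be the case, something along these lines might be more forgiving. This is only a sketch and not part of the recipe: the helper name mobile_url and the example article URLs are made up for illustration, and it uses Python 3's urllib.parse.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def mobile_url(url):
    """Return url with its 'service' query parameter set to 'mobile'.

    Hypothetical helper: works whether or not the URL already has a
    query string, and replaces any existing 'service' value.
    """
    parts = urlsplit(url)
    # Keep every existing parameter except a previous 'service' value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'service']
    query.append(('service', 'mobile'))
    return urlunsplit(parts._replace(query=urlencode(query)))

# Example article URLs below are invented for illustration only.
print(mobile_url('http://www.theglobeandmail.com/news/example-story/?cmpid=rss1'))
print(mobile_url('http://www.theglobeandmail.com/news/example-story/'))
```

Inside the recipe, print_version could then simply return mobile_url(url), provided the helper is defined in the same file.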

-Szing