Old 03-24-2012, 10:15 AM   #1
scissors
Addict
 
Posts: 204
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
shortlist.com recipe update

24/3/12

Uses soup to get the correct cover.
Oldest article set to 7 days as the site updates weekly.
Max articles per feed set to 10.
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1324663493(BasicNewsRecipe):
    title          = u'Shortlist'
    description    = 'Articles from Shortlist.com using feed43.'
    # I've set oldest article to 7 days as the website updates weekly
    oldest_article = 7
    max_articles_per_feed = 10
    remove_empty_feeds = True
    remove_javascript     = True
    no_stylesheets = True
    __author__ = 'Dave Asbury'
    # last updated 24/03/12
    language = 'en_GB'
    def get_cover_url(self):
        soup = self.index_to_soup('http://www.newsstand.co.uk/512-Weekly-Mens-Magazines/13810-Subscribe-to-SHORTLIST-Magazine-Subscription.aspx')
        # The cover image on the subscription page carries a fixed ASP.NET control id
        cov = soup.find(attrs={'id': 'ContentPlaceHolder1_ctl00_imgCoverShot'})
        # The src attribute is site-relative, so prefix the host
        cover_url = 'http://www.newsstand.co.uk' + cov['src']
        return cover_url

    masthead_url = 'http://www.mediauk.com/logos/100/344096.png'

    #auto_cleanup_keep = '//*[@class="hero-image"]'
    #auto_cleanup_keep = '//*[@class="article "]'

    #auto_cleanup = True
    preprocess_regexps = [
        (re.compile(r'…or.*?email to your friends</a>.', re.IGNORECASE | re.DOTALL), lambda match: '')]

    keep_only_tags = [
        dict(name='h1'),
        dict(name='h2', attrs={'class': 'title'}),
        dict(name='h3', attrs={'class': 'subheading'}),
        dict(attrs={'class': ['hero-static', 'stand-first']}),
        dict(attrs={'class': 'hero-image'}),
        dict(name='div', attrs={'id': ['list', 'article', 'article alternate']}),
        dict(name='div', attrs={'class': 'stand-first'}),
        #dict(name='p')
    ]
    remove_tags = [
        dict(name='h2', attrs={'class': 'graphic-header'}),
        dict(attrs={'id': ['share', 'twitter', 'facebook', 'digg', 'delicious', 'facebook-like']}),
        dict(attrs={'class': ['related-content', 'related-content-item', 'related-content horizontal', 'more']}),
    ]

    remove_tags_after = [dict(name='p', attrs={'id': 'tags'})]

    feeds = [
        (u'This Week\'s Issue', u'http://feed43.com/0323588208751786.xml'),
        (u'Instant Improver', u'http://feed43.com/1236541026275417.xml'),
        (u'Cool Stuff', u'http://feed43.com/6253845228768456.xml'),
        (u'Style', u'http://feed43.com/7217107577215678.xml'),
        (u'Films', u'http://feed43.com/3101308515277265.xml'),
        (u'Music', u'http://feed43.com/2416400550560162.xml'),
        (u'TV', u'http://feed43.com/4781172470717123.xml'),
        (u'Sport', u'http://feed43.com/5303151885853308.xml'),
        (u'Gaming', u'http://feed43.com/8883764600355347.xml'),
        (u'Women', u'http://feed43.com/2648221746514241.xml'),
        #(u'Articles', u'http://feed43.com/3428534448355545.xml')
    ]
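The cover-lookup trick in get_cover_url() boils down to finding one img element by its id and prefixing the site host onto its relative src. A standalone sketch of that idea, using only the standard library's html.parser instead of calibre's index_to_soup (which wraps BeautifulSoup) — the element id and host are taken from the recipe, but the sample HTML below is made up for illustration:

```python
# Sketch only: stdlib stand-in for the BeautifulSoup lookup in the recipe.
from html.parser import HTMLParser

class CoverFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # The recipe locates the cover by this fixed ASP.NET control id
        if tag == 'img' and a.get('id') == 'ContentPlaceHolder1_ctl00_imgCoverShot':
            self.src = a.get('src')

def cover_url_from(html):
    p = CoverFinder()
    p.feed(html)
    # src is site-relative, so prefix the host, as get_cover_url() does
    return 'http://www.newsstand.co.uk' + p.src if p.src else None

# Made-up sample markup standing in for the subscription page
sample = '<img id="ContentPlaceHolder1_ctl00_imgCoverShot" src="/covers/shortlist.jpg"/>'
print(cover_url_from(sample))  # http://www.newsstand.co.uk/covers/shortlist.jpg
```

In the real recipe, index_to_soup handles the download and parsing, so the lookup stays a one-liner; the point here is just that the id-based find is what makes the cover fetch reliable when the page layout shifts.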

Last edited by scissors; 03-24-2012 at 11:27 AM. Reason: Max articles per feed set back to 10 reduce size of file