Old 09-29-2011, 08:51 AM   #1
zephram
Device: Kindle
Fixed Sydney Morning Herald Recipe

Hi,
The built-in Sydney Morning Herald recipe had a minor but annoying bug: it inserted the text of the "video feedback" form into every article that has an embedded video on the website. I added the following line to the remove_tags list that comes after keep_only_tags, and that fixed the problem:

dict(attrs={'id':'video-player-content'}),
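
For anyone applying the same change to their own copy, here is a minimal sketch of where the entry goes. It uses a plain stand-in class (RecipeSketch is a hypothetical name, not part of calibre) so it runs without calibre installed; in the real recipe the list lives on the BasicNewsRecipe subclass. Keep in mind that Python keeps only the last assignment to a class attribute, so if a recipe assigns remove_tags twice, only the later list takes effect.

```python
# Stand-in for the recipe class (hypothetical name); the real recipe
# subclasses calibre's BasicNewsRecipe instead of object.
class RecipeSketch(object):
    keep_only_tags = [dict(name='div', attrs={'id': 'content'})]
    remove_tags = [
        dict(attrs={'class': 'hidden'}),
        dict(name=['link', 'meta', 'base', 'embed', 'object', 'iframe']),
        # New entry: matches the element with id="video-player-content"
        dict(attrs={'id': 'video-player-content'}),
    ]


# Quick check that the new rule really is part of the effective list.
video_rule = dict(attrs={'id': 'video-player-content'})
print(video_rule in RecipeSketch.remove_tags)
```

The entry has no name key, so it is not tied to a particular tag type; it only specifies the id.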

Here's the completed recipe, which now produces much cleaner articles.
Code:
__license__   = 'GPL v3'
__copyright__ = '2010-2011, Darko Miletic <darko.miletic at gmail.com>'
'''
smh.com.au
'''
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class Smh_au(BasicNewsRecipe):
    title                 = 'The Sydney Morning Herald - Printed edition'
    __author__            = 'Darko Miletic'
    description           = 'Breaking news from Sydney, Australia and the world. Features the latest business, sport, entertainment, travel, lifestyle, and technology news.'
    publisher             = 'Fairfax Digital'
    category              = 'news, politics, Australia, Sydney'
    oldest_article        = 2
    max_articles_per_feed = 200
    no_stylesheets        = True
    encoding              = 'utf-8'
    use_embedded_content  = False
    language              = 'en_AU'
    remove_empty_feeds    = True
    masthead_url          = 'http://images.smh.com.au/2010/02/02/1087188/smh-620.jpg'
    publication_type      = 'newspaper'
    extra_css             = """ 
                                h1{font-family: Georgia,"Times New Roman",Times,serif } 
                                body{font-family: Arial,Helvetica,sans-serif} 
                                .cT-imageLandscape,.cT-imagePortrait{font-size: x-small} 
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }

    remove_tags_after = [dict(name='div',attrs={'class':'articleBody'})]
    keep_only_tags    = [dict(name='div',attrs={'id':'content'})]
    # One remove_tags list only: a second assignment would replace the first.
    remove_tags       = [
                          dict(name='div', attrs={'id':['googleAds','moreGoogleAds','comments']}),
                          dict(attrs={'class':'hidden'}),
                          dict(name=['link','meta','base','embed','object','iframe']),
                          # Drops the "video feedback" form on articles with an embedded video
                          dict(attrs={'id':'video-player-content'}),
                        ]
    remove_attributes = ['width','height','lang']

    def parse_index(self):
        articles = []
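        # Fetch the raw page, then parse it with the recipe's encoding
        rawc = self.index_to_soup('http://www.smh.com.au/todays-paper', raw=True)
        soup = BeautifulSoup(rawc,fromEncoding=self.encoding)
        # Use the image whose URL ends in frontpage.jpg as the cover
        for itimg in soup.findAll('img',src=True):
            if itimg['src'].endswith('frontpage.jpg'):
                self.cover_url = itimg['src']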

        for item in soup.findAll(attrs={'class':'cN-storyHeadlineLead cfix'}):
            description = ''
            title_prefix = ''
            feed_link = item.find('a',href=True)
            descript = item.find('p')
            if descript:
                description = self.tag_to_string(descript)
            if feed_link:
                url   = feed_link['href']
                title = title_prefix + self.tag_to_string(feed_link)
                date  = strftime(self.timefmt)
                articles.append({
                                  'title'      :title
                                 ,'date'       :date
                                 ,'url'        :url
                                 ,'description':description
                                })
        return [(self.tag_to_string(soup.find('title')), articles)]

    def preprocess_html(self, soup):
        # Strip inline styles so extra_css controls the look
        for item in soup.findAll(style=True):
            del item['style']
        for item in soup.findAll('bod'):
            item.name = 'div'
        # Give every image an alt attribute
        for item in soup.findAll('img'):
            if not item.has_key('alt'):
                item['alt'] = 'image'
        return soup
Tags
fix, recipe, smh

