MobileRead Forums > E-Book Software > Calibre > Recipes

Old 03-02-2016, 12:30 AM   #1
rty
Zealot
Posts: 108
Karma: 6066
Join Date: Apr 2010
Location: Singapore
Device: iPad Air, Kindle DXG, Kindle Paperwhite
Straits Times (Singapore)

The current recipe by Mr Darko Miletic for the Singapore Straits Times is no longer working. I made some changes here to make it work again.

I don't know why most of the images won't load. Is it because they are enclosed within <IMAGES> tags, or because they use "IMG SRCSET" instead of "IMG SRC"?

Spoiler:
Code:
__license__   = 'GPL v3'
__copyright__ = '2009, Darko Miletic <darko.miletic at gmail.com>'
'''
www.straitstimes.com
'''

import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class StraitsTimes(BasicNewsRecipe):
    title                  = 'The Straits Times'
    __author__             = 'Darko Miletic'
    description            = 'Singapore newspaper'
    oldest_article         = 2
    max_articles_per_feed  = 100
    no_stylesheets         = True
    use_embedded_content   = False
    encoding               = 'utf-8'
    publisher              = 'Singapore Press Holdings Ltd.'
    category               = 'news, politics, singapore, asia'
    language               = 'en_SG'

    conversion_options = {
                             'comments'  : description
                            ,'tags'      : category
                            ,'language'  : language
                            ,'publisher' : publisher
                         }

    preprocess_regexps = [
                           (re.compile(
                            r'<meta name="description" content="[^"]+"\s*/?>',
                            re.IGNORECASE|re.DOTALL),
                            lambda m:''),
                           (re.compile(r'<!--.+?-->', re.IGNORECASE|re.DOTALL),
                               lambda m: ''),
                         ]
    remove_tags = [
        dict(name=['object','link','map', 'style']),
        dict(attrs={'class':'dropdown-menu'}),
    ]

    keep_only_tags = [
                        dict(name='h1', attrs={'class':'headline node-title'}),
                        dict(name='div', attrs={'class':'media-group'}),
                        dict(name='div', attrs={'itemprop':'articleBody'})
                    ]
    remove_tags_after=dict(name='div',attrs={'class':'story-keywords hidden-print '})

    feeds = [
               (u'Singapore'       , u'http://www.straitstimes.com/news/singapore/rss.xml' )
              ,(u'Asia'            , u'http://www.straitstimes.com/news/asia/rss.xml'       )
              ,(u'Business'        , u'http://www.straitstimes.com/news/business/rss.xml'     )
              ,(u'Sport'           , u'http://www.straitstimes.com/news/sport/rss.xml'     )
              ,(u'World'           , u'http://www.straitstimes.com/news/world/rss.xml'     )
              ,(u'Lifestyle'       , u'http://www.straitstimes.com/news/lifestyle/rss.xml' )
            ]

    def preprocess_html(self, soup):
        for a in soup.findAll('a', attrs={'class':'thumb'}):
            img = a.find('img')
            if img is not None:
                img['src'] = a['href']
        return soup
Old 03-02-2016, 06:23 AM   #2
kovidgoyal
creator of calibre
Posts: 45,349
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Many sites nowadays use dynamic img loading. Usually the correct img url is stored in some data- attribute and is only copied into src by a script after the page has loaded. You can emulate that in preprocess_html by copying the url from the data- attribute to the src attribute.
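
The approach described above can be sketched roughly as follows. This is only a sketch: the attribute names in LAZY_ATTRS are guesses at common lazy-loading conventions and are not confirmed for straitstimes.com, and pick_lazy_url / fix_lazy_images are hypothetical helper names. Check the page source to see which attribute actually holds the image url.

```python
# Candidate attribute names are assumptions; inspect the page source
# to find the one the site really uses.
LAZY_ATTRS = ('data-src', 'data-original', 'data-lazy-src')


def pick_lazy_url(attrs, candidates=LAZY_ATTRS):
    """Return the first non-empty candidate attribute value, or None.

    `attrs` is anything with a dict-style .get(), such as a
    BeautifulSoup tag or a plain dict.
    """
    for name in candidates:
        value = attrs.get(name)
        if value:
            return value.strip()
    return None


def fix_lazy_images(soup):
    """Copy the lazy-load url into src for every <img> in the tree."""
    for img in soup.findAll('img'):
        url = pick_lazy_url(img)
        if url:
            img['src'] = url
    return soup
```

Inside a recipe, preprocess_html(self, soup) could then simply return fix_lazy_images(soup).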
Old 01-29-2017, 09:28 PM   #3
edwardwong
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2017
Device: Kindle Paperwhite 2016
Hi,

I just updated the recipe to include RSS feeds for the print edition as well as the images fix. Just replace the portions below.

Some articles show up blank (header but no body), e.g. those in the "Opinion" section. Still trying to figure that one out.


Code:
    feeds = [
               (u'Top of the News' , u'http://www.straitstimes.com/print/top-of-the-news/rss.xml' )
              ,(u'World'           , u'http://www.straitstimes.com/print/world/rss.xml'       )
              ,(u'Home'            , u'http://www.straitstimes.com/print/home/rss.xml'     )
              ,(u'Business'        , u'http://www.straitstimes.com/print/business/rss.xml'     )
              ,(u'Life'            , u'http://www.straitstimes.com/print/life/rss.xml'     )
              ,(u'Science'         , u'http://www.straitstimes.com/print/science/rss.xml' )
              ,(u'Digital'         , u'http://www.straitstimes.com/print/digital/rss.xml'     )
              ,(u'Insight'         , u'http://www.straitstimes.com/print/insight/rss.xml'     )
              ,(u'Opinion'         , u'http://www.straitstimes.com/print/opinion/rss.xml'     )
              ,(u'Forum'           , u'http://www.straitstimes.com/print/forum/rss.xml' )
              ,(u'Big Picture'     , u'http://www.straitstimes.com/print/big-picture/rss.xml' )
              ,(u'Community'       , u'http://www.straitstimes.com/print/community/rss.xml' )
              ,(u'Education'       , u'http://www.straitstimes.com/print/education/rss.xml' )
    ]

    def preprocess_html(self, soup):
        for img in soup.findAll('img', srcset=True):
            img['src'] = img['srcset'].partition(' ')[0]
            img['srcset'] = ''
        return soup
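
A note on the srcset handling above: partition(' ')[0] keeps the first candidate in the srcset list, which is often the smallest image. If a larger one is wanted, a variation along these lines might work. It is a sketch under assumptions: best_srcset_url is a hypothetical helper, and the naive split on commas will break on urls that themselves contain commas.

```python
def best_srcset_url(srcset):
    """Return the url with the largest width (w) or density (x) descriptor.

    Returns None for an empty srcset. A candidate without a descriptor
    is treated as 1x.
    """
    best_url, best_size = None, -1.0
    for candidate in srcset.split(','):
        parts = candidate.split()
        if not parts:
            continue  # skip empty entries such as a trailing comma
        url = parts[0]
        size = 1.0
        if len(parts) > 1:
            try:
                # Drop the trailing 'w' or 'x' and parse the number.
                size = float(parts[1][:-1])
            except ValueError:
                pass  # unparseable descriptor: fall back to 1.0
        if size > best_size:
            best_url, best_size = url, size
    return best_url
```

In preprocess_html the assignment would then read img['src'] = best_srcset_url(img['srcset']), guarded by a check that the result is not None.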
Old 02-10-2018, 12:45 AM   #4
mrshister
Junior Member
Posts: 1
Karma: 90
Join Date: Feb 2018
Device: Kindle Paperwhite 2016
I happened to work on this separately, and here's code that worked for me.

Code:
__license__ = 'GPL v3'
__copyright__ = '2017, mrshister'
'''
www.straitstimes.com
'''

import re
from calibre.web.feeds.recipes import BasicNewsRecipe


class StraitsTimes(BasicNewsRecipe):
    title = 'The Straits Times'
    __author__ = 'mrshister'
    description = 'Singapore newspaper'
    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'utf-8'
    publisher = 'Singapore Press Holdings Ltd.'
    category = 'news, politics, singapore, asia'
    language = 'en_SG'

    conversion_options = {
        'comments': description, 'tags': category, 'language': language, 'publisher': publisher
    }

    preprocess_regexps = [
        (re.compile(
            r'<meta name="description" content="[^"]+"\s*/?>',
            re.IGNORECASE | re.DOTALL),
         lambda m:''),
        (re.compile(r'<!--.+?-->', re.IGNORECASE | re.DOTALL),
         lambda m: ''),
    ]
  
    # Raw strings so that \s reaches the regex engine unchanged.
    headline_reg_exp = r'headline node-title'        # Headline
    img_reg_exp = r'media-entity'                    # Main Image
    body_reg_exp = r'odd\sfield-item'                # Article Body
    subheadline_reg_exp = r'node-subheadline'        # Sub-headline
    related_reg_exp = r'^.*related_story.*$'         # Related Stories

    keep_only_tags = [
               dict(name='h1', attrs={'class': re.compile(headline_reg_exp, re.IGNORECASE)}) 
               ,dict(name='figure', attrs={'itemprop': re.compile(img_reg_exp, re.IGNORECASE)})
               ,dict(name='div', attrs={'class': 'story-postdate'})    # Publish time
               ,dict(name='h2', attrs={'class': re.compile(subheadline_reg_exp, re.IGNORECASE)})
               ,dict(name='div', attrs={'class': re.compile(body_reg_exp, re.IGNORECASE)})    # Article Body
    
    ]
    
    remove_tags = [
               dict(name='div', attrs={'class': re.compile(related_reg_exp, re.IGNORECASE)})
    ]
    
    remove_tags_after = dict(name='div', attrs={'class': 'hr_thin'})
    

    feeds = [
              (u'Top of the News' , u'http://www.straitstimes.com/print/top-of-the-news/rss.xml')
              ,(u'World'           , u'http://www.straitstimes.com/print/world/rss.xml')
              ,(u'Home'            , u'http://www.straitstimes.com/print/home/rss.xml')
              ,(u'Business'        , u'http://www.straitstimes.com/print/business/rss.xml')
              ,(u'Life'            , u'http://www.straitstimes.com/print/life/rss.xml')
              ,(u'Science'         , u'http://www.straitstimes.com/print/science/rss.xml')
              ,(u'Digital'         , u'http://www.straitstimes.com/print/digital/rss.xml')
              ,(u'Insight'         , u'http://www.straitstimes.com/print/insight/rss.xml')
              ,(u'Opinion'         , u'http://www.straitstimes.com/print/opinion/rss.xml')
              ,(u'Forum'           , u'http://www.straitstimes.com/print/forum/rss.xml')
              ,(u'Big Picture'     , u'http://www.straitstimes.com/print/big-picture/rss.xml')
              ,(u'Community'       , u'http://www.straitstimes.com/print/community/rss.xml')
              ,(u'Education'       , u'http://www.straitstimes.com/print/education/rss.xml')
    ]

    def preprocess_html(self, soup):
        for img in soup.findAll('img', srcset=True):
            img['src'] = img['srcset'].partition(' ')[0]
            img['srcset'] = ''
        return soup

Last edited by mrshister; 02-10-2018 at 12:55 AM.
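
For anyone adapting the patterns in the recipe above, here is a small sketch of how they behave, assuming the tag matcher applies the compiled regex with search() to the class string (which is how BeautifulSoup treats regex attribute values). The class strings are invented for illustration, and class_matches is a hypothetical helper, not part of calibre.

```python
import re


def class_matches(pattern, class_value):
    """True if pattern is found anywhere in class_value, ignoring case."""
    return re.search(pattern, class_value, re.IGNORECASE) is not None


# Hypothetical class attribute values, invented for illustration only.
related = 'group-related_story clearfix'

# Under search(), the ^.* and .*$ wrappers add nothing for a
# single-line value: both patterns match the same string.
print(class_matches(r'^.*related_story.*$', related))  # True
print(class_matches(r'related_story', related))        # True

# The body pattern is order-sensitive: 'odd' must come directly
# before 'field-item', separated by one whitespace character.
print(class_matches(r'odd\sfield-item', 'odd field-item'))  # True
print(class_matches(r'odd\sfield-item', 'field-item odd'))  # False
```

If articles come out blank, a pattern that no longer matches the site's class names is one thing worth checking.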