#1
Zealot
Posts: 108
Karma: 6066
Join Date: Apr 2010
Location: Singapore
Device: iPad Air, Kindle DXG, Kindle Paperwhite
Straits Times (Singapore)
The current recipe by Mr Darko Miletic for the Singapore Straits Times is no longer working. I made some changes here to make it work again.
I don't know why most of the images won't load. Is it because they are enclosed within <IMAGES> tags, or because they use "IMG SRCSET" instead of "IMG SRC"?
#2
creator of calibre
Posts: 45,609
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Many sites nowadays load images dynamically. Usually the correct image URL is stored in some data- attribute and is copied into src by a script after the page has loaded. You can emulate that in preprocess_html by copying the URL from the data- attribute to the src attribute.
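A minimal sketch of that idea, not a tested recipe fix. The attribute names in LAZY_ATTRS are guesses at common lazy-loading conventions and are not taken from the Straits Times markup, and promote_lazy_src is a hypothetical helper name. It only needs an object that supports .get() and item assignment, so it can be tried on a plain dict before being used on the img tags in a recipe.

```python
# Assumption: these are common lazy-loading attribute names. Check the page
# source to see which one (if any) the site really uses.
LAZY_ATTRS = ('data-src', 'data-original', 'data-lazy-src')


def promote_lazy_src(tag, candidates=LAZY_ATTRS):
    """Copy the first non-empty lazy-loading attribute into 'src'.

    `tag` is anything that supports .get(name) and tag[name] = value,
    for example a dict of attributes or a soup <img> tag.
    Returns True if 'src' was overwritten, False otherwise.
    """
    for name in candidates:
        value = tag.get(name)
        if value:
            tag['src'] = value
            return True
    return False


# Inside a recipe this would be called from preprocess_html, roughly:
#
#     def preprocess_html(self, soup):
#         for img in soup.findAll('img'):
#             promote_lazy_src(img)
#         return soup
```

If the site keeps the URL in srcset rather than a data- attribute, the same hook applies, only the attribute to read from changes.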
#3
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2017
Device: Kindle Paperwhite 2016
Hi,
I just updated the recipe to include RSS feeds for the print edition as well as the image fix. Just replace the portions below. Some articles show up blank (header but no body), e.g. in the "Opinion" section. I'm still trying to figure that one out. Code:
feeds = [
    (u'Top of the News', u'http://www.straitstimes.com/print/top-of-the-news/rss.xml'),
    (u'World', u'http://www.straitstimes.com/print/world/rss.xml'),
    (u'Home', u'http://www.straitstimes.com/print/home/rss.xml'),
    (u'Business', u'http://www.straitstimes.com/print/business/rss.xml'),
    (u'Life', u'http://www.straitstimes.com/print/life/rss.xml'),
    (u'Science', u'http://www.straitstimes.com/print/science/rss.xml'),
    (u'Digital', u'http://www.straitstimes.com/print/digital/rss.xml'),
    (u'Insight', u'http://www.straitstimes.com/print/insight/rss.xml'),
    (u'Opinion', u'http://www.straitstimes.com/print/opinion/rss.xml'),
    (u'Forum', u'http://www.straitstimes.com/print/forum/rss.xml'),
    (u'Big Picture', u'http://www.straitstimes.com/print/big-picture/rss.xml'),
    (u'Community', u'http://www.straitstimes.com/print/community/rss.xml'),
    (u'Education', u'http://www.straitstimes.com/print/education/rss.xml'),
]

def preprocess_html(self, soup):
    # Images only carry a srcset, so copy its first URL into src.
    for img in soup.findAll('img', srcset=True):
        img['src'] = img['srcset'].partition(' ')[0]
        img['srcset'] = ''
    return soup
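The fix above takes the first URL in srcset. If that first entry turns out to be the smallest image, something like the sketch below could pick the largest one instead. It assumes the usual "URL descriptor, URL descriptor" layout with no commas inside the URLs, and widest_srcset_url is a hypothetical helper name, not part of calibre.

```python
def widest_srcset_url(srcset):
    """Return the URL with the largest 'w' or 'x' descriptor, or None.

    Assumes candidates are separated by commas and that the URLs
    themselves contain no commas.
    """
    best_url, best_size = None, -1.0
    for candidate in srcset.split(','):
        parts = candidate.split()
        if not parts:
            continue  # empty entry, e.g. from a trailing comma
        url = parts[0]
        size = 0.0  # a candidate without a descriptor sorts lowest
        if len(parts) > 1:
            try:
                size = float(parts[1][:-1])  # drop the trailing 'w' or 'x'
            except ValueError:
                size = 0.0
        if size > best_size:
            best_url, best_size = url, size
    return best_url
```

In preprocess_html the assignment would then read img['src'] = widest_srcset_url(img['srcset']) or img.get('src', ''), at the cost of downloading larger images.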
#4
Junior Member
Posts: 1
Karma: 90
Join Date: Feb 2018
Device: Kindle Paperwhite 2016
I happened to work on this separately, and here's code that worked for me.
Code:
__license__ = 'GPL v3'
__copyright__ = '2017, mrshister'
'''
www.straitstimes.com
'''
import re
from calibre.web.feeds.recipes import BasicNewsRecipe


class StraitsTimes(BasicNewsRecipe):
    title = 'The Straits Times'
    __author__ = 'mrshister'
    description = 'Singapore newspaper'
    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'utf-8'
    publisher = 'Singapore Press Holdings Ltd.'
    category = 'news, politics, singapore, asia'
    language = 'en_SG'
    conversion_options = {
        'comments': description, 'tags': category, 'language': language, 'publisher': publisher
    }

    # Strip the description meta tag and HTML comments before parsing.
    preprocess_regexps = [
        (re.compile(
            r'<meta name="description" content="[^"]+"\s*/?>',
            re.IGNORECASE | re.DOTALL),
         lambda m: ''),
        (re.compile(r'<!--.+?-->', re.IGNORECASE | re.DOTALL),
         lambda m: ''),
    ]

    headline_reg_exp = 'headline node-title'   # Headline
    img_reg_exp = 'media-entity'               # Main image
    body_reg_exp = r'odd\sfield-item'          # Article body (raw string for \s)
    subheadline_reg_exp = 'node-subheadline'   # Sub-headline
    related_reg_exp = '^.*related_story.*$'    # Related stories

    keep_only_tags = [
        dict(name='h1', attrs={'class': re.compile(headline_reg_exp, re.IGNORECASE)}),
        dict(name='figure', attrs={'itemprop': re.compile(img_reg_exp, re.IGNORECASE)}),
        dict(name='div', attrs={'class': 'story-postdate'}),  # Publish time
        dict(name='h2', attrs={'class': re.compile(subheadline_reg_exp, re.IGNORECASE)}),
        dict(name='div', attrs={'class': re.compile(body_reg_exp, re.IGNORECASE)}),  # Article body
    ]

    remove_tags = [
        dict(name='div', attrs={'class': re.compile(related_reg_exp, re.IGNORECASE)})
    ]

    remove_tags_after = dict(name='div', attrs={'class': 'hr_thin'})

    feeds = [
        (u'Top of the News', u'http://www.straitstimes.com/print/top-of-the-news/rss.xml'),
        (u'World', u'http://www.straitstimes.com/print/world/rss.xml'),
        (u'Home', u'http://www.straitstimes.com/print/home/rss.xml'),
        (u'Business', u'http://www.straitstimes.com/print/business/rss.xml'),
        (u'Life', u'http://www.straitstimes.com/print/life/rss.xml'),
        (u'Science', u'http://www.straitstimes.com/print/science/rss.xml'),
        (u'Digital', u'http://www.straitstimes.com/print/digital/rss.xml'),
        (u'Insight', u'http://www.straitstimes.com/print/insight/rss.xml'),
        (u'Opinion', u'http://www.straitstimes.com/print/opinion/rss.xml'),
        (u'Forum', u'http://www.straitstimes.com/print/forum/rss.xml'),
        (u'Big Picture', u'http://www.straitstimes.com/print/big-picture/rss.xml'),
        (u'Community', u'http://www.straitstimes.com/print/community/rss.xml'),
        (u'Education', u'http://www.straitstimes.com/print/education/rss.xml'),
    ]

    def preprocess_html(self, soup):
        # Images only carry a srcset, so copy its first URL into src.
        for img in soup.findAll('img', srcset=True):
            img['src'] = img['srcset'].partition(' ')[0]
            img['srcset'] = ''
        return soup
Last edited by mrshister; 02-10-2018 at 01:55 AM.