09-29-2013, 07:34 AM   #1
RayV
Device: Kobo Mini
Does Calibre recognise <image> tags?

I'm downloading articles from the Telegraph UK RSS feed http://www.telegraph.co.uk/news/worldnews/rss with the built-in recipe.

The images inside <div id="mainBodyArea" ..> on the web page, which are referenced by <image ..> tags, are not being saved in the Calibre-generated epub.

Example from the web page:

<image refid="3783387" version="c" width="460" height="287" caption="" declared-caption="" src="http://may-be-another-web-site.com/multimedia/archive/03783/picture.jpg" photographer="" name=""></image>

is saved as:

<image src="http://may-be-another-web-site.com/multimedia/archive/03783/picture.jpg" version="c" caption="" photographer="" height="287" width="460" declared-caption="" refid="3783387" name=""/>

in the Calibre epub.



I modified the recipe using Re-usable code 'sticky' #21, "Embed images into an ebook" by kiavash, to change the <image> tags to <img>, and it worked: all images are now embedded in the Calibre epub.
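
For anyone who wants to try the same idea outside Calibre, here is a minimal sketch using only the Python standard library. It is not what the recipe does: the recipe renames tags on the parsed soup and only touches tags whose src contains 'jpg', whereas this rewrites every <image> tag in a raw HTML string. The function name image_to_img and both regular expressions are illustrative assumptions, and the sketch assumes attribute values never contain '>'.

```python
import re

# Opening <image ...> or self-closing <image .../>.
# The \b stops the pattern from matching other tag names such as <imagemap>.
_OPEN_IMAGE = re.compile(r'<image\b([^>]*?)\s*/?>', re.IGNORECASE)
# Closing </image> tags, which become redundant once the opener is self-closing.
_CLOSE_IMAGE = re.compile(r'</image\s*>', re.IGNORECASE)


def image_to_img(html):
    """Rewrite <image ...></image> elements as self-closing <img .../> tags."""
    # A function is used as the replacement so attribute text is copied verbatim.
    html = _OPEN_IMAGE.sub(lambda m: '<img' + m.group(1) + '/>', html)
    return _CLOSE_IMAGE.sub('', html)


# Shortened version of the example tag from the web page.
sample = ('<image refid="3783387" width="460" height="287" '
          'src="http://may-be-another-web-site.com/multimedia/archive/03783/picture.jpg">'
          '</image>')
print(image_to_img(sample))
```

The attributes are left exactly as they were; only the tag name changes, which is all the recipe modification does as well.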


So, is the problem that Calibre doesn't recognise <image> tags?


Modified recipe:

Spoiler:

__license__ = 'GPL v3'
__copyright__ = '2008-2010, Darko Miletic <darko.miletic at gmail.com>'
'''
telegraph.co.uk
'''

from calibre.web.feeds.news import BasicNewsRecipe


class TelegraphUK(BasicNewsRecipe):
    title = 'Telegraph World News-4'
    __author__ = 'Darko Miletic and Sujata Raman'
    description = 'News from United Kingdom'
    oldest_article = 1
    category = 'news, politics, UK'
    publisher = 'Telegraph Media Group ltd.'
    max_articles_per_feed = 12
    no_stylesheets = True
    language = 'en_GB'
    remove_empty_feeds = True
    use_embedded_content = False

    extra_css = '''
    h1{font-family :Arial,Helvetica,sans-serif; font-size:1.2 em; }
    h2{font-family :Times; font-size:1 em; font-style: italic; color:#444444;}
    .story{font-family :Arial,Helvetica,sans-serif; font-size: .6 em;}
    .byline{color:#666666; font-family :Arial,Helvetica,sans-serif; font-size: .6 em; font-style: italic}
    #a{color:#234B7B; }
    .imageExtras{color:#666666; font-family :Arial,Helvetica,sans-serif; font-size: .6 em;}
    .caption {font-family :Times; font-size: .7 em; font-style: italic}
    sup {font-family :Times; font-size: .7 em; font-style: italic}
    '''

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
    }

    keep_only_tags = [
        dict(name='div', attrs={'class': ['storyHead', 'byline']}),
        dict(name='div', attrs={'id': 'mainBodyArea'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class': ['related_links_inline', 'imgindex', 'next', 'prev', 'gutterUnder', 'ssImgHide', 'imageExtras', 'ssImg hide', 'related_links_video']}),
        dict(name='ul', attrs={'class': ['shareThis shareBottom']}),
        dict(name='span', attrs={'class': ['num', 'placeComment', 'credit']}),
    ]

    feeds = [
        (u'World News', u'http://www.telegraph.co.uk/news/worldnews/rss'),
    ]

    # Ref: https://www.mobileread.com/forums/sho...0&postcount=21
    def preprocess_html(self, soup):
        # Include all the figures in the final ebook:
        # find every <image> tag whose src contains 'jpg' ...
        for figure in soup.findAll('image', attrs={'src': lambda x: x and 'jpg' in x}):
            figure.name = 'img'  # ... and rename it to <img> so Calibre downloads it
        return soup

    def populate_article_metadata(self, article, soup, first):
        # Use the first image of the article as its table-of-contents thumbnail
        if first and hasattr(self, 'add_toc_thumbnail'):
            picdiv = soup.find('img')
            if picdiv is not None:
                self.add_toc_thumbnail(article, picdiv['src'])

    def get_article_url(self, article):
        url = article.get('link', None)
        # Skip picture galleries; the 'url and' guard avoids a TypeError when there is no link
        if url and ('picture-galleries' in url or 'pictures' in url or 'picturegalleries' in url):
            url = None
        return url
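
The gallery filter in get_article_url can be tried on its own. This is a minimal standalone sketch, assuming the feed entry behaves like a dict with an optional 'link' key; the function name filter_article_url, the GALLERY_MARKERS tuple and the example URLs are hypothetical and are not part of Calibre.

```python
# Substrings that mark a picture-gallery link in the recipe above.
GALLERY_MARKERS = ('picture-galleries', 'pictures', 'picturegalleries')


def filter_article_url(article):
    """Return the article link, or None for missing links and picture galleries."""
    url = article.get('link')
    # A missing or empty link is treated the same as a gallery: skip it.
    if not url or any(marker in url for marker in GALLERY_MARKERS):
        return None
    return url


# Hypothetical entries, for illustration only.
entries = [
    {'link': 'http://www.telegraph.co.uk/news/worldnews/example-story.html'},
    {'link': 'http://www.telegraph.co.uk/news/picturegalleries/example.html'},
    {},
]
for entry in entries:
    print(filter_article_url(entry))
```

Returning None from get_article_url is what tells the recipe to skip that article.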