MobileRead Forums - View Single Post

RayV · 09-29-2013, 08:34 AM

Does Calibre recognise <image> tags?

I'm downloading articles from the Telegraph UK RSS feed http://www.telegraph.co.uk/news/worldnews/rss with the built-in recipe.

The images inside <div id="mainBodyArea" ..> on the web page, which are referenced by <image ..> tags, are not being saved in the Calibre-generated epub.

Example from the web page:

<image refid="3783387" version="c" width="460" height="287" caption="" declared-caption="" src="http://may-be-another-web-site.com/multimedia/archive/03783/picture.jpg" photographer="" name=""></image>

is saved as:

<image src="http://may-be-another-web-site.com/multimedia/archive/03783/picture.jpg" version="c" caption="" photographer="" height="287" width="460" declared-caption="" refid="3783387" name=""/>

in the Calibre epub.

I modified the recipe using the re-usable code 'sticky' post #21, "Embed images into an ebook" by kiavash, to change the <image> tags to <img>, and it worked: all images are now embedded in the Calibre epub.
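
For anyone who wants to try the tag rename outside Calibre, here is a minimal standalone sketch of the same idea. The recipe itself does the rename on the parsed soup inside preprocess_html; the regex approach and the name image_to_img below are illustrative assumptions, not part of Calibre, and they only handle the simple tag shapes shown in the example above.

```python
import re

# Matches an opening <image ...> tag whose src attribute contains "jpg",
# plus an immediately following </image> close tag if there is one.
# Assumption: attributes are double-quoted, as in the Telegraph example.
IMAGE_TAG = re.compile(
    r'<image\b([^>]*?\bsrc="[^"]*jpg[^"]*"[^>]*?)\s*/?>(?:\s*</image>)?',
    re.IGNORECASE,
)


def image_to_img(html: str) -> str:
    """Rewrite <image ... src="...jpg"> tags as self-closing <img .../> tags.

    Tags whose src does not mention "jpg" are left untouched, mirroring the
    'jpg' in src filter used in the recipe.
    """
    return IMAGE_TAG.sub(lambda m: "<img" + m.group(1).rstrip() + "/>", html)


if __name__ == "__main__":
    sample = (
        '<image refid="3783387" width="460" height="287" '
        'src="http://may-be-another-web-site.com/multimedia/archive/03783/picture.jpg" '
        'name=""></image>'
    )
    print(image_to_img(sample))
```

A regex is fine for a quick check like this, but inside a recipe the soup-based rename is the safer choice because it does not depend on attribute order or quoting.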

So, is the problem that Calibre doesn't recognise <image> tags?

Modified recipe:

Spoiler:

__license__ = 'GPL v3'
__copyright__ = '2008-2010, Darko Miletic <darko.miletic at gmail.com>'
'''
telegraph.co.uk
'''

from calibre.web.feeds.news import BasicNewsRecipe


class TelegraphUK(BasicNewsRecipe):
    title = 'Telegraph World News-4'
    __author__ = 'Darko Miletic and Sujata Raman'
    description = 'News from United Kingdom'
    oldest_article = 1
    category = 'news, politics, UK'
    publisher = 'Telegraph Media Group ltd.'
    max_articles_per_feed = 12
    no_stylesheets = True
    language = 'en_GB'
    remove_empty_feeds = True
    use_embedded_content = False

    extra_css = '''
        h1{font-family :Arial,Helvetica,sans-serif; font-size:1.2 em; }
        h2{font-family :Times; font-size:1 em; font-style: italic; color:#444444;}
        .story{font-family :Arial,Helvetica,sans-serif; font-size: .6 em;}
        .byline{color:#666666; font-family :Arial,Helvetica,sans-serif; font-size: .6 em; font-style: italic}
        #a{color:#234B7B; }
        .imageExtras{color:#666666; font-family :Arial,Helvetica,sans-serif; font-size: .6 em;}
        .caption {font-family :Times; font-size: .7 em; font-style: italic}
        sup {font-family :Times; font-size: .7 em; font-style: italic}
    '''

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
    }

    keep_only_tags = [
        dict(name='div', attrs={'class': ['storyHead', 'byline']}),
        dict(name='div', attrs={'id': 'mainBodyArea'}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class': ['related_links_inline', "imgindex", "next", "prev", "gutterUnder", 'ssImgHide', 'imageExtras', 'ssImg hide', 'related_links_video']}),
        dict(name='ul', attrs={'class': ['shareThis shareBottom']}),
        dict(name='span', attrs={'class': ['num', 'placeComment', 'credit']}),
    ]

    feeds = [
        (u'World News', u'http://www.telegraph.co.uk/news/worldnews/rss'),
    ]

    # Ref: https://www.mobileread.com/forums/sho...0&postcount=21
    def preprocess_html(self, soup):
        # Includes all the figures inside the final ebook.
        # Finds every <image> tag whose src mentions 'jpg'.
        for figure in soup.findAll('image', attrs={'src': lambda x: x and 'jpg' in x}):
            figure.name = 'img'  # renames the <image> tag to <img>
        return soup

    def populate_article_metadata(self, article, soup, first):
        # Use the first image of the first page as the TOC thumbnail, if supported.
        if first and hasattr(self, 'add_toc_thumbnail'):
            picdiv = soup.find('img')
            if picdiv is not None:
                self.add_toc_thumbnail(article, picdiv['src'])

    def get_article_url(self, article):
        url = article.get('link', None)
        # Skip picture galleries; the 'url and' guard avoids a TypeError
        # when the feed entry has no link.
        if url and ('picture-galleries' in url or 'pictures' in url or 'picturegalleries' in url):
            url = None
        return url