#1 |
Member
Posts: 18
Karma: 10
Join Date: Mar 2014
Device: Kindle Paperwhite 1st Gen
|
Content missing in the final step of book creation
I'm trying to enhance the built-in Economic Times of India recipe but am running into some problems.
The recipe pulls in the mobile print version of each article using the RSS feeds. In these articles the main content is located in a <div class="storycontent"> tag. The heading, article summary, etc. appear properly in the final ebook, but the main content inside that tag is missing. I checked the ./debug/processed/feed_0/article_0/index.html file, and the tag along with its content was present there. So this means something is going wrong in the calibre converter.
A link to a sample article: http://m.economictimes.com/PDAET/art...w/38499011.cms
My recipe code
Code:
    __license__ = 'GPL v3'
    __copyright__ = '2008-2010, Darko Miletic <darko.miletic at gmail.com>'
    '''
    economictimes.indiatimes.com
    '''

    from calibre.web.feeds.news import BasicNewsRecipe


    class TheEconomicTimes(BasicNewsRecipe):
        title = 'The Economic Times India'
        __author__ = 'Darko Miletic'
        description = 'Financial news from India'
        publisher = 'economictimes.indiatimes.com'
        category = 'news, finances, politics, India'
        oldest_article = 2
        max_articles_per_feed = 100
        no_stylesheets = True
        use_embedded_content = False
        simultaneous_downloads = 1
        encoding = 'utf-8'
        language = 'en_IN'
        publication_type = 'newspaper'
        masthead_url = 'http://economictimes.indiatimes.com/photo/2676871.cms'
        extra_css = """
            body{font-family: Arial,Helvetica,sans-serif}
        """

        conversion_options = {
            'comment': description,
            'tags': category,
            'publisher': publisher,
            'language': language
        }

        #remove_tags_before = dict(name='h1')
        #remove_tags_after = dict(name='div', attrs={'class':'spacebw'})

        feeds = [(u'All articles', u'http://economictimes.indiatimes.com/rssfeedsdefault.cms')]

        #Uses the mobile print version. For web print version use 'http://economictimes.indiatimes.com/articleshow/<article_id>?prtpage=1'
        def print_version(self, url):
            rest, sep, article_id = url.rpartition('/articleshow/')
            return 'http://m.economictimes.com/PDAET/articleshow/' + article_id

        def get_article_url(self, article):
            rurl = article.get('guid', None)
            if (rurl.find('/quickieslist/') > 0) or (rurl.find('/quickiearticleshow/') > 0):
                return None
            return rurl

        def preprocess_html(self, soup):
            #for item in soup.findAll(style=True):
                #del item['style']
            return soup

        def postprocess_html(self, soup, first_fetch):
            return self.adeify_images(soup)

Content of ./debug/processed/feed_0/article_0/index.html
Code:
<?xml version='1.0' encoding='utf-8'?> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Last-Modified" content="16 Jul, 2237hrs IST"/> <title>First rate hike 'likely' early 2015, says Dallas Fed President Richard Fisher - The Economic Times on Mobile</title> <meta name="description" content="The Federal Reserve's policy-setting panel is 'likely' to start raising rates in early 2015, if not sooner, a top Fed official said on Wednesday."/> <meta name="keywords" content="US Federal reserve,US central bank,University of Southern California,Richard Fisher,Rate hike,President"/> <link xmlns="" rel="shortcut icon" href="http://m.economictimes.com/icons/etfavicon.ico"/> <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0; user-scalable=0;"/> <meta name="apple-mobile-web-app-capable" content="yes"/> <meta name="HandheldFriendly" content="true"/> <meta name="MobileOptimized" content="width"/> <config xmlns="http://www.w3.org/1999/xhtml" key="2147477890"/> <config/> <config xmlns="http://www.w3.org/1999/xhtml" datetimeformat="yyyy"/> <config datetimeformat="yyyy"> <link rel="canonical" href="http://economictimes.indiatimes.com/news/international/business/first-rate-hike-likely-early-2015-says-dallas-fed-president-richard-fisher/articleshow/38499011.cms"/> </config> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> <link href="../../stylesheet.css" rel="stylesheet" type="text/css"/> <link href="../../page_styles.css" rel="stylesheet" type="text/css"/> </head> <body class="calibre"><div class="calibrenavbar">| <a href="../article_1/index.html">Next</a> | <a href="../index.html#article_0">Section Menu</a> | <a href="../../index.html#feed_0">Main Menu</a> | <hr class="calibre6"/> </div><div class="calibre5"><a href="/rssfeeds/26519199.cms"><div class="calibre5"><img alt="ET MOBILE RSS" class="calibre2" src="images/img1.jpg"/><br class="calibre5"/></div></a><span>16 Jul, 2237hrs IST</span><a 
href="http://economictimes.indiatimes.com/">Full Site</a></div><div class="calibre5"><a href="/"><div class="calibre5"><img alt="ET MOBILE" src="images/img2.png" class="calibre2"/><br class="calibre5"/></div></a></div><div class="calibre5"><table width="98%" border="0" cellspacing="0" cellpadding="0" class="calibre7"><tr class="calibre8"><td class="bold" width="10%" valign="top">Sensex</td><td width="30%" class="bold">25549.72</td><td width="30%" class="bold"><span>**321.07**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td><td width="30%" class="bold"><span>**1.27% **<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td></tr><tr class="calibre8"><td class="bold" width="10%" valign="top">Nifty</td><td width="30%" class="bold">7624.40</td><td width="30%" class="bold"><span>**97.75**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td><td width="30%" class="bold"><span>**1.30% **<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td></tr></table><form action="/stockquotes.cms" method="get" name="stockfrm" class="calibre5"><div class="calibre5"><input onclick="quote_blank();" value="Get Quote" size="20" name="ticker" type="text"/>**<input name="B1" value="Go" type="submit"/><a title="Mobile Apps" href="/mobileapps.cms"><div class="calibre5"><img alt="Mobile Apps" src="images/img4.png" class="calibre2"/><br class="calibre5"/></div></a></div></form></div><hr class="calibre6"/><a href="/">Home</a> | <a href="/budget2014.cms">Budget 2014</a> | <a href="/market/1977021501.cms?exchange=n&exchangeid=50">Markets</a> | <a href="/industry/13352306.cms">Industry</a> | <a 
href="/articlelist/32897620.cms">ET Panache</a> | <a href="/summary.cms?idx=1">Portfolio</a> | <a href="/allsections.cms">All Sections</a> | <a href="http://epaper.timesofindia.com/index.asp">mPaper</a><hr xmlns="http://www.w3.org/1999/xhtml" class="calibre6"/><div xmlns="" style="width:100%;text-align:center;"></div><div xmlns="http://www.w3.org/1999/xhtml" class="calibre5"><div xmlns="http://www.w3.org/1999/xhtml" class="calibre5"><div class="calibre5"><img alt="" hspace="5" src="images/img6.png" class="calibre2"/><br class="calibre5"/></div><a href="/mail/38499011.cms">E-mail this</a></div><h2 xmlns="http://www.w3.org/1999/xhtml" class="calibre9">BUSINESS</h2></div><div class="calibre5"><config showseo="1" showslide="1" showrelatedarticle="1" datetimeformat="d mmm, yyyy, hhnn 'hrs IST'"><h1 class="calibre10">First rate hike 'likely' early 2015, says Dallas Fed President Richard Fisher</h1><div class="calibre5"><artdate>16 Jul, 2014, 2232 hrs IST</artdate>,*<artag>Reuters</artag></div><div class="calibre5"><div class="calibre5"><a href="/PDAET/quickiearticleshow/38499028.cms"><div class="calibre5"><img alt="" class="calibre2" src="images/img7.jpg"/><br class="calibre5"/></div></a></div><div class="calibre5">The Federal Reserve's policy-setting panel is 'likely' to start raising rates in early 2015, if not sooner, a top Fed official said on Wednesday.</div></div><div xmlns="" class="storycontent"> LOS ANGELES: The Federal Reserve's policy-setting panel is 'likely' to start raising rates in early 2015, if not sooner, a top Fed official said on Wednesday. <br/><br/> The prediction from Dallas Fed President Richard Fisher went beyond his prepared remarks to the University of Southern California, in which he said the Fed "may well" raise rates in early 2015. Futures traders currently expect a first rate rise in mid-2015. <br/><br/> The rate rises will likely come in "gradual increments," he said. 
<br/><br/> Fisher is a voting member of the US central bank's policy-setting committee this year. <meta content="cms.next" name="cmsei"/></div></config></div><br class="calibre5"/><div xmlns="" class="spacebw"><div id="ad36070" name="ad36070" align="center"></div></div><br xmlns="http://www.w3.org/1999/xhtml" class="calibre5"/><div id="mob_add" class="calibre5"></div><hr xmlns=""/><a href="/">Home</a> | <a href="/budget2014.cms">Budget 2014</a> | <a href="/market/1977021501.cms?exchange=n&exchangeid=50">Markets</a> | <a href="/industry/13352306.cms">Industry</a> | <a href="/articlelist/32897620.cms">ET Panache</a> | <a href="/summary.cms?idx=1">Portfolio</a> | <a href="/allsections.cms">All Sections</a> | <a href="http://epaper.timesofindia.com/index.asp">mPaper</a><br class="calibre5"/>To Download ET Apps, pls <a href="http://m.economictimes.com/mobileapps.cms">click here<div class="calibre5"><img alt="ET MOBILE" src="images/img9.png" class="calibre2"/><br class="calibre5"/></div></a><hr class="calibre6"/>Other Mobile Sites: <a href="http://m.timesofindia.com/">TOI MOBILE</a>, <a href="http://m.indiatimes.com">Indiatimes</a>, <a title="Follo" href="http://m.follo.co.in">follo</a>, <a title="GreetZap" href="http://m.greetzap.in">GreetZap</a>, <a title="Alive" href="http://aliveapp.in">Alive</a><br class="calibre5"/><a title="TimesJobs Mobile" href="http://m.timesjobs.com?src=etm">Job Search</a> | <a title="MagicBricks Mobile" href="http://m.magicbricks.com?source=etm">Property Search</a> | <a title="Ads2Book Mobile" href="http://m.ads2book.com?src=etm">Post Print Ad</a><hr class="calibre6"/><div class="calibre5">Copyright ©*2014*Bennett Coleman & Co. All rights reserved.<br class="calibre5"/>Powered by Indiatimes. 
<a href="http://m.economictimes.com/termsofuse.cms" class="calibre11">Terms of Use and Grievance Redressal Policy</a><span class="calibre12"> |</span><a href="/privacypolicy.cms" class="calibre13">Privacy Policy</a></div><config xmlns="http://www.w3.org/1999/xhtml" gaaccountid="MO-12812017-2"><div class="calibre5"><img src="images/img10.png" class="calibre2"/><br class="calibre5"/></div><p class="hidden"><div class="calibre5"><img id="hiddenImg" alt="*" class="calibre2"/><br class="calibre5"/></div></p></config><div class="calibrenavbar"> <hr class="calibre6"/> <p class="calibre14">This article was downloaded by <strong class="calibre15">calibre</strong> from <a href="http://economictimes.indiatimes.com/news/international/business/first-rate-hike-likely-early-2015-says-dallas-fed-president-richard-fisher/articleshow/38499011.cms">http://economictimes.indiatimes.com/news/international/business/first-rate-hike-likely-early-2015-says-dallas-fed-president-richard-fisher/articleshow/38499011.cms</a></p> <br class="calibre5"/><br class="calibre5"/> | <a href="../index.html#article_0">Section Menu</a> | <a href="../../index.html#feed_0">Main Menu</a> | </div></body></html> Last edited by hashken; 07-16-2014 at 02:03 PM. |
#2 |
creator of calibre
Posts: 45,231
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I ran your recipe and I don't see that. Here is the processed HTML for one article:
Spoiler:
|
#3 |
Member
Posts: 18
Karma: 10
Join Date: Mar 2014
Device: Kindle Paperwhite 1st Gen
|
Hi Kovid,
Your output also has the <div xmlns="" class="storycontent"> tag. It is in the longest line of your output. This tag and its contents form the main portion of the article, and that is exactly what does not appear in the final .mobi file. |
#4 |
creator of calibre
Posts: 45,231
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I'm confused: are you saying the content is missing from the processed HTML or from the final book?
|
#5 |
creator of calibre
Posts: 45,231
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
In any case just add
remove_attributes = ['xmlns'] to the recipe to take care of it. |
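
A rough sketch of where that line goes, assuming the recipe class from the first post. `remove_attributes` is a class-level list on `BasicNewsRecipe` naming attributes that calibre strips from every tag in the downloaded HTML. The `try/except` stub is only an assumption added so the snippet can be run outside calibre; a real recipe imports `BasicNewsRecipe` directly.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback stub so this sketch runs outside calibre (not part of a real recipe).
    class BasicNewsRecipe(object):
        pass


class TheEconomicTimes(BasicNewsRecipe):
    title = 'The Economic Times India'
    feeds = [(u'All articles', u'http://economictimes.indiatimes.com/rssfeedsdefault.cms')]

    # Strip the xmlns attribute from every tag, so <div xmlns="" class="storycontent">
    # becomes a plain <div class="storycontent"> before conversion.
    remove_attributes = ['xmlns']
```

The rest of the recipe (print_version, get_article_url and so on) stays unchanged; only the one attribute is added.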
#6 |
Member
Posts: 18
Karma: 10
Join Date: Mar 2014
Device: Kindle Paperwhite 1st Gen
|
The content is present in the processed HTML. As you can see in my original post, it is present in the index.html in the processed folder.
The content is only missing in the final book. |
#7 |
creator of calibre
Posts: 45,231
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
See my previous post
|
#8 |
Member
Posts: 18
Karma: 10
Join Date: Mar 2014
Device: Kindle Paperwhite 1st Gen
|
Surprisingly, removing the "xmlns" attribute made everything work fine.
Is this an existing bug, or is it the expected behaviour? If it is expected, why? |
#9 |
creator of calibre
Posts: 45,231
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It is expected behavior. When a tag in an xhtml document declares its namespace to be something other than the XHTML namespace, which is what xmlns="" does, that tag is no longer part of the html document and the converter ignores it.
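
The namespace effect can be seen with Python's standard library alone. This is only an illustration using `xml.etree.ElementTree`, not the code calibre itself runs, and the `SAMPLE` markup and the helper name `classes_outside_xhtml` are made up for the example. An element carrying xmlns="" is parsed with no namespace, while its siblings inherit the XHTML namespace from the root.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Hypothetical, trimmed-down version of the processed article HTML.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="calibre5">Summary paragraph</div>'
    '<div xmlns="" class="storycontent">Main article text</div>'
    '</body></html>'
)


def classes_outside_xhtml(markup):
    """Return the class of every element that is NOT in the XHTML namespace."""
    root = ET.fromstring(markup)
    prefix = '{' + XHTML_NS + '}'
    return [el.get('class') for el in root.iter() if not el.tag.startswith(prefix)]


# With xmlns="" present, the storycontent div falls outside XHTML.
before = classes_outside_xhtml(SAMPLE)
# With the attribute stripped, every element is back in the XHTML namespace.
after = classes_outside_xhtml(SAMPLE.replace(' xmlns=""', ''))

print(before)  # ['storycontent']
print(after)   # []
```

A consumer that only walks XHTML-namespaced elements would therefore skip the storycontent div in the first case and see it in the second.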
|
#10 |
Member
Posts: 18
Karma: 10
Join Date: Mar 2014
Device: Kindle Paperwhite 1st Gen
|
Oh, I didn't know that. Thanks for the prompt replies. Keep up the good work.
|
#11 |
Junior Member
Posts: 3
Karma: 10
Join Date: Jan 2016
Device: Kindle
|
Modified the code to fix Economic Times downloaded content
Recently, Economic Times changed the guid tags in its feeds to a text message, which broke the recipe again. I have fixed it, or at least I believe I have. Moreover, pointing to the mobile site was not working very well, so I pointed the recipe back to the old print version URL. Code attached.
Code:
    __license__ = 'GPL v3'
    __copyright__ = '2008-2014, Karthik <hashkendistro@gmail.com>, Darko Miletic <darko.miletic at gmail.com>'
    '''
    economictimes.indiatimes.com
    '''

    from calibre.web.feeds.news import BasicNewsRecipe


    class TheEconomicTimes(BasicNewsRecipe):
        title = 'The Economic Times India'
        __author__ = 'Karthik K, Darko Miletic'
        description = 'Financial news from India'
        publisher = 'economictimes.indiatimes.com'
        category = 'news, finances, politics, India'
        oldest_article = 1
        max_articles_per_feed = 100
        no_stylesheets = True
        #use_embedded_content = False
        simultaneous_downloads = 1
        encoding = 'utf-8'
        language = 'en_IN'
        publication_type = 'newspaper'
        masthead_url = 'http://economictimes.indiatimes.com/photo/2676871.cms'
        extra_css = """
            body{font-family: Arial,Helvetica,sans-serif}
            .foto_mg{font-size: 60%; font-weight: 700;}
            h1{font-size: 150%;}
            artdate{font-size: 60%}
            artag{font-size: 60%}
            div.storycontent{padding-top: 10px}
        """

        conversion_options = {
            'comment': description,
            'tags': category,
            'publisher': publisher,
            'language': language
        }

        remove_tags_before = dict(name='article')
        remove_tags_after = [dict(name='article')]
        remove_tags = [
            dict(name='div', attrs={'class':'cmtLinks'}),
            dict(name='div', attrs={'class':'raltedTopics'}),
            dict(name='div', attrs={'class':'editorsPick'}),
            dict(name='div', attrs={'class':'articleImg etSpecial'}),
            dict(name='div', attrs={'class':'articleImg artAd'}),
            dict(name='div', attrs={'class':'appPromotion'})
        ]
        remove_attributes = ['xmlns']

        feeds = [
            (u'Top Stories', u'http://economictimes.indiatimes.com/rssfeedstopstories.cms'),
            (u'News', u'http://economictimes.indiatimes.com/News/rssfeeds/1715249553.cms'),
            (u'Market', u'http://economictimes.indiatimes.com/Markets/markets/rssfeeds/1977021501.cms'),
            (u'Personal Finance', u'http://economictimes.indiatimes.com/rssfeeds/837555174.cms'),
            (u'Infotech', u'http://economictimes.indiatimes.com/Infotech/rssfeeds/13357270.cms'),
            (u'Job', u'http://economictimes.indiatimes.com/Infotech/rssfeeds/107115.cms'),
            (u'Opinion', u'http://economictimes.indiatimes.com/opinion/opinionshome/rssfeeds/897228639.cms'),
            (u'Features', u'http://economictimes.indiatimes.com/Features/etfeatures/rssfeeds/1466318837.cms'),
            (u'Environment', u'http://economictimes.indiatimes.com/rssfeeds/2647163.cms'),
            (u'NRI', u'http://economictimes.indiatimes.com/rssfeeds/7771250.cms')
        ]

        #Uses the mobile print version. For web print version use 'http://economictimes.indiatimes.com/articleshow/<article_id>?prtpage=1'
        def print_version(self, url):
            rest, sep, article_id = url.rpartition('/articleshow/')
            #return 'http://m.economictimes.com/PDAET/articleshow/' + article_id
            return 'http://economictimes.indiatimes.com/articleshow/' + article_id+ '?prtpage=1'

        def get_article_url(self, article):
            rurl = article.get('link', None)
            if (rurl.find('/quickieslist/') > 0) or (rurl.find('/quickiearticleshow/') > 0):
                return None
            return rurl

        def preprocess_html(self, soup):
            for item in soup.findAll(style=True):
                del item['style']
            return soup

        def postprocess_html(self, soup, first_fetch):
            return self.adeify_images(soup)

Last edited by PeterT; 01-31-2016 at 06:51 PM. Reason: Code was unreadable; changed to code tags to preserve spacing |
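
The two URL helpers in this recipe can be tried on their own, outside calibre. The sketch below lifts them out as plain functions; the `None` guard and the use of `in` instead of `find(...) > 0` are additions of this sketch rather than part of the posted recipe, and a plain dict stands in for the feed entry that calibre would pass.

```python
def print_version(url):
    """Build the web print-version URL from a full article URL."""
    # rpartition splits on the LAST '/articleshow/'; the tail is '<article_id>.cms'.
    rest, sep, article_id = url.rpartition('/articleshow/')
    return 'http://economictimes.indiatimes.com/articleshow/' + article_id + '?prtpage=1'


def get_article_url(article):
    """Return the entry's link, or None for missing links and 'quickie' pages."""
    rurl = article.get('link')
    if not rurl:
        # Assumption of this sketch: skip entries without a link instead of
        # letting a method call on None raise AttributeError.
        return None
    if '/quickieslist/' in rurl or '/quickiearticleshow/' in rurl:
        return None
    return rurl


sample = ('http://economictimes.indiatimes.com/news/international/business/'
          'first-rate-hike-likely-early-2015-says-dallas-fed-president-richard-fisher/'
          'articleshow/38499011.cms')

print(print_version(sample))
# http://economictimes.indiatimes.com/articleshow/38499011.cms?prtpage=1
print(get_article_url({'link': sample}) == sample)  # True
print(get_article_url({}))                          # None
```

Note that if a URL contains no '/articleshow/' segment, `rpartition` returns the whole URL as the last element, so the resulting print URL would be malformed; a stricter version could return the original URL in that case.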
#12 |
Junior Member
Posts: 3
Karma: 10
Join Date: Jan 2016
Device: Kindle
|
Economic Times Recipe Broken Again and fixed now.
Economic Times changed its page format again, which broke the recipe.
The updated recipe is below:
Code:
    __license__ = 'GPL v3'
    __copyright__ = '2008-2014, Karthik <hashkendistro@gmail.com>, Darko Miletic <darko.miletic at gmail.com>'
    '''
    economictimes.indiatimes.com
    '''

    from calibre.web.feeds.news import BasicNewsRecipe


    class TheEconomicTimes(BasicNewsRecipe):
        title = 'The Economic Times India'
        __author__ = 'Karthik <hashkendistro@gmail.com>, Darko Miletic <darko.miletic at gmail.com>'
        description = 'Financial news from India'
        publisher = 'economictimes.indiatimes.com'
        category = 'news, finances, politics, India'
        oldest_article = 1
        max_articles_per_feed = 100
        no_stylesheets = True
        use_embedded_content = False
        simultaneous_downloads = 1
        encoding = 'utf-8'
        language = 'en_IN'
        publication_type = 'newspaper'
        masthead_url = 'http://economictimes.indiatimes.com/photo/2676871.cms'
        extra_css = """
            body{font-family: Arial,Helvetica,sans-serif}
            .foto_mg{font-size: 60%; font-weight: 700;}
            h1{font-size: 150%;}
            artdate{font-size: 60%}
            artag{font-size: 60%}
            div.storycontent{padding-top: 10px}
        """

        conversion_options = {
            'comment': description,
            'tags': category,
            'publisher': publisher,
            'language': language
        }

        remove_tags_before = dict(name='article')
        remove_tags_after = [dict(name='article')]
        keep_only_tags = [
            dict(name='h1', attrs={'class':'title'}),
            dict(name='div', attrs={'class':'bylineFull'}),
            dict(name='div', attrs={'class':'articleImg'}),
            dict(name='div', attrs={'class':'artText'})
        ]
        remove_tags = [
            dict(name='div', attrs={'class':'cmtLinks'}),
            dict(name='div', attrs={'class':'raltedTopics'}),
            dict(name='div', attrs={'class':'editorsPick'}),
            dict(name='div', attrs={'class':'articleImg etSpecial'}),
            dict(name='div', attrs={'class':'articleImg artAd'}),
            dict(name='div', attrs={'class':'appPromotion'})
        ]
        remove_attributes = ['xmlns']

        feeds = [
            (u'Top Stories', u'http://economictimes.indiatimes.com/rssfeedstopstories.cms'),
            (u'News', u'http://economictimes.indiatimes.com/News/rssfeeds/1715249553.cms'),
            (u'Market', u'http://economictimes.indiatimes.com/Markets/markets/rssfeeds/1977021501.cms'),
            (u'Personal Finance', u'http://economictimes.indiatimes.com/rssfeeds/837555174.cms'),
            (u'Infotech', u'http://economictimes.indiatimes.com/Infotech/rssfeeds/13357270.cms'),
            (u'Job', u'http://economictimes.indiatimes.com/Infotech/rssfeeds/107115.cms'),
            (u'Opinion', u'http://economictimes.indiatimes.com/opinion/opinionshome/rssfeeds/897228639.cms'),
            (u'Features', u'http://economictimes.indiatimes.com/Features/etfeatures/rssfeeds/1466318837.cms'),
            (u'Environment', u'http://economictimes.indiatimes.com/rssfeeds/2647163.cms'),
            (u'NRI', u'http://economictimes.indiatimes.com/rssfeeds/7771250.cms')
        ]

        # Uses the mobile print version. For web print version use 'http://economictimes.indiatimes.com/articleshow/<article_id>?prtpage=1'
        def print_version(self, url):
            rest, sep, article_id = url.rpartition('/articleshow/')
            # return 'http://m.economictimes.com/PDAET/articleshow/' + article_id
            return 'http://economictimes.indiatimes.com/articleshow/' + article_id+ '?prtpage=1'

        def get_article_url(self, article):
            rurl = article.get('link', None)
            if (rurl.find('/quickieslist/') > 0) or (rurl.find('/quickiearticleshow/') > 0):
                return None
            return rurl

        def preprocess_html(self, soup):
            for item in soup.findAll(style=True):
                del item['style']
            return soup

        def postprocess_html(self, soup, first_fetch):
            return self.adeify_images(soup)

|
Tags |
convert html to epub, recipe |
|