Content missing in the final step of book creation

hashken · 07-16-2014, 02:58 PM

I'm trying to enhance the inbuilt Economic Times of India recipe but running into certain problems.

The recipe pulls in the mobile print version of the articles using the RSS feeds. In these articles the main content is located in a <div class="storycontent"> tag.

The heading, article summary etc. appear properly in the final ebook. But the main content inside the above-mentioned tag is missing from the final ebook.

I checked the ./debug/processed/feed_0/article_0/index.html file and the above tag along with its content was present. So, this means there is something wrong with the calibre converter.

A link to a sample article - http://m.economictimes.com/PDAET/art...w/38499011.cms

My recipe code

Code:

__license__   = 'GPL v3'
__copyright__ = '2008-2010, Darko Miletic <darko.miletic at gmail.com>'
'''
economictimes.indiatimes.com
'''


from calibre.web.feeds.news import BasicNewsRecipe

class TheEconomicTimes(BasicNewsRecipe):
    title                  = 'The Economic Times India'
    __author__             = 'Darko Miletic'
    description            = 'Financial news from India'
    publisher              = 'economictimes.indiatimes.com'
    category               = 'news, finances, politics, India'
    oldest_article         = 2
    max_articles_per_feed  = 100
    no_stylesheets         = True
    use_embedded_content   = False
    simultaneous_downloads = 1
    encoding               = 'utf-8'
    language               = 'en_IN'
    publication_type       = 'newspaper'
    masthead_url           = 'http://economictimes.indiatimes.com/photo/2676871.cms'
    extra_css              = """
                                 body{font-family: Arial,Helvetica,sans-serif}
                             """
    conversion_options     = {'comment'          : description, 
                              'tags'             : category,
                              'publisher'        : publisher,
                              'language'         : language
                             }
    #remove_tags_before     = dict(name='h1')
    #remove_tags_after      = dict(name='div', attrs={'class':'spacebw'})
    feeds                  = [(u'All articles', u'http://economictimes.indiatimes.com/rssfeedsdefault.cms')]


    #Uses the mobile print version. For web print version use 'http://economictimes.indiatimes.com/articleshow/<article_id>?prtpage=1'
    def print_version(self, url):
        rest, sep, article_id = url.rpartition('/articleshow/')
        return 'http://m.economictimes.com/PDAET/articleshow/' + article_id

    def get_article_url(self, article):
        rurl = article.get('guid',  None)
        # Skip entries with no guid, and "quickies" pages that are not full articles
        if not rurl or '/quickieslist/' in rurl or '/quickiearticleshow/' in rurl:
            return None
        return rurl

    def preprocess_html(self, soup):
        #for item in soup.findAll(style=True):
            #del item['style']
        return soup

    def postprocess_html(self, soup, first_fetch):
        return self.adeify_images(soup)

Content of ./debug/processed/feed_0/article_0/index.html

Code:

<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Last-Modified" content="16 Jul, 2237hrs IST"/>
    <title>First rate hike 'likely' early 2015, says Dallas Fed President Richard Fisher - The Economic Times on Mobile</title>
    <meta name="description" content="The Federal Reserve's policy-setting panel is 'likely' to start raising rates in early 2015, if not sooner, a top Fed official said on Wednesday."/>
    <meta name="keywords" content="US Federal reserve,US central bank,University of Southern California,Richard Fisher,Rate hike,President"/>
    <link xmlns="" rel="shortcut icon" href="http://m.economictimes.com/icons/etfavicon.ico"/>
    <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0; user-scalable=0;"/>
    <meta name="apple-mobile-web-app-capable" content="yes"/>
    <meta name="HandheldFriendly" content="true"/>
    <meta name="MobileOptimized" content="width"/>
    <config xmlns="http://www.w3.org/1999/xhtml" key="2147477890"/>
    <config/>
    <config xmlns="http://www.w3.org/1999/xhtml" datetimeformat="yyyy"/>
    <config datetimeformat="yyyy">
<link rel="canonical" href="http://economictimes.indiatimes.com/news/international/business/first-rate-hike-likely-early-2015-says-dallas-fed-president-richard-fisher/articleshow/38499011.cms"/>
</config>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  <link href="../../stylesheet.css" rel="stylesheet" type="text/css"/>
<link href="../../page_styles.css" rel="stylesheet" type="text/css"/>
</head>
  <body class="calibre"><div class="calibrenavbar">| <a href="../article_1/index.html">Next</a> | <a href="../index.html#article_0">Section Menu</a> | <a href="../../index.html#feed_0">Main Menu</a> | <hr class="calibre6"/>
</div><div class="calibre5"><a href="/rssfeeds/26519199.cms"><div class="calibre5"><img alt="ET MOBILE RSS" class="calibre2" src="images/img1.jpg"/><br class="calibre5"/></div></a><span>16 Jul, 2237hrs IST</span><a href="http://economictimes.indiatimes.com/">Full Site</a></div><div class="calibre5"><a href="/"><div class="calibre5"><img alt="ET MOBILE" src="images/img2.png" class="calibre2"/><br class="calibre5"/></div></a></div><div class="calibre5"><table width="98%" border="0" cellspacing="0" cellpadding="0" class="calibre7"><tr class="calibre8"><td class="bold" width="10%" valign="top">Sensex</td><td width="30%" class="bold">25549.72</td><td width="30%" class="bold"><span>**321.07**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td><td width="30%" class="bold"><span>**1.27%
										**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td></tr><tr class="calibre8"><td class="bold" width="10%" valign="top">Nifty</td><td width="30%" class="bold">7624.40</td><td width="30%" class="bold"><span>**97.75**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td><td width="30%" class="bold"><span>**1.30%
										**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td></tr></table><form action="/stockquotes.cms" method="get" name="stockfrm" class="calibre5"><div class="calibre5"><input onclick="quote_blank();" value="Get Quote" size="20" name="ticker" type="text"/>**<input name="B1" value="Go" type="submit"/><a title="Mobile Apps" href="/mobileapps.cms"><div class="calibre5"><img alt="Mobile Apps" src="images/img4.png" class="calibre2"/><br class="calibre5"/></div></a></div></form></div><hr class="calibre6"/><a href="/">Home</a> | <a href="/budget2014.cms">Budget 2014</a> | <a href="/market/1977021501.cms?exchange=n&amp;exchangeid=50">Markets</a> | <a href="/industry/13352306.cms">Industry</a> | <a href="/articlelist/32897620.cms">ET Panache</a> | <a href="/summary.cms?idx=1">Portfolio</a> | <a href="/allsections.cms">All Sections</a> | <a href="http://epaper.timesofindia.com/index.asp">mPaper</a><hr xmlns="http://www.w3.org/1999/xhtml" class="calibre6"/><div xmlns="" style="width:100%;text-align:center;"></div><div xmlns="http://www.w3.org/1999/xhtml" class="calibre5"><div xmlns="http://www.w3.org/1999/xhtml" class="calibre5"><div class="calibre5"><img alt="" hspace="5" src="images/img6.png" class="calibre2"/><br class="calibre5"/></div><a href="/mail/38499011.cms">E-mail this</a></div><h2 xmlns="http://www.w3.org/1999/xhtml" class="calibre9">BUSINESS</h2></div><div class="calibre5"><config showseo="1" showslide="1" showrelatedarticle="1" datetimeformat="d mmm, yyyy, hhnn  'hrs IST'"><h1 class="calibre10">First rate hike 'likely' early 2015, says Dallas Fed President Richard Fisher</h1><div class="calibre5"><artdate>16 Jul, 2014, 2232  hrs IST</artdate>,*<artag>Reuters</artag></div><div class="calibre5"><div class="calibre5"><a href="/PDAET/quickiearticleshow/38499028.cms"><div class="calibre5"><img alt="" class="calibre2" src="images/img7.jpg"/><br 
class="calibre5"/></div></a></div><div class="calibre5">The Federal Reserve's policy-setting panel is 'likely' to start raising rates in early 2015, if not sooner, a top Fed official said on Wednesday.</div></div><div xmlns="" class="storycontent"> LOS ANGELES: The Federal Reserve's policy-setting panel is 'likely' to start raising rates in early 2015, if not sooner, a top Fed official said on Wednesday. <br/><br/> The prediction from Dallas Fed President Richard Fisher went beyond his prepared remarks to the University of Southern California, in which he said the Fed "may well" raise rates in early 2015. Futures traders currently expect a first rate rise in mid-2015. <br/><br/> The rate rises will likely come in "gradual increments," he said. <br/><br/> Fisher is a voting member of the US central bank's policy-setting committee this year. <meta content="cms.next" name="cmsei"/></div></config></div><br class="calibre5"/><div xmlns="" class="spacebw"><div id="ad36070" name="ad36070" align="center"></div></div><br xmlns="http://www.w3.org/1999/xhtml" class="calibre5"/><div id="mob_add" class="calibre5"></div><hr xmlns=""/><a href="/">Home</a> | <a href="/budget2014.cms">Budget 2014</a> | <a href="/market/1977021501.cms?exchange=n&amp;exchangeid=50">Markets</a> | <a href="/industry/13352306.cms">Industry</a> | <a href="/articlelist/32897620.cms">ET Panache</a> | <a href="/summary.cms?idx=1">Portfolio</a> | <a href="/allsections.cms">All Sections</a> | <a href="http://epaper.timesofindia.com/index.asp">mPaper</a><br class="calibre5"/>To Download ET Apps, pls <a href="http://m.economictimes.com/mobileapps.cms">click here<div class="calibre5"><img alt="ET MOBILE" src="images/img9.png" class="calibre2"/><br class="calibre5"/></div></a><hr class="calibre6"/>Other Mobile Sites: <a href="http://m.timesofindia.com/">TOI MOBILE</a>, <a href="http://m.indiatimes.com">Indiatimes</a>,
		<a title="Follo" href="http://m.follo.co.in">follo</a>,
		<a title="GreetZap" href="http://m.greetzap.in">GreetZap</a>,
		<a title="Alive" href="http://aliveapp.in">Alive</a><br class="calibre5"/><a title="TimesJobs Mobile" href="http://m.timesjobs.com?src=etm">Job Search</a> | <a title="MagicBricks Mobile" href="http://m.magicbricks.com?source=etm">Property Search</a> | <a title="Ads2Book Mobile" href="http://m.ads2book.com?src=etm">Post Print Ad</a><hr class="calibre6"/><div class="calibre5">Copyright  ©*2014*Bennett Coleman &amp; Co. All rights reserved.<br class="calibre5"/>Powered by Indiatimes. <a href="http://m.economictimes.com/termsofuse.cms" class="calibre11">Terms of Use and Grievance Redressal Policy</a><span class="calibre12"> |</span><a href="/privacypolicy.cms" class="calibre13">Privacy Policy</a></div><config xmlns="http://www.w3.org/1999/xhtml" gaaccountid="MO-12812017-2"><div class="calibre5"><img src="images/img10.png" class="calibre2"/><br class="calibre5"/></div><p class="hidden"><div class="calibre5"><img id="hiddenImg" alt="*" class="calibre2"/><br class="calibre5"/></div></p></config><div class="calibrenavbar">
<hr class="calibre6"/>
<p class="calibre14">This article was downloaded by <strong class="calibre15">calibre</strong> from <a href="http://economictimes.indiatimes.com/news/international/business/first-rate-hike-likely-early-2015-says-dallas-fed-president-richard-fisher/articleshow/38499011.cms">http://economictimes.indiatimes.com/news/international/business/first-rate-hike-likely-early-2015-says-dallas-fed-president-richard-fisher/articleshow/38499011.cms</a></p>
<br class="calibre5"/><br class="calibre5"/> | <a href="../index.html#article_0">Section Menu</a> | <a href="../../index.html#feed_0">Main Menu</a> | </div></body></html>

kovidgoyal · 07-17-2014, 03:16 AM

I ran your recipe and I don't see that. Here is the processed HTML for one article:

Code:

<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Last-Modified" content="17 Jul, 1107hrs IST"/>
    <title>Economy report after one moth of Modi government: Growth looks up, inflation cools - The Economic Times on Mobile</title>
    <meta name="description" content="A series of good data numbers have come out in recent days that suggest the economy is picking up from decade-low growth rates in the past two years."/>
    <meta name="keywords" content="Wholesale price index,united states,Ukraine,State Bank Of India,settlement option,Rohini Malkani,productivity,net worth,Narendra Modi,Modi Government,markets,Insurability,Inflation,ICRA,HSBC,Gold,gdp,economy,Department of Commerce,current account,consumer price index,Citigroup,Bank of India"/>
    <link xmlns="" rel="shortcut icon" href="http://m.economictimes.com/icons/etfavicon.ico"/>
    <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0; user-scalable=0;"/>
    <meta name="apple-mobile-web-app-capable" content="yes"/>
    <meta name="HandheldFriendly" content="true"/>
    <meta name="MobileOptimized" content="width"/>
    <config xmlns="http://www.w3.org/1999/xhtml" key="2147477890"/>
    <config/>
    <config xmlns="http://www.w3.org/1999/xhtml" datetimeformat="yyyy"/>
    <config datetimeformat="yyyy">
<link rel="canonical" href="http://economictimes.indiatimes.com/news/economy/indicators/economy-report-after-one-moth-of-modi-government-growth-looks-up-inflation-cools/articleshow/38510439.cms"/>
</config>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  <link href="../../stylesheet.css" rel="stylesheet" type="text/css"/>
<link href="../../page_styles.css" rel="stylesheet" type="text/css"/>
</head>
  <body class="calibre"><div class="calibrenavbar">| <a href="../article_1/index.html">Next</a> | <a href="../index.html#article_0">Section Menu</a> | <a href="../../index.html#feed_0">Main Menu</a> | <hr class="calibre6"/>
</div><div class="calibre5"><a href="/rssfeeds/344531568.cms"><div class="calibre5"><img alt="ET MOBILE RSS" class="calibre2" src="images/img1.jpg"/><br class="calibre5"/></div></a><span>17 Jul, 1107hrs IST</span><a href="http://economictimes.indiatimes.com/">Full Site</a></div><div class="calibre5"><a href="/"><div class="calibre5"><img alt="ET MOBILE" src="images/img2.png" class="calibre2"/><br class="calibre5"/></div></a></div><div class="calibre5"><table width="98%" border="0" cellspacing="0" cellpadding="0" class="calibre7"><tr class="calibre8"><td class="bold" width="10%" valign="top">Sensex</td><td width="30%" class="bold">25592.03</td><td width="30%" class="bold"><span>**42.31**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td><td width="30%" class="bold"><span>**0.17%
										**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td></tr><tr class="calibre8"><td class="bold" width="10%" valign="top">Nifty</td><td width="30%" class="bold">7638.30</td><td width="30%" class="bold"><span>**13.90**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td><td width="30%" class="bold"><span>**0.18%
										**<div class="calibre5"><img alt="Sensex Decrease" title="Sensex Decrease" src="images/img3.png" class="calibre2"/><br class="calibre5"/></div></span></td></tr></table><form action="/stockquotes.cms" method="get" name="stockfrm" class="calibre5"><div class="calibre5"><input onclick="quote_blank();" value="Get Quote" size="20" name="ticker" type="text"/>**<input name="B1" value="Go" type="submit"/><a title="Mobile Apps" href="/mobileapps.cms"><div class="calibre5"><img alt="Mobile Apps" src="images/img4.png" class="calibre2"/><br class="calibre5"/></div></a></div></form></div><hr class="calibre6"/><a href="/">Home</a> | <a href="/budget2014.cms">Budget 2014</a> | <a href="/market/1977021501.cms?exchange=n&amp;exchangeid=50">Markets</a> | <a href="/industry/13352306.cms">Industry</a> | <a href="/articlelist/32897620.cms">ET Panache</a> | <a href="/summary.cms?idx=1">Portfolio</a> | <a href="/allsections.cms">All Sections</a> | <a href="http://epaper.timesofindia.com/index.asp">mPaper</a><hr xmlns="http://www.w3.org/1999/xhtml" class="calibre6"/><div xmlns="" style="width:100%;text-align:center;"></div><div xmlns="http://www.w3.org/1999/xhtml" class="calibre5"><div xmlns="http://www.w3.org/1999/xhtml" class="calibre5"><div class="calibre5"><img alt="" hspace="5" src="images/img6.png" class="calibre2"/><br class="calibre5"/></div><a href="/mail/38510439.cms">E-mail this</a></div><h2 xmlns="http://www.w3.org/1999/xhtml" class="calibre9">INDICATORS</h2></div><div class="calibre5"><config showseo="1" showslide="1" showrelatedarticle="1" datetimeformat="d mmm, yyyy, hhnn  'hrs IST'"><h1 class="calibre10">Economy report after one moth of Modi government: Growth looks up, inflation cools</h1><div class="calibre5"><artdate>17 Jul, 2014, 0722  hrs IST</artdate>,*<artag>ET Bureau</artag></div><div class="calibre5"><div class="calibre5"><a href="/PDAET/quickiearticleshow/38510726.cms"><div class="calibre5"><img alt="" class="calibre2" src="images/img7.jpg"/><br 
class="calibre5"/></div></a></div><div class="calibre5">The trade deficit was $11.78 billion in June, the highest in a year, but only marginally more than $11.28 billion in May.</div></div><div xmlns="" class="storycontent"><p> NEW DELHI: The first full month under the Narendra Modi government's watch turned out to be a good one for the economy with macro indicators looking up and inflation lower despite lingering monsoon doubts, suggesting that growth could have finally bottomed out.<br/> <br/> Exports rose 10.2% in June from a year ago, the government said on Wednesday, marking yet another positive development following a series of good numbers in recent days that suggest the economy is picking up from decade-low growth rates in the past two years.<br/> <br/> Industrial production rose to a 19-month high of 4.7% in May while car sales rose at their fastest pace in 10 months in June, clearly indicating that the consumer was more confident of the new government shaping recovery.<br/> <br/> Services activity rose to a 17-month high in June on the strength of robust order flow, according to the HSBC Purchasing Managers' Index, indicating rising optimism in the sector that has a share of more than 60% in the economy.<br/> <br/> Imports rose for the first time in a year, at around 8.3%, confirming some sort of recovery in the domestic economy even after discounting for higher gold imports, which rose nearly 65% in June after the Reserve Bank of India eased rules by allowing more entities to import gold.<br/> <br/> India's other big concern, retail inflation, dropped to 7.31% in June, the lowest since the government started reporting consumer price index inflation in January 2012, although the monsoon fears loom large.<br/> <br/> And to top it all, the trade deficit was $11.78 billion in June, the highest in a year, but only marginally more than $11.28 billion in May, according to data released on Wednesday by the commerce department.<br/> <div><img 
src="images/img8.jpg" class="gwt-Image"/><br/></div><br/> <br/> <br/> Markets cheered the development, with the Sensex rising 1.27% to 25,549.72 points. "The export data is very encouraging, especially the fact that it is led by robust performance of engineering goods, indicating a productivity revival. Given that non-oil, non-gold imports have shown an uptick, industrial production for June will also be quite strong," said Soumya Kanti Ghosh, chief economic advisor, State Bank of India.<br/> <br/> "One can say looking at car sales, manufacturing and exports data that the economy may well have finally bottomed out." That will bode well for the Modi government, which has pledged to turn the economy around while bringing prices under control. The economy could begin the first quarter of the current year at near-5% growth, up from 4.6% in the January-March quarter.<br/> <br/> The decline in global commodity prices will also act as a booster although Iraq and Ukraine are geopolitical sore spots with the potential to reverse the trend. 
Meanwhile, the June-September monsoon has been patchy although rains have picked up in the past two days.</p> </div><strong class="calibre11">Page 1 of 2 </strong><span></span><a href="/news/economy/indicators/economy-report-after-one-moth-of-modi-government-growth-looks-up-inflation-cools/articleshow/msid-38510439,curpg-2.cms">Next</a></config></div><br class="calibre5"/><div xmlns="" class="spacebw"><div id="ad36070" name="ad36070" align="center"></div></div><br xmlns="http://www.w3.org/1999/xhtml" class="calibre5"/><div id="mob_add" class="calibre5"></div><hr xmlns=""/><a href="/">Home</a> | <a href="/budget2014.cms">Budget 2014</a> | <a href="/market/1977021501.cms?exchange=n&amp;exchangeid=50">Markets</a> | <a href="/industry/13352306.cms">Industry</a> | <a href="/articlelist/32897620.cms">ET Panache</a> | <a href="/summary.cms?idx=1">Portfolio</a> | <a href="/allsections.cms">All Sections</a> | <a href="http://epaper.timesofindia.com/index.asp">mPaper</a><br class="calibre5"/>To Download ET Apps, pls <a href="http://m.economictimes.com/mobileapps.cms">click here<div class="calibre5"><img alt="ET MOBILE" src="images/img10.png" class="calibre2"/><br class="calibre5"/></div></a><hr class="calibre6"/>Other Mobile Sites: <a href="http://m.timesofindia.com/">TOI MOBILE</a>, <a href="http://m.indiatimes.com">Indiatimes</a>,
		<a title="Follo" href="http://m.follo.co.in">follo</a>,
		<a title="GreetZap" href="http://m.greetzap.in">GreetZap</a>,
		<a title="Alive" href="http://aliveapp.in">Alive</a><br class="calibre5"/><a title="TimesJobs Mobile" href="http://m.timesjobs.com?src=etm">Job Search</a> | <a title="MagicBricks Mobile" href="http://m.magicbricks.com?source=etm">Property Search</a> | <a title="Ads2Book Mobile" href="http://m.ads2book.com?src=etm">Post Print Ad</a><hr class="calibre6"/><div class="calibre5">Copyright  ©*2014*Bennett Coleman &amp; Co. All rights reserved.<br class="calibre5"/>Powered by Indiatimes. <a href="http://m.economictimes.com/termsofuse.cms" class="calibre12">Terms of Use and Grievance Redressal Policy</a><span class="calibre13"> |</span><a href="/privacypolicy.cms" class="calibre14">Privacy Policy</a></div><config xmlns="http://www.w3.org/1999/xhtml" gaaccountid="MO-12812017-2"><div class="calibre5"><img src="images/img11.png" class="calibre2"/><br class="calibre5"/></div><p class="hidden"><div class="calibre5"><img id="hiddenImg" alt="*" class="calibre2"/><br class="calibre5"/></div></p></config><div class="calibrenavbar">
<hr class="calibre6"/>
<p class="calibre15">This article was downloaded by <strong class="calibre11">calibre</strong> from <a href="http://economictimes.indiatimes.com/news/economy/indicators/economy-report-after-one-moth-of-modi-government-growth-looks-up-inflation-cools/articleshow/38510439.cms">http://economictimes.indiatimes.com/news/economy/indicators/economy-report-after-one-moth-of-modi-government-growth-looks-up-inflation-cools/articleshow/38510439.cms</a></p>
<br class="calibre5"/><br class="calibre5"/> | <a href="../index.html#article_0">Section Menu</a> | <a href="../../index.html#feed_0">Main Menu</a> | </div></body></html>

hashken · 07-17-2014, 04:49 AM

Hi Kovid,

Your output too has the <div xmlns="" class="storycontent"> tag. It is present in the longest line in your output.

It is this tag and its contents that form the main portion of the article, and this is just not appearing in the final .mobi file.

kovidgoyal · 07-17-2014, 04:51 AM

I'm confused. Are you saying the content is missing from the processed HTML or from the final book?

kovidgoyal · 07-17-2014, 04:53 AM

In any case just add

remove_attributes = ['xmlns']

to the recipe to take care of it.

hashken · 07-17-2014, 04:54 AM

The content is present in the processed HTML. As you can see in my original post, it is present in the index.html in the processed folder.

The content is only missing in the final book.

kovidgoyal · 07-17-2014, 08:57 AM

See my previous post

hashken · 07-17-2014, 09:33 AM

Surprisingly, removing the "xmlns" attribute made everything work fine.

Is this an existing bug, or is it the expected behaviour? If it is expected, why?

kovidgoyal · 07-17-2014, 09:47 AM

It is expected behavior. When a tag in an xhtml document declares its namespace to be something other than the XHTML namespace, which is what xmlns="" does, that tag is no longer part of the html document and the converter ignores it.
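
For anyone who wants to see the namespace effect for themselves, here is a minimal sketch using Python's standard xml.etree.ElementTree. It is not calibre's conversion code, and the sample markup and the in_xhtml_namespace helper are made up for illustration. It only shows that an XML parser places a tag carrying xmlns="" outside the XHTML namespace, and that the tag rejoins the namespace once that declaration is gone.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Illustrative markup modelled on the processed article (not the real page)
doc = '''<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Heading</h1>
    <div xmlns="" class="storycontent">Main article text</div>
  </body>
</html>'''


def in_xhtml_namespace(elem):
    # ElementTree stores namespaced tags as '{namespace}localname'
    return elem.tag.startswith('{%s}' % XHTML_NS)


# Document order is html, body, h1, div
root = ET.fromstring(doc)
flags = [in_xhtml_namespace(elem) for elem in root.iter()]

# Dropping the empty namespace declaration lets the div inherit
# the XHTML namespace from the <html> element again
fixed = ET.fromstring(doc.replace(' xmlns=""', ''))
fixed_flags = [in_xhtml_namespace(elem) for elem in fixed.iter()]

for elem, ok in zip(root.iter(), flags):
    print(elem.tag, ok)
```

In the first parse only the storycontent div falls outside the XHTML namespace, which matches the part of the article that went missing from the book.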

hashken · 07-17-2014, 09:48 AM

Oh, didn't know that. Thanks for the prompt replies. Keep up the good work.

Sambit · 01-31-2016, 06:45 PM

Recently, Economic Times changed the guid tags to a text message, which broke the recipe again. I have fixed it, or at least I think I have. Also, pointing to the mobile site was not working very well, so I pointed it back to the old print version URL. Code attached.

Code:

__license__   = 'GPL v3'
__copyright__ = '2008-2014, Karthik <hashkendistro@gmail.com>, Darko Miletic <darko.miletic at gmail.com>'
'''
economictimes.indiatimes.com
'''


from calibre.web.feeds.news import BasicNewsRecipe

class TheEconomicTimes(BasicNewsRecipe):
    title                  = 'The Economic Times India'
    __author__             = 'Karthik K, Darko Miletic'
    description            = 'Financial news from India'
    publisher              = 'economictimes.indiatimes.com'
    category               = 'news, finances, politics, India'
    oldest_article         = 1
    max_articles_per_feed  = 100
    no_stylesheets         = True
    #use_embedded_content   = False
    simultaneous_downloads = 1
    encoding               = 'utf-8'
    language               = 'en_IN'
    publication_type       = 'newspaper'
    masthead_url           = 'http://economictimes.indiatimes.com/photo/2676871.cms'
    extra_css              = """
                                 body{font-family: Arial,Helvetica,sans-serif}
                                 .foto_mg{font-size: 60%; 
                                          font-weight: 700;}
                                 h1{font-size: 150%;}
                                 artdate{font-size: 60%}
                                 artag{font-size: 60%}
                                 div.storycontent{padding-top: 10px}
                             """
    conversion_options     = {'comment'          : description, 
                              'tags'             : category,
                              'publisher'        : publisher,
                              'language'         : language
                             }
    remove_tags_before     = dict(name='article')
    remove_tags_after      = [dict(name='article')]
    remove_tags            = [dict(name='div', attrs={'class':'cmtLinks'}),
                              dict(name='div', attrs={'class':'raltedTopics'}),
                              dict(name='div', attrs={'class':'editorsPick'}),
                              dict(name='div', attrs={'class':'articleImg etSpecial'}),
                              dict(name='div', attrs={'class':'articleImg artAd'}),
                              dict(name='div', attrs={'class':'appPromotion'}) 
                             ]
    remove_attributes      = ['xmlns']
    feeds                  = [(u'Top Stories', u'http://economictimes.indiatimes.com/rssfeedstopstories.cms'),
                              (u'News', u'http://economictimes.indiatimes.com/News/rssfeeds/1715249553.cms'),
                              (u'Market', u'http://economictimes.indiatimes.com/Markets/markets/rssfeeds/1977021501.cms'),
                              (u'Personal Finance', u'http://economictimes.indiatimes.com/rssfeeds/837555174.cms'),
                              (u'Infotech', u'http://economictimes.indiatimes.com/Infotech/rssfeeds/13357270.cms'),
                              (u'Job', u'http://economictimes.indiatimes.com/Infotech/rssfeeds/107115.cms'),
                              (u'Opinion', u'http://economictimes.indiatimes.com/opinion/opinionshome/rssfeeds/897228639.cms'),
                              (u'Features', u'http://economictimes.indiatimes.com/Features/etfeatures/rssfeeds/1466318837.cms'),
                              (u'Environment', u'http://economictimes.indiatimes.com/rssfeeds/2647163.cms'),
                              (u'NRI', u'http://economictimes.indiatimes.com/rssfeeds/7771250.cms')
                            ]



    # Uses the web print version ('?prtpage=1'). The old mobile print version URL is kept below, commented out.
    def print_version(self, url):
        rest, sep, article_id = url.rpartition('/articleshow/')
        #return 'http://m.economictimes.com/PDAET/articleshow/' + article_id
        return 'http://economictimes.indiatimes.com/articleshow/' + article_id+ '?prtpage=1'

    def get_article_url(self, article):
        rurl = article.get('link',  None)
        # Skip entries with no link, and "quickies" pages that are not full articles
        if not rurl or '/quickieslist/' in rurl or '/quickiearticleshow/' in rurl:
            return None
        return rurl

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        return soup

    def postprocess_html(self, soup, first_fetch):
        return self.adeify_images(soup)
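
In case the feed changes again, here is a rough sketch of a more defensive way to pick the URL. pick_article_url is a hypothetical standalone helper, not part of calibre or of the recipe above. It assumes the feed entry supports .get() like a dict, and that a usable guid or link always starts with http:// or https://.

```python
def pick_article_url(article):
    """Return a usable article URL, or None to skip the entry.

    Hypothetical helper: prefers 'guid', falls back to 'link' when the
    guid is missing or is plain text rather than a URL.
    """
    url = None
    for key in ('guid', 'link'):
        candidate = article.get(key) or ''
        if candidate.startswith(('http://', 'https://')):
            url = candidate
            break
    if url is None:
        return None
    # Same filter as the recipe: skip the "quickies" pages
    if '/quickieslist/' in url or '/quickiearticleshow/' in url:
        return None
    return url
```

Inside a recipe, get_article_url could simply return pick_article_url(article), so a change to either field alone would not break the download.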

Sambit · 04-30-2016, 05:04 PM

Economic Times changed its format again, so the recipe broke.

Recipe below:

Code:

__license__   = 'GPL v3'
__copyright__ = '2008-2014, Karthik <hashkendistro@gmail.com>, Darko Miletic <darko.miletic at gmail.com>'
'''
economictimes.indiatimes.com
'''


from calibre.web.feeds.news import BasicNewsRecipe

class TheEconomicTimes(BasicNewsRecipe):
    title                  = 'The Economic Times India'
    __author__             = 'Karthik <hashkendistro@gmail.com>, Darko Miletic <darko.miletic at gmail.com>'
    description            = 'Financial news from India'
    publisher              = 'economictimes.indiatimes.com'
    category               = 'news, finances, politics, India'
    oldest_article         = 1
    max_articles_per_feed  = 100
    no_stylesheets         = True
    use_embedded_content   = False
    simultaneous_downloads = 1
    encoding               = 'utf-8'
    language               = 'en_IN'
    publication_type       = 'newspaper'
    masthead_url           = 'http://economictimes.indiatimes.com/photo/2676871.cms'
    extra_css              = """
                                 body{font-family: Arial,Helvetica,sans-serif}
                                 .foto_mg{font-size: 60%;
                                          font-weight: 700;}
                                 h1{font-size: 150%;}
                                 artdate{font-size: 60%}
                                 artag{font-size: 60%}
                                 div.storycontent{padding-top: 10px}
                             """
    conversion_options     = {'comment'          : description,
                              'tags'             : category,
                              'publisher'        : publisher,
                              'language'         : language
                             }
    remove_tags_before     = dict(name='article')
    remove_tags_after      = [dict(name='article')]
    keep_only_tags         = [dict(name='h1', attrs={'class':'title'}),
                              dict(name='div', attrs={'class':'bylineFull'}),
                              dict(name='div', attrs={'class':'articleImg'}),
                              dict(name='div', attrs={'class':'artText'})
                             ]
    remove_tags			   = [dict(name='div', attrs={'class':'cmtLinks'}),
                              dict(name='div', attrs={'class':'raltedTopics'}),
                              dict(name='div', attrs={'class':'editorsPick'}),
                              dict(name='div', attrs={'class':'articleImg etSpecial'}),
                              dict(name='div', attrs={'class':'articleImg artAd'}),
                              dict(name='div', attrs={'class':'appPromotion'})
                             ]

    remove_attributes      = ['xmlns']
    feeds                  = [(u'Top Stories', u'http://economictimes.indiatimes.com/rssfeedstopstories.cms'),
                              (u'News', u'http://economictimes.indiatimes.com/News/rssfeeds/1715249553.cms'),
                              (u'Market', u'http://economictimes.indiatimes.com/Markets/markets/rssfeeds/1977021501.cms'),
                              (u'Personal Finance', u'http://economictimes.indiatimes.com/rssfeeds/837555174.cms'),
                              (u'Infotech', u'http://economictimes.indiatimes.com/Infotech/rssfeeds/13357270.cms'),
                              (u'Job', u'http://economictimes.indiatimes.com/Infotech/rssfeeds/107115.cms'),
                              (u'Opinion', u'http://economictimes.indiatimes.com/opinion/opinionshome/rssfeeds/897228639.cms'),
                              (u'Features', u'http://economictimes.indiatimes.com/Features/etfeatures/rssfeeds/1466318837.cms'),
                              (u'Environment', u'http://economictimes.indiatimes.com/rssfeeds/2647163.cms'),
                              (u'NRI', u'http://economictimes.indiatimes.com/rssfeeds/7771250.cms')
                            ]

    # Uses the web print version. For the mobile print version use
    # 'http://m.economictimes.com/PDAET/articleshow/<article_id>' instead.
    def print_version(self, url):
        rest, sep, article_id = url.rpartition('/articleshow/')
        return 'http://economictimes.indiatimes.com/articleshow/' + article_id + '?prtpage=1'

    def get_article_url(self, article):
        rurl = article.get('link', None)
        # Skip feed entries that have no link, and skip the "quickies" URLs
        if not rurl or '/quickieslist/' in rurl or '/quickiearticleshow/' in rurl:
            return None
        return rurl

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        return soup

    def postprocess_html(self, soup, first_fetch):
        return self.adeify_images(soup)


07-17-2014, 04:49 AM	#3
hashken Member Posts: 18 Karma: 10 Join Date: Mar 2014 Device: Kindle Paperwhite 1st Gen	Hi Kovid, your output also has the <div xmlns="" class="storycontent"> tag. It is in the longest line of your output. This tag and its contents form the main portion of the article, and they are simply not appearing in the final .mobi file.

07-17-2014, 04:51 AM	#4
kovidgoyal creator of calibre Posts: 45,724 Karma: 28549306 Join Date: Oct 2006 Location: Mumbai, India Device: Various	I'm confused. Are you saying the content is missing from the processed HTML or from the final book?

07-17-2014, 04:53 AM	#5
kovidgoyal creator of calibre Posts: 45,724 Karma: 28549306 Join Date: Oct 2006 Location: Mumbai, India Device: Various	In any case just add remove_attributes = ['xmlns'] to the recipe to take care of it.

07-17-2014, 04:54 AM	#6
hashken Member Posts: 18 Karma: 10 Join Date: Mar 2014 Device: Kindle Paperwhite 1st Gen	The content is present in the processed HTML. As you can see in my original post, it is present in the index.html in the processed folder. The content is only missing in the final book.

07-17-2014, 08:57 AM	#7
kovidgoyal creator of calibre Posts: 45,724 Karma: 28549306 Join Date: Oct 2006 Location: Mumbai, India Device: Various	See my previous post

07-17-2014, 09:33 AM	#8
hashken Member Posts: 18 Karma: 10 Join Date: Mar 2014 Device: Kindle Paperwhite 1st Gen	Surprisingly, removing the "xmlns" attribute seems to have made everything work. Is this an existing bug, or is it the expected behaviour? If it is expected, why?

07-17-2014, 09:47 AM	#9
kovidgoyal creator of calibre Posts: 45,724 Karma: 28549306 Join Date: Oct 2006 Location: Mumbai, India Device: Various	It is expected behavior. When a tag in an xhtml document declares its namespace to be something other than the XHTML namespace, which is what xmlns="" does, that tag is no longer part of the html document and the converter ignores it.
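The namespace rule described above can be checked outside calibre. The following is only a minimal sketch using Python's standard xml.etree.ElementTree parser, not calibre's converter; the SAMPLE markup and the split_by_namespace helper are hypothetical stand-ins for the processed index.html. It shows that xmlns="" moves the div and everything inside it out of the XHTML namespace, and that dropping the attribute puts them back.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Hypothetical, cut-down stand-in for the processed index.html
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<h1>Headline</h1>'
    '<div xmlns="" class="storycontent"><p>Main text</p></div>'
    '</body></html>'
)

def split_by_namespace(markup):
    """Return (xhtml_tags, other_tags) as lists of local tag names."""
    prefix = '{%s}' % XHTML_NS
    xhtml_tags, other_tags = [], []
    for el in ET.fromstring(markup).iter():
        if el.tag.startswith(prefix):
            # Element belongs to the XHTML namespace
            xhtml_tags.append(el.tag[len(prefix):])
        else:
            # Element is in no namespace (or a different one)
            other_tags.append(el.tag)
    return xhtml_tags, other_tags

# With xmlns="" the storycontent div and its children leave the XHTML namespace
xhtml_tags, other_tags = split_by_namespace(SAMPLE)
print(xhtml_tags, other_tags)

# Roughly what remove_attributes = ['xmlns'] achieves: the attribute is gone,
# so the div inherits the XHTML namespace from the root element again
fixed_xhtml, fixed_other = split_by_namespace(SAMPLE.replace(' xmlns=""', ''))
print(fixed_xhtml, fixed_other)
```

A namespace-aware consumer that only walks XHTML elements would skip the div and its paragraph in the first case, which matches the missing article body in the final book.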

07-17-2014, 09:48 AM	#10
hashken Member Posts: 18 Karma: 10 Join Date: Mar 2014 Device: Kindle Paperwhite 1st Gen	Oh, I didn't know that. Thanks for the prompt replies. Keep up the good work.
