MobileRead Forums - View Single Post - National Geographic Recipe (need some improvement)

gagsays · 05-18-2011, 06:02 PM

I was looking for a NatGeo recipe and guess what? I found none.
I mean, how can anybody (most of you) not miss NatGeo?

So I have concocted this recipe for all you NatGeo fans.
It works great; I have taken special care with the CSS formatting.
But there are two flaws with the recipe, and I am sure somebody will be able to help me with them.
(this is the natgeo feed)

1. The feed contains gallery pages (http://url+*/picture/*), which differ from normal article pages, and handling them gets messy.
Problem: those pages are not formatted according to my CSS code at all.
Also, I looked at the index.html file generated for such an article (a gallery page): it doesn't contain <html>, <body>, or <head> tags, but starts directly with a <div> tag (or some other tag). So I think that, since there is no <head> tag, the style tags are not getting embedded, hence the problem.
So I need some way to selectively embed the <html>, <head>, <body>, and <style> tags in gallery pages (they have /pictures/ in their URL) to correct this. Normal article pages have no such problem; their index.html files contain all the tags.

If this can't be done, how do I skip those pages? Remember, the only way to recognize gallery pages is that 'pictures' is present in the URL.

2. The feed contains a few 'Presented By' links, which are neither article nor gallery (http://url+*/picture/*) pages but ad pages, which I need to skip from the table of contents.
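
For the fallback of skipping gallery pages, something along these lines might work. It is only a sketch, assuming gallery URLs really do contain a /pictures/ (or /picture/) path segment as described above. The helper names is_gallery_url and filter_article_url are made up for illustration, and the example.com URLs are placeholders, not real feed links. It uses Python 3's urllib.parse, which current calibre versions run on.

```python
from urllib.parse import urlparse


def is_gallery_url(url):
    """Return True if the URL path has a 'pictures' or 'picture' segment."""
    segments = urlparse(url).path.lower().split('/')
    return 'pictures' in segments or 'picture' in segments


def filter_article_url(url):
    """Return the URL for normal articles, or None for gallery pages.

    Hypothetical helper: the idea is to call it from an overridden
    get_article_url() in the recipe, so that gallery links come back as None.
    """
    if url is None or is_gallery_url(url):
        return None
    return url


if __name__ == '__main__':
    # Placeholder URLs, not taken from the real feed.
    for link in (
        'http://example.com/news/2011/05/pictures/110518-gallery/',
        'http://example.com/news/2011/05/110518-article/',
    ):
        print(link, '->', filter_article_url(link))
```

Inside the recipe, the idea would be to override get_article_url(self, article), call BasicNewsRecipe.get_article_url(self, article) to get the link, and return filter_article_url(link). As far as I understand, an article with no link and no embedded content is left out of the download, but I have not verified that against this feed.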

And now here's the code.

Code:

from calibre.web.feeds.news import BasicNewsRecipe
class NatGeo(BasicNewsRecipe):
    title          = u'National Geographic'
    oldest_article = 8
    max_articles_per_feed = 20
    encoding              = 'utf8'
    publisher              = 'nationalgeographic.com'
    category               = 'science, nat geo'
    __author__           = 'gagsays'
    masthead_url        = 'http://s.ngeo.com/wpf/sites/themes/global/i/presentation/ng_logo_small.png'
    description           = 'Inspiring people to care about the planet since 1888'
    timefmt = ' [%a, %d %b, %Y]'
    no_stylesheets        = True
    use_embedded_content  = False

    extra_css = '''
        body {color: #000000; font-size: medium;}
        h1 {color: #222222; font-size: large; font-weight: lighter; text-decoration: none; text-align: center; font-family: Georgia,Times New Roman,Times,serif;}
        h2 {color: #454545; font-size: small; font-weight: lighter; text-decoration: none; text-align: justify; font-style: italic; font-family: Georgia,Times New Roman,Times,serif;}
        h3 {color: #555555; font-size: small; font-style: italic; margin-top: 10px;}
        img {margin-bottom: 0.25em; display: block; margin-left: auto; margin-right: auto;}
        a:link,a,.a,href {text-decoration: none; color: #000000;}
        .caption {color: #000000; font-size: xx-small; text-align: justify; font-weight: normal;}
        .credit {color: #555555; font-size: xx-small; text-align: left; font-weight: lighter;}
        p.author,p.publication {color: #000000; font-size: xx-small; text-align: left; display: inline;}
        p.publication_time {color: #000000; font-size: xx-small; text-align: right; text-decoration: underline;}
        p {margin-bottom: 0;}
        p + p {text-indent: 1.5em; margin-top: 0;}
        .hidden {display: none;}
        #page_head {text-transform: uppercase;}
    '''

    # Replace every link that has plain-text content with just its text,
    # so the downloaded articles contain no hyperlinks.
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    remove_tags_before = dict(id='page_head')
    keep_only_tags = [
        dict(name='div', attrs={'id': ['page_head', 'content_mainA']})
    ]
    remove_tags_after = [
        dict(name='div', attrs={'class': ['article_text', 'promo_collection']})
    ]
    remove_tags = [
        dict(name='div', attrs={'class': ['aside', 'primary full_width']}),
        dict(name='div', attrs={'id': ['header_search', 'navigation_mainB_wrap']})
    ]
    feeds = [
        (u'Daily News', u'http://feeds.nationalgeographic.com/ng/News/News_Main')
    ]