I was looking for a natgeo recipe and guess what? I found none.
I mean, how can anybody (most of you) not miss natgeo?
So I have concocted this recipe for all you natgeo fans.
It works great, and I have taken special care of the CSS formatting.
But there are two flaws in the recipe, and I am sure somebody will be able to
help me with them.
(http://feeds.nationalgeographic.com/ng/News/News_Main is the natgeo feed)
- The feed contains gallery pages (http://url+*/picture/*), unlike normal article pages, and handling them gets messy.
Problem: those pages are not formatted according to my CSS at all.
Plus, I looked into the index.html file generated for such an article (a gallery page), and it doesn't contain <html>, <body>, or <head> tags but starts directly with a <div> tag (or some other tag). So I think that, since there is no head tag, the style tags are not getting embedded, hence the problem.
So I need some way to selectively embed the html, head, body, and style tags in gallery pages (they have /pictures/ in their URL) to correct this problem. Normal article pages have no such problem; their index.html file contains all the tags.
If this can't be done, then how do I skip those pages? Remember, the only way to recognize gallery pages is that 'pictures' is present in the URL.
- The feed contains a few 'Presented By' links which are neither article nor gallery (http://url+*/picture/*) pages but ad pages, which I need to skip from the table of contents.
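
For the "skip" route, one idea I have not tested against the live feed is to filter the article list before anything is downloaded. The sketch below is separate from the recipe and only shows the filtering logic. The helper names (is_gallery_url, is_sponsored_title, keep_article, filter_articles) are made up for illustration, the Article tuple is just a stand-in for calibre's article objects, and matching on 'Presented By' in the title is an assumption about how those ad links appear in the feed.

```python
from collections import namedtuple

# Stand-in for calibre's article objects; only .title and .url are assumed.
Article = namedtuple('Article', 'title url')


def is_gallery_url(url):
    """True when 'pictures' appears in the URL (the only gallery marker known)."""
    return 'pictures' in (url or '')


def is_sponsored_title(title):
    """True when the title mentions 'Presented By' (assumed ad marker)."""
    return 'presented by' in (title or '').lower()


def keep_article(article):
    """Keep everything that is neither a gallery page nor an ad link."""
    return not (is_gallery_url(article.url) or is_sponsored_title(article.title))


def filter_articles(articles):
    """Return a new list without gallery and ad entries."""
    return [a for a in articles if keep_article(a)]


# Inside the recipe class, overriding parse_feeds might look like this
# (untested sketch):
#
#     def parse_feeds(self):
#         feeds = BasicNewsRecipe.parse_feeds(self)
#         for feed in feeds:
#             for article in feed.articles[:]:
#                 if not keep_article(article):
#                     feed.articles.remove(article)
#         return feeds
```

Removing the entries in parse_feeds should also keep them out of the table of contents, since the table of contents is built from the same article list.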
And now here's the recipe code.
Code:
from calibre.web.feeds.news import BasicNewsRecipe
class NatGeo(BasicNewsRecipe):
    title = u'National Geographic'
    oldest_article = 8
    max_articles_per_feed = 20
    encoding = 'utf8'
    publisher = 'nationalgeographic.com'
    category = 'science, nat geo'
    __author__ = 'gagsays'
    masthead_url = 'http://s.ngeo.com/wpf/sites/themes/global/i/presentation/ng_logo_small.png'
    description = 'Inspiring people to care about the planet since 1888'
    timefmt = ' [%a, %d %b, %Y]'
    no_stylesheets = True
    use_embedded_content = False
    extra_css = '''
        body {color: #000000;font-size: medium;}
        h1 {color: #222222; font-size: large; font-weight:lighter; text-decoration:none; text-align: center;font-family:Georgia,Times New Roman,Times,serif;}
        h2 {color: #454545; font-size: small; font-weight:lighter; text-decoration:none; text-align: justify; font-style:italic;font-family :Georgia,Times New Roman,Times,serif;}
        h3 {color: #555555; font-size: small; font-style:italic; margin-top: 10px;}
        img{margin-bottom: 0.25em;display:block;margin-left: auto;margin-right: auto;}
        a:link,a,.a,href {text-decoration: none;color: #000000;}
        .caption{color: #000000;font-size: xx-small;text-align: justify;font-weight:normal;}
        .credit{color: #555555;font-size: xx-small;text-align: left;font-weight:lighter;}
        p.author,p.publication{color: #000000;font-size: xx-small;text-align: left;display:inline;}
        p.publication_time{color: #000000;font-size: xx-small;text-align: right;text-decoration: underline;}
        p {margin-bottom: 0;}
        p + p {text-indent: 1.5em;margin-top: 0;}
        .hidden{display:none;}
        #page_head{text-transform:uppercase;}
    '''
    ########################################################
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup
    #######################################################
    remove_tags_before = dict(id='page_head')
    keep_only_tags = [
        dict(name='div',attrs={'id':['page_head','content_mainA']})
    ]
    remove_tags_after = [
        dict(name='div',attrs={'class':['article_text','promo_collection']})
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['aside','primary full_width']})
        ,dict(name='div', attrs={'id':['header_search','navigation_mainB_wrap']})
    ]
    feeds = [
        (u'Daily News', u'http://feeds.nationalgeographic.com/ng/News/News_Main')
    ]
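
For the other route (adding the missing html, head, and body tags to gallery pages), here is a rough, untested idea: wrap the page markup before calibre parses it. I have not confirmed whether the tags are already missing in the downloaded page or get lost later in processing, so this may not be the right place to fix it. The helper name wrap_fragment and its default title are made up for illustration; preprocess_raw_html is the recipe hook I would try it in.

```python
import re

# Matches an opening <html> tag, with or without attributes, in any case.
_HTML_TAG = re.compile(r'<html[\s>]', re.IGNORECASE)


def wrap_fragment(raw_html, title='National Geographic'):
    """Wrap a bare HTML fragment in html/head/body; leave full pages alone."""
    if _HTML_TAG.search(raw_html):
        return raw_html
    return ('<html><head><title>%s</title></head><body>%s</body></html>'
            % (title, raw_html))


# Inside the recipe class, this might be hooked in like so (untested sketch):
#
#     def preprocess_raw_html(self, raw_html, url):
#         if 'pictures' in url:
#             return wrap_fragment(raw_html)
#         return raw_html
```

If the head tag really is the missing piece, extra_css should then have somewhere to be embedded for the gallery pages too.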