05-18-2011, 06:02 PM | #1 |
Member
Posts: 20
Karma: 1000
Join Date: Oct 2009
Device: kindle 3 wifi
|
National Geographic Recipe (need some improvement)
I was looking for a NatGeo recipe and, guess what, I found none.
I mean, how can anybody (most of you, at least) not miss NatGeo? So I have concocted this recipe for all you NatGeo fans. It works great, and I have taken special care with the CSS formatting. But there are two flaws in the recipe, and I am sure somebody will be able to help me with them. (This is the NatGeo feed.)
And now here's the code.
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class NatGeo(BasicNewsRecipe):
    title = u'National Geographic'
    oldest_article = 8
    max_articles_per_feed = 20
    encoding = 'utf8'
    publisher = 'nationalgeographic.com'
    category = 'science, nat geo'
    __author__ = 'gagsays'
    masthead_url = 'http://s.ngeo.com/wpf/sites/themes/global/i/presentation/ng_logo_small.png'
    description = 'Inspiring people to care about the planet since 1888'
    timefmt = ' [%a, %d %b, %Y]'
    no_stylesheets = True
    use_embedded_content = False
    extra_css = '''
        body {color: #000000;font-size: medium;}
        h1 {color: #222222; font-size: large; font-weight:lighter; text-decoration:none; text-align: center;font-family:Georgia,Times New Roman,Times,serif;}
        h2 {color: #454545; font-size: small; font-weight:lighter; text-decoration:none; text-align: justify; font-style:italic;font-family :Georgia,Times New Roman,Times,serif;}
        h3 {color: #555555; font-size: small; font-style:italic; margin-top: 10px;}
        img{margin-bottom: 0.25em;display:block;margin-left: auto;margin-right: auto;}
        a:link,a,.a,href {text-decoration: none;color: #000000;}
        .caption{color: #000000;font-size: xx-small;text-align: justify;font-weight:normal;}
        .credit{color: #555555;font-size: xx-small;text-align: left;font-weight:lighter;}
        p.author,p.publication{color: #000000;font-size: xx-small;text-align: left;display:inline;}
        p.publication_time{color: #000000;font-size: xx-small;text-align: right;text-decoration: underline;}
        p {margin-bottom: 0;}
        p + p {text-indent: 1.5em;margin-top: 0;}
        .hidden{display:none;}
        #page_head{text-transform:uppercase;}
    '''

    ########################################################
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup
    #######################################################

    remove_tags_before = dict(id='page_head')
    keep_only_tags = [
        dict(name='div', attrs={'id': ['page_head', 'content_mainA']})
    ]
    remove_tags_after = [
        dict(name='div', attrs={'class': ['article_text', 'promo_collection']})
    ]
    remove_tags = [
        dict(name='div', attrs={'class': ['aside', 'primary full_width']}),
        dict(name='div', attrs={'id': ['header_search', 'navigation_mainB_wrap']})
    ]
    feeds = [
        (u'Daily News', u'http://feeds.nationalgeographic.com/ng/News/News_Main')
    ]
|
05-18-2011, 06:24 PM | #2 |
creator of calibre
Posts: 44,296
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You can use the postprocess_html method to modify the downloaded html of any page.
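For anyone unsure what that hook looks like, here is a minimal sketch, not a tested fix for this recipe. It is written as a method to paste into the recipe class. The `first_fetch` argument is how I recall the calibre hook being declared, and dropping elements with the class `hidden` is only an illustrative choice (that class name comes from the recipe's own CSS), so treat both as assumptions.

```python
def postprocess_html(self, soup, first_fetch):
    # Sketch of a recipe method: calibre is assumed to call this with the
    # parsed page (a BeautifulSoup tree) after the page has been downloaded.
    # Illustrative processing only: drop every element whose class is 'hidden'.
    for tag in soup.findAll(attrs={'class': 'hidden'}):
        tag.extract()  # detach the element from the tree
    # The (possibly modified) soup is handed back to calibre.
    return soup
```

Whatever processing you do, the method should end by returning the soup.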
|
05-18-2011, 06:28 PM | #3 |
Member
Posts: 20
Karma: 1000
Join Date: Oct 2009
Device: kindle 3 wifi
|
|
05-18-2011, 09:53 PM | #4 |
creator of calibre
Posts: 44,296
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Check the HTML; it should have something in it that identifies it. If you find that something, do the processing.
|
05-19-2011, 03:33 AM | #5 |
Member
Posts: 20
Karma: 1000
Join Date: Oct 2009
Device: kindle 3 wifi
|
OK, how do I check whether <body> or <html> is present in the generated file? It might be simple, but I am new to the soup stuff. I looked into the online API documentation and other recipes, but I still can't figure out what to write (the actual Python syntax) in def postprocess_html(self,soup).
Can somebody give me the required snippet for the logic given below?
if html tag present then {
    return soup
} else {
    embed html and body tags in the html file
    return soup
}
I know other languages such as C/C++ and Java, but I am new to Python and Beautiful Soup. My only experience with Python is writing small recipes by studying other recipes and the online API documentation. So my problem might be trivial: I know what I have to do and how to do it logically, but I just can't figure out the syntax to write. Please help.
|
05-19-2011, 12:13 PM | #6 |
creator of calibre
Posts: 44,296
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
has_body = soup.find('body') is not None
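Building on that check, the if/else logic asked for above could be sketched roughly as follows. This is an untested outline, not calibre's own method: `ensure_html_body` and its `make_soup` parameter are hypothetical names, with `make_soup` standing for whatever callable turns a markup string into a soup (for example the `BeautifulSoup` class your calibre version provides). It also assumes that `str(soup)` gives back the page markup.

```python
def ensure_html_body(soup, make_soup):
    """Return a soup that has <html> and <body> wrappers.

    soup      -- parsed page exposing find(name), as BeautifulSoup does
    make_soup -- hypothetical parameter: callable that parses a markup string
    """
    has_html = soup.find('html') is not None
    has_body = soup.find('body') is not None
    if has_html or has_body:
        # Already a document (or at least partly wrapped): leave it alone.
        return soup
    # Bare fragment: embed it in html and body tags and parse it again.
    return make_soup('<html><body>' + str(soup) + '</body></html>')
```

Inside the recipe you would then call it from postprocess_html and return its result, something like `return ensure_html_body(soup, BeautifulSoup)`.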
|
05-19-2011, 12:21 PM | #7 |
Member
Posts: 20
Karma: 1000
Join Date: Oct 2009
Device: kindle 3 wifi
|
|
Tags |
natgeo, national geographic |