Old 05-18-2011, 06:02 PM   #1
gagsays
Member
Posts: 20
Karma: 1000
Join Date: Oct 2009
Device: kindle 3 wifi
National Geographic Recipe (needs some improvement)

I was looking for a NatGeo recipe and, guess what, I found none.
I mean, how can anybody (most of you, at least) not miss NatGeo?

So I have concocted this recipe for all you NatGeo fans.
It works well, and I have taken special care with the CSS formatting.
But the recipe has two flaws, and I am sure somebody will be able to help me with them.
(The NatGeo feed is the one in the feeds list in the code below.)
  1. The feed contains gallery pages (http://url+*/picture/*), which differ from normal article pages, and handling them gets messy.
    Problem: those pages are not formatted according to my CSS at all.
    I also looked at the index.html file generated for one of these gallery pages: it contains no <html>, <head>, or <body> tags and starts directly with a <div> (or some other) tag. So I think that, because there is no <head> tag, the <style> tags are not being embedded, and that is what causes the problem.
    So I need some way to selectively embed the html, head, body, and style tags in gallery pages (they have /pictures/ in their URL) to correct this. Normal article pages have no such problem; their index.html file contains all the tags.

    If this can't be done, how do I skip those pages? Remember, the only way to recognize gallery pages is that 'pictures' is present in the URL.
  2. The feed contains a few 'Presented By' links that are neither article nor gallery (http://url+*/picture/*) pages but ad pages, which I need to skip from the table of contents.

And now here's the code:

Code:
from calibre.web.feeds.news import BasicNewsRecipe


class NatGeo(BasicNewsRecipe):
    title = u'National Geographic'
    oldest_article = 8
    max_articles_per_feed = 20
    encoding = 'utf8'
    publisher = 'nationalgeographic.com'
    category = 'science, nat geo'
    __author__ = 'gagsays'
    masthead_url = 'http://s.ngeo.com/wpf/sites/themes/global/i/presentation/ng_logo_small.png'
    description = 'Inspiring people to care about the planet since 1888'
    timefmt = ' [%a, %d %b, %Y]'
    no_stylesheets = True
    use_embedded_content = False

    extra_css = '''
        body {color: #000000; font-size: medium;}
        h1 {color: #222222; font-size: large; font-weight: lighter; text-decoration: none; text-align: center; font-family: Georgia,Times New Roman,Times,serif;}
        h2 {color: #454545; font-size: small; font-weight: lighter; text-decoration: none; text-align: justify; font-style: italic; font-family: Georgia,Times New Roman,Times,serif;}
        h3 {color: #555555; font-size: small; font-style: italic; margin-top: 10px;}
        img {margin-bottom: 0.25em; display: block; margin-left: auto; margin-right: auto;}
        a:link, a, .a, href {text-decoration: none; color: #000000;}
        .caption {color: #000000; font-size: xx-small; text-align: justify; font-weight: normal;}
        .credit {color: #555555; font-size: xx-small; text-align: left; font-weight: lighter;}
        p.author, p.publication {color: #000000; font-size: xx-small; text-align: left; display: inline;}
        p.publication_time {color: #000000; font-size: xx-small; text-align: right; text-decoration: underline;}
        p {margin-bottom: 0;}
        p + p {text-indent: 1.5em; margin-top: 0;}
        .hidden {display: none;}
        #page_head {text-transform: uppercase;}
    '''

    def preprocess_html(self, soup):
        # Replace every link that wraps a single text node with that text,
        # so the e-book shows plain text instead of hyperlinks.
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    # Drop everything before the page header, then keep only the header
    # and the main content container.
    remove_tags_before = dict(id='page_head')
    keep_only_tags = [
        dict(name='div', attrs={'id': ['page_head', 'content_mainA']}),
    ]
    remove_tags_after = [
        dict(name='div', attrs={'class': ['article_text', 'promo_collection']}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class': ['aside', 'primary full_width']}),
        dict(name='div', attrs={'id': ['header_search', 'navigation_mainB_wrap']}),
    ]

    feeds = [
        (u'Daily News', u'http://feeds.nationalgeographic.com/ng/News/News_Main'),
    ]
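
The two skip rules described in this post (gallery pages recognized by 'picture' or 'pictures' in the URL, and 'Presented By' ad entries) could be sketched as a small stand-alone predicate. This is only an illustration, not part of the recipe above: `should_skip` and `filter_entries` are hypothetical names, the example.com URLs are made up, and matching ad entries by the words 'Presented By' in the title is an assumption, since the post does not say where that text appears. The functions would still have to be wired into the recipe, for instance from an overridden parse_feeds; that wiring is not shown and is untested here.

```python
def should_skip(url, title=''):
    """Return True for feed entries the recipe would rather leave out."""
    # Gallery pages: the post gives both '/picture/' and '/pictures/' as the
    # URL marker, so match the shared prefix. This is a loose match and would
    # also catch paths such as '/picture-of-the-day/'.
    if '/picture' in url.lower():
        return True
    # Assumption: sponsor entries carry 'Presented By' in their title.
    if 'presented by' in title.lower():
        return True
    return False


def filter_entries(entries):
    """Keep only the (title, url) pairs that should_skip does not reject."""
    return [(title, url) for title, url in entries if not should_skip(url, title)]
```

For example, `filter_entries([('Volcano erupts', 'http://example.com/news/volcano/'), ('Best photos', 'http://example.com/news/pictures/best/')])` keeps only the first pair.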
Old 05-18-2011, 06:24 PM   #2
kovidgoyal
creator of calibre
Posts: 44,296
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can use the postprocess_html method to modify the downloaded html of any page.
Old 05-18-2011, 06:28 PM   #3
gagsays
Member
Quote:
Originally Posted by kovidgoyal View Post
You can use the postprocess_html method to modify the downloaded html of any page.
But how do I do it selectively (for gallery pages only)?
Old 05-18-2011, 09:53 PM   #4
kovidgoyal
creator of calibre
Check the HTML; it should have something in it that identifies it. If you find that something, do the processing.
Old 05-19-2011, 03:33 AM   #5
gagsays
Member
OK, how do I check whether <body> or <html> is present in the generated file? It might be simple, but I am new to BeautifulSoup. I looked at the online API documentation and at other recipes, but I still can't figure out what to write (the actual Python syntax) in def postprocess_html(self,soup).
Can somebody give me the required snippet for the logic given below?
if html tag present then
{
return soup
}
else
{
embed html and body tags in the html file
return soup
}

I know other languages such as C/C++ and Java, but I am new to Python and BeautifulSoup. My only experience with Python is writing small recipes by studying other recipes and the online API documentation. So my problem might be trivial, but I don't know the required syntax.

So I know what I have to do logically; I just can't figure out what to write (the syntax).

Please help.
Old 05-19-2011, 12:13 PM   #6
kovidgoyal
creator of calibre
has_body = soup.find('body') is not None
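
To spell out the if/else logic asked for in post #5, here is a minimal stand-alone sketch. It is not calibre code: it uses Python's standard html.parser on a plain string so that it runs without calibre, whereas inside a recipe the check itself would be the soup.find('body') line above. The names `TagFinder`, `has_tag` and `ensure_document` are hypothetical, and wrapping the fragment by string concatenation is just one assumed way of adding the missing tags.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Records the name of every opening tag seen in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.seen.add(tag)


def has_tag(markup, name):
    """Return True if an opening <name> tag occurs anywhere in markup."""
    finder = TagFinder()
    finder.feed(markup)
    finder.close()
    return name.lower() in finder.seen


def ensure_document(markup, css=''):
    """Wrap a bare fragment in html/head/style/body; leave full pages as-is."""
    if has_tag(markup, 'body'):
        return markup
    return (
        '<html><head><style type="text/css">' + css + '</style></head>'
        '<body>' + markup + '</body></html>'
    )
```

Calling `ensure_document('<div>A photo</div>', 'p {margin: 0;}')` returns a full page with the CSS in its head, while a page that already has a body tag comes back unchanged.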
Old 05-19-2011, 12:21 PM   #7
gagsays
Member
Quote:
Originally Posted by kovidgoyal View Post
has_body = soup.find('body') is not None
thanks
All times are GMT -4.