Custom recipes (archive, read-only) - Page 59

dhiru · 11-20-2009, 09:51 AM

hi kiklop74
could you please make a recipe for the moneycontrol.com RSS feeds:
http://www.moneycontrol.com/rss/latestnews.xml
http://www.moneycontrol.com/rss/allstories.xml

thanks

kiklop74 · 11-20-2009, 10:44 AM

Quote:

Originally Posted by dhiru

hi kiklop74
could you please make a recipe for the moneycontrol.com RSS feeds:
http://www.moneycontrol.com/rss/latestnews.xml
http://www.moneycontrol.com/rss/allstories.xml

thanks

This site is complicated. Just don't have time to fight with badly formed html.

dhiru · 11-20-2009, 11:09 PM

Quote:

Originally Posted by kiklop74

This site is complicated. Just don't have time to fight with badly formed html.

OK, whenever you have time, kindly help. Thanks.

evanmaastrigt · 11-21-2009, 08:00 AM

Kovid was kind enough to add the recipe for the 'Fokke en Sukke' cartoons to the latest version of Calibre. Unfortunately, something went wrong in the conversion from tabs to spaces, breaking the recipe (my bad really, I should not have used tabs in the first place).

Here is the corrected version

fokkeensukke.zip

JayCeeEll · 11-21-2009, 08:19 AM

I am working on some new recipes and I am having trouble with the remove_tags pre-processing routine.

The following script should download just the blog entry and comments, but I am also getting the sidebar contents. What am I doing wrong?

An example article is http://www.badscience.net/2009/11/oh-that-was-quick/

Code:

__license__   = 'GPL v3'
__copyright__ = '2009, JayCeeEll'

from calibre.web.feeds.news import BasicNewsRecipe

class BadScience(BasicNewsRecipe):
    title                 = u'Bad Science'
    language              = 'en'
    __author__            = 'JayCeeEll'
    description           = 'Bad science in the media'
    author                = 'Ben Goldacre'
    publisher             = 'Ben Goldacre'
    category              = 'blog, skepticism'
    oldest_article        = 7
    max_articles_per_feed = 100
    no_stylesheets        = True
    encoding              = 'utf8'
    remove_javascript     = True
    use_embedded_content  = False

    keep_only_tags = [dict(name='div', attrs={'class':'padded'})]
    
    remove_tags = [
                   dict(name='p', attrs={'class':'meta'})
                  ,dict(name='div', attrs={'id':'respond'})
                  ,dict(name='div', attrs={'id':'sidebar_right'})
                  ]

    feeds = [(u'Bad Science'        , u'http://www.badscience.net/feed/'      )]

evanmaastrigt · 11-21-2009, 09:00 AM

The div with id="sidebar_right", which you want to remove, contains a div with class="padded", which you want to keep. I think this confuses Calibre a little.
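
The interaction described above can be sketched without calibre. This is a minimal simulation, assuming that keep_only_tags runs before remove_tags and lifts every matching element, nested or not, into a fresh body; that ordering is not confirmed in this thread. The page snippet and the helper names (matches, keep_only, remove, paragraphs) are made up for illustration and are not calibre API.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the page layout described above.
PAGE = """<body>
  <div class="padded"><p>Blog entry</p></div>
  <div id="sidebar_right">
    <div class="padded"><p>Sidebar links</p></div>
  </div>
</body>"""

KEEP = [dict(name='div', attrs={'class': 'padded'})]
REMOVE = [dict(name='div', attrs={'id': 'sidebar_right'})]


def matches(el, spec):
    """True if the element has the tag name and every attribute in spec."""
    return el.tag == spec['name'] and all(
        el.get(key) == value for key, value in spec['attrs'].items())


def keep_only(body, specs):
    """Assumed behaviour: every match, nested or not, moves into a fresh body."""
    new_body = ET.Element('body')
    for spec in specs:
        for el in list(body.iter()):
            if matches(el, spec):
                new_body.append(el)
    return new_body


def remove(body, specs):
    """Detach every element matching a spec from its parent."""
    for spec in specs:
        for parent in list(body.iter()):
            for child in list(parent):
                if matches(child, spec):
                    parent.remove(child)
    return body


def paragraphs(body):
    """Text of every <p> left in the tree, in document order."""
    return [p.text for p in body.iter('p')]


# Keep first, then remove: the sidebar wrapper is already gone, its inner div is not.
keep_then_remove = paragraphs(remove(keep_only(ET.fromstring(PAGE), KEEP), REMOVE))

# Remove first, then keep: the sidebar disappears together with its inner div.
remove_then_keep = paragraphs(keep_only(remove(ET.fromstring(PAGE), REMOVE), KEEP))
```

Under that assumption, keep_then_remove still contains 'Sidebar links' while remove_then_keep does not. In a real recipe a keep_only_tags spec that matches only the main column might avoid the problem, but that depends on the site's markup, which has not been checked here.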

tranqui69 · 11-22-2009, 05:52 AM

First of all: you're awesome!!!

Could you please make recipes for the RSS feeds of these two Spanish newspapers?

http://www.levante-emv.com/
http://www.publico.es/

They have RSS feeds, but I can't do it myself.

Thank You so much!!

evanmaastrigt · 11-22-2009, 09:01 AM

I am working on a recipe consisting of a couple of RSS feeds and one webpage that needs custom parsing. Articles from both sources have the same structure, so they can all be parsed with the same preprocess_html().
So I thought I'd be clever and did something along the lines of this pseudo-code:

Code:

class MyRecipe(BasicNewsRecipe) :
    INDEX = u'http://example.com'
    feeds = [(u'example', u'http://example.com/rss')]
    
    def parse_index(self) :
        #raise Exception('spam', 'eggs') # This is always raised
        answer = super(MyRecipe, self).parse_index()
        #raise Exception('spam', 'bacon and eggs') # This is never raised, but the feeds _are_  parsed
        
        #  Do my thing with self.INDEX . . .
        
        answer.insert(0, [myTitle, myArticles])
        
        return answer

But this does not work. The call to super.parse_index() never returns, although I expected it to have the same signature as my override.

What am I missing, and is there a workaround?

kovidgoyal · 11-22-2009, 10:20 AM

IIRC, parse_index in the base class is not implemented at all. It will just raise an exception.
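
A small stand-in may make the observed behaviour easier to see. The sketch below does not use calibre: FakeBaseRecipe is a hypothetical class that only mimics what is described in this thread (parse_index raising in the base class), plus one assumption, namely that the caller falls back to the feeds when parse_index is not implemented. That assumption would explain why the second exception was never raised while the feeds were still parsed.

```python
class FakeBaseRecipe:
    """Hypothetical stand-in for BasicNewsRecipe; not calibre code."""
    feeds = []

    def parse_index(self):
        # Per the reply above: not implemented in the base class.
        raise NotImplementedError

    def parse_feeds(self):
        # Simplified: real feed parsing returns richer objects than tuples.
        return [(title, ['article from ' + url]) for title, url in self.feeds]

    def build_index(self):
        # Assumed caller logic: use the feeds when parse_index is missing.
        try:
            return self.parse_index()
        except NotImplementedError:
            return self.parse_feeds()


class MyRecipe(FakeBaseRecipe):
    """The pattern from the question: super().parse_index() raises."""
    feeds = [('example', 'http://example.com/rss')]
    reached = False

    def parse_index(self):
        answer = super().parse_index()  # raises, so nothing below runs
        self.reached = True
        answer.insert(0, ('custom', ['hand-parsed article']))
        return answer


class MyRecipeFixed(FakeBaseRecipe):
    """Possible workaround: extend the feed parsing step instead."""
    feeds = [('example', 'http://example.com/rss')]

    def parse_feeds(self):
        answer = super().parse_feeds()
        answer.insert(0, ('custom', ['hand-parsed article']))
        return answer


broken = MyRecipe()
broken_index = broken.build_index()      # feeds only, custom section lost
fixed_index = MyRecipeFixed().build_index()  # custom section first, then feeds
```

Whether the same workaround carries over to a real recipe is untested here; in calibre the parsed feeds are objects rather than plain tuples, so the hand-parsed section would have to be converted to the same type before being inserted.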

evanmaastrigt · 11-22-2009, 03:57 PM

Quote:

Originally Posted by fortunados

There are links like this one "http://www.farodevigo.es//elementosInt/rss/2" that I can open in firefox and read them as RSS.

snip...

To the point...
I can open and see the RSS with Firefox, but there is no way to do it with calibre; it says failed feed and nothing else.

Here is what I did: I opened the first link in this RSS feed in my browser and was presented with a 'nice' flash movie. You can click this away or sit it out and only after that you can read the article. Any subsequent link from the same feed opens without that flash movie.
Next I destroyed all my cookies and clicked 'reload'. There was the flash movie again.

Nice...

Now, as far as I understand Calibre uses only one instance of a browser to parse pages; and that browser supports cookies. So a possible workaround is to parse the feeds by hand, open the first article manually, ignore the result and let Calibre proceed. As cookies are now set, it should work. Or maybe not, I don't know.

Maybe Kovid can tell if this is feasible.

kovidgoyal · 11-22-2009, 04:03 PM

You can turn off/on cookies and overload get_article_url to avoid the flash movie.

evanmaastrigt · 11-22-2009, 06:14 PM

Quote:

Originally Posted by kovidgoyal

IIRC, parse_index in the base class is not implemented at all. It will just raise an exception.

Weird...

If I can reproduce the behavior I observed, should I open a ticket? Because I think it is a nice-to-have feature. It will open the whole can of worms of backwards compatibility, but hey, is that my problem? ;-)

kovidgoyal · 11-22-2009, 11:27 PM

The way to do this is to override get_article_url.

In get_article_url you fetch the actual page using index_to_soup and check whether the flash movie is on it; if so, return the URL of the actual page.
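
The check described above could look roughly like the sketch below. It is written against the standard library only, so the fetching is passed in as a plain function instead of index_to_soup, and everything about the page markup is an assumption: the thread does not say how the flash page is built, so the object/embed test and the 'skip-intro' link id are hypothetical, as are the names InterstitialScanner and real_article_url.

```python
from html.parser import HTMLParser


class InterstitialScanner(HTMLParser):
    """Notes whether a page embeds a movie and where its skip link points."""

    def __init__(self):
        super().__init__()
        self.has_flash = False
        self.continue_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ('object', 'embed'):
            # Assumption: the flash movie shows up as an object/embed tag.
            self.has_flash = True
        elif tag == 'a' and attrs.get('id') == 'skip-intro':
            # Hypothetical id for the link that leads to the real article.
            self.continue_url = attrs.get('href')


def real_article_url(url, fetch):
    """Return the skip link if the page at url is a flash page, else url."""
    scanner = InterstitialScanner()
    scanner.feed(fetch(url))
    if scanner.has_flash and scanner.continue_url:
        return scanner.continue_url
    return url


# Made-up pages standing in for the network.
PAGES = {
    'http://example.com/a1': (
        '<html><body><embed src="intro.swf">'
        '<a id="skip-intro" href="http://example.com/a1?full=1">Skip</a>'
        '</body></html>'),
    'http://example.com/a2': '<html><body><p>Plain article</p></body></html>',
}

resolved_flash = real_article_url('http://example.com/a1', PAGES.__getitem__)
resolved_plain = real_article_url('http://example.com/a2', PAGES.__getitem__)
```

In a recipe the same decision would sit inside the get_article_url override, with index_to_soup doing the fetch as suggested above; the selectors would need to be replaced with whatever the real page actually contains.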

Spankypoo · 11-23-2009, 05:36 AM

Anyone know of a way to select articles for inclusion/exclusion based on their title?

E.g., I'd like to only pull articles containing the phrase "Calibre r0x0rz" from an RSS feed, and have it exclude the others.

Thanks!

fortunados · 11-23-2009, 05:50 AM

Well, I cannot see any flash in the articles I have read; actually, this is not the problem at all if you check the other articles.

I am just trying to get something in calibre, but I cannot get anything with the address

http://www.farodevigo.es//elementosInt/rss/2

But I can see the RSS page, the code and so on when I open it in Firefox. I don't know if it is related to flash.

I have no clue, but I imagine there is something on the server that checks the browser, or something with Java, and decides whether or not to send the page. But I am just guessing.

If anyone could cook up a recipe or just give me some hints, I would appreciate it.

Regards.

Quote:

Originally Posted by evanmaastrigt

Here is what I did: I opened the first link in this RSS feed in my browser and was presented with a 'nice' flash movie. You can click this away or sit it out and only after that you can read the article. Any subsequent link from the same feed opens without that flash movie.
Next I destroyed all my cookies and clicked 'reload'. There was the flash movie again.

Nice...

Now, as far as I understand Calibre uses only one instance of a browser to parse pages; and that browser supports cookies. So a possible workaround is to parse the feeds by hand, open the first article manually, ignore the result and let Calibre proceed. As cookies are now set, it should work. Or maybe not, I don't know.

Maybe Kovid can tell if this is feasible.


