new recipe all done. and an idea.

marbs · 11-01-2010, 09:36 AM

the idea is that some of the web sites we use for recipes earn money from advertising. if we skip the article page and go straight to the print version, the site loses out. in this recipe, and in all my future ones, i will download the article page before i go to the print version.

so this recipe is ready to be built in.

Spoiler:

Code:

import re

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1283848012(BasicNewsRecipe):
    title          = u'Calcalist'
    __author__     = 'marbs'
    description    = "This is a recipe for Calcalist.co.il. The recipe downloads the article page first so as not to hurt the site's advertising income."
    cover_url      = 'http://ftp5.bizportal.co.il/web/giflib/news/calcalist.JPG'
    language       = 'he'
    # stray commas between the rule blocks removed (they are invalid CSS);
    # the dotted selectors assume these are class names in calibre's generated pages
    extra_css = 'img {max-width:100%;} body{direction: rtl;} title{direction: rtl;} .article_description{direction: rtl;} a.article{direction: rtl;} .calibre_feed_description{direction: rtl;}'
    simultaneous_downloads = 5
    remove_javascript     = True
    timefmt        = '[%a, %d %b, %Y]'
    oldest_article = 1
    max_articles_per_feed = 100
    remove_attributes = ['width']
    keep_only_tags = [dict(name='div', attrs={'id':'articleContainer'})]
    # empty <p>&nbsp;</p> paragraphs are stripped here before parsing
    preprocess_regexps = [
        (re.compile(r'<p>&nbsp;</p>', re.DOTALL|re.IGNORECASE), lambda match: '')
        ]



    feeds          = [(u'\u05d3\u05e3 \u05d4\u05d1\u05d9\u05ea', u'http://www.calcalist.co.il/integration/StoryRss8.xml'),                            
                           (u'24/7', u'http://www.calcalist.co.il/integration/StoryRss3674.xml'), 
                           (u'\u05d1\u05d0\u05d6\u05d6', u'http://www.calcalist.co.il/integration/StoryRss3674.xml'),                            
                           (u'\u05de\u05d1\u05d6\u05e7\u05d9\u05dd', u'http://www.calcalist.co.il/integration/StoryRss184.xml'), 
                           (u'\u05d4\u05e9\u05d5\u05e7', u'http://www.calcalist.co.il/integration/StoryRss2.xml'), 
                           (u'\u05d1\u05d0\u05e8\u05e5', u'http://www.calcalist.co.il/integration/StoryRss14.xml'), 
                           (u'\u05d4\u05db\u05e1\u05e3', u'http://www.calcalist.co.il/integration/StoryRss9.xml'), 
                           (u'\u05e0\u05d3\u05dc"\u05df', u'http://www.calcalist.co.il/integration/StoryRss7.xml'), 
                           (u'\u05e2\u05d5\u05dc\u05dd', u'http://www.calcalist.co.il/integration/StoryRss13.xml'), 
                           (u'\u05e4\u05e8\u05e1\u05d5\u05dd \u05d5\u05e9\u05d9\u05d5\u05d5\u05e7', u'http://www.calcalist.co.il/integration/StoryRss5.xml'), 
                           (u'\u05e4\u05e0\u05d0\u05d9', u'http://www.calcalist.co.il/integration/StoryRss3.xml'), 
                           (u'\u05d8\u05db\u05e0\u05d5\u05dc\u05d5\u05d2\u05d9', u'http://www.calcalist.co.il/integration/StoryRss4.xml'), 
                           (u'\u05e2\u05e1\u05e7\u05d9 \u05e1\u05e4\u05d5\u05e8\u05d8', u'http://www.calcalist.co.il/integration/StoryRss18.xml')]
       
    def print_version(self, url):
        # the next two lines open the original article page before switching
        # to the print version; remove them if the download feels too slow
        br = self.get_browser()
        br.open(url)
        self.log('ORG URL IS: ', url)
        split1 = url.split("-")
        self.log('THE SPLIT IS: ', split1)
        print_url = 'http://www.calcalist.co.il/Ext/Comp/ArticleLayout/CdaArticlePrintPreview/1,2506,L-' + split1[1]
        self.log('THIS URL WILL PRINT: ', print_url)  # debug output showing the url that will be returned
        return print_url

kovidgoyal · 11-01-2010, 01:29 PM

Note that downloading the article page will almost certainly not help. Most ad systems rely on javascript to fetch the ad once the page has loaded, and since the news download system does not execute javascript, the ad view is never registered.

Generally speaking, most web based ad systems rely on the browser reporting back to the ad server. Since ebooks do not support javascript and are often viewed in a context without an internet connection, a web based ad system is unlikely to work for them.

marbs · 11-01-2010, 05:01 PM

i see. i will leave it like this, and if anyone using the recipe feels it is too slow, they can remove the 2 lines of code.

in any case, it is a really good recipe, if i must say so myself.

TonytheBookworm · 11-01-2010, 08:24 PM

All my years I have stripped ads from the websites I view using abp and so forth, but then you wanna add them to an ebook recipe?

I see amazon in the near future making the screensaver screen an ad. Again, when/if that happens I will install a jailbreak and remove that. If you really don't want the website to suffer then consider sending them a paypal donation, but ads? boo hiss on that!

marbs · 11-02-2010, 03:47 AM

i agree with not wanting to see ads. but when the site goes to advertisers it says "we have 1 million visitors a month, an ad will cost some amount". that is what keeps them in business. now, i understand there are more sophisticated ways of measuring ads, but i am sure the number of times a page is browsed to is a factor.

all i did was add two lines in print_version that open the original article before getting the print version. if it takes the recipe a couple more minutes to download, so be it. my computer can turn on on its own a few minutes earlier.
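
for anyone who wants to try the url rewrite outside calibre, here is a minimal sketch. it assumes article urls have a single "L-<id>" part, like the recipe does. the names to_print_url, PRINT_PREFIX and the fetch hook are made up for illustration and are not part of calibre, and the sample url in the comment is only an example of the expected shape.

```python
# Prefix of the print-preview page, copied from the recipe's print_version.
PRINT_PREFIX = ('http://www.calcalist.co.il/Ext/Comp/ArticleLayout/'
                'CdaArticlePrintPreview/1,2506,L-')


def to_print_url(url, fetch=None):
    """Return the print-preview URL for a Calcalist article URL.

    fetch is a hypothetical hook: when given, it is called with the
    original article URL first, mirroring the two extra lines in
    print_version. Leave it as None to skip the extra download.
    """
    if fetch is not None:
        fetch(url)
    # Same split as the recipe: the piece after the first '-' is the
    # article id plus the trailing part of the file name.
    parts = url.split('-')
    if len(parts) < 2:
        raise ValueError('no "-" in article URL: %r' % (url,))
    return PRINT_PREFIX + parts[1]


# Example shape of an article url (illustrative only):
#   http://www.calcalist.co.il/local/articles/0,7340,L-3424025,00.html
```

passing fetch=None gives the faster behaviour without the extra page load; passing something like a browser's open method keeps the "visit the article first" behaviour.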

i want to support my news sites, and it makes no difference for the end file i get...


