#2326
Zealot
Posts: 146
Karma: 189664
Join Date: Feb 2009
Device: Glo HD, Aura H20, PRS-T1
I would like a custom recipe to download print articles from thecolumbian.com. I tried to modify a recipe to add "?print" after each URL but failed. For any article you visit at thecolumbian.com, you simply append "?print" (without the quotation marks) to view the print edition. I would like a recipe for all of the RSS feeds on the site, if possible, using the print versions.
Example: take http://www.columbian.com/news/2010/j...fort-festival/ and just type ?print after the trailing slash to get the print edition: http://www.columbian.com/news/2010/j...estival/?print
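A sketch of the relevant hook, assuming a standard calibre recipe: BasicNewsRecipe subclasses can override print_version() to map each article URL to its print URL. It is written here as a plain function for illustration; in a real recipe it would be def print_version(self, url): on the recipe class.

```python
def print_version(url):
    # Map a thecolumbian.com article URL to its print edition
    # by appending the '?print' query string.
    if url.endswith('?print'):
        return url  # already the print edition
    return url + '?print'
```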
#2327
Zealot
Posts: 108
Karma: 6066
Join Date: Apr 2010
Location: Singapore
Device: iPad Air, Kindle DXG, Kindle Paperwhite
Quote:
Recipe for Technology Review: updated to remove the Flash (Macromedia) advertisement.

@Kovid: I have updated the recipe for Alternet as well, removing the "width" attribute so that it displays properly on reading devices. https://www.mobileread.com/forums/sho...postcount=2325

Last edited by rty; 07-17-2010 at 09:14 AM.
#2328
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2010
Location: Ankara, Turkey
Device: PRS-300
#2329
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
#2330
Junior Member
Posts: 1
Karma: 10
Join Date: Jul 2010
Device: iphone and stanza
Custom Recipe Request
I would like a recipe for The Tampa Tribune. I'm having a hard time following the instructions myself, so maybe one of you gurus can help me out. Thanks!
http://www.tampatrib.com/
#2331
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2010
Device: nook
Has anyone had a chance to look at relevantmagazine.com?
#2332
Enthusiast
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none
Quote:
Thanks in advance.
#2333
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2010
Device: Kindle DX
Hello!
I asked for this before; maybe I didn't ask nicely enough, or nobody was available (or able) to do it. Could somebody be so kind as to write a recipe for this: http://www.realitatea.net/rss.html ? They probably have the best RSS feeds for the best Romanian news. I would do it myself, but I was never any good at something this deep. Your support is greatly appreciated.
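No recipe was posted in reply here, but a minimal skeleton to start from might look like the following. The feed name and URL are placeholders (the real ones must be copied from http://www.realitatea.net/rss.html), and the import falls back to a stub so the class can be inspected outside calibre:

```python
# Minimal starting-point recipe for realitatea.net (a sketch, untested).
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    BasicNewsRecipe = object  # stub so this file can be examined outside calibre

class RealitateaRecipe(BasicNewsRecipe):
    title = u'Realitatea.net'
    language = 'ro'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    # Placeholder feed entry -- replace with the actual feed URLs
    # listed on http://www.realitatea.net/rss.html
    feeds = [(u'Stiri', u'http://www.realitatea.net/rss/placeholder.xml')]
```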
#2334
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2010
Device: Nook
UNCLE!!
OK, I have tried to figure out what the heck you guys are doing for other feeds and apply it to mine, but I ain't that smart!!
Here is my half-finished recipe, if someone would be so kind as to take a look and tell me how I can get this website minus all the crap!! I have the print pages, but I couldn't figure out how to do the find/replace to change two different parts of the URL. Thanks! Code:
class AdvancedUserRecipe1279635146(BasicNewsRecipe):
    title = u'EMS1'
    oldest_article = 7
    max_articles_per_feed = 100
    use_embedded_content = False
    no_stylesheets = True

    feeds = [(u'columnist', u'http://www.ems1.com/ems-rss-feeds/columnists.xml'),
             (u'topics', u'http://www.ems1.com/ems-rss-feeds/topics.xml'),
             (u'most popular', u'http://www.ems1.com/ems-rss-feeds/most-popular-articles.xml'),
             (u'EMS Tips', u'http://www.ems1.com/ems-rss-feeds/tips.xml'),
             (u'Daily news', u'http://www.ems1.com/ems-rss-feeds/news.xml')]

    def print_version(self, url):
        baseurl = url.rpartition('/?')[0]
        turl = baseurl.partition('/reviews/')[2]
        return 'http://www.ems1.com/print.asp?act=print&vid=' + turl
#2335
Zealot
Posts: 108
Karma: 6066
Join Date: Apr 2010
Location: Singapore
Device: iPad Air, Kindle DXG, Kindle Paperwhite
Quote:
Take one article, for example: http://www.ems1.com/fire-ems/articles/852270-EMT-with-firemans-key-accused-of-NY-sex-attacks/. The print version of this article is http://www.ems1.com/print.asp?act=print&vid=852270

Your base URL for the print version should therefore be 'http://www.ems1.com/print.asp?act=print&vid='. You need to append to this base URL the number found in the original article URL, i.e. 852270. To extract this number, split the URL using "/" and "-" as the delimiters.
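As a sketch of the extraction described above (not the exact code from the thread), the article id can be pulled out by splitting on "/", splitting each part again on "-", and keeping the first purely numeric token; the helper names here are illustrative:

```python
def extract_article_id(url):
    # For 'http://www.ems1.com/fire-ems/articles/852270-EMT-.../',
    # splitting on '/' yields a part '852270-EMT-...'; splitting that
    # on '-' makes the leading token the numeric article id.
    for part in url.split('/'):
        token = part.split('-')[0]
        if token.isdigit():
            return token
    return None

def print_url(url):
    # Append the extracted id to the print-version base URL.
    return 'http://www.ems1.com/print.asp?act=print&vid=' + extract_article_id(url)
```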
#2336
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2009
Device: Sony Reader PRS-700BC
Recipe for media.daum.net (Korean news portal)
I'm not sure if this thread is the right place to post my recipe, but here it is:
Code:
import re
from datetime import date, timedelta
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, NavigableString, Comment

class MediaDaumRecipe(BasicNewsRecipe):
    title = u'\uBBF8\uB514\uC5B4 \uB2E4\uC74C \uC624\uB298\uC758 \uC8FC\uC694 \uB274\uC2A4'
    language = 'ko'
    max_articles = 100
    timefmt = ''
    masthead_url = 'http://img-media.daum-img.net/2010ci/service_news.gif'
    cover_margins = (18, 18, 'grey99')
    no_stylesheets = True
    remove_tags_before = dict(id='GS_con')
    remove_tags_after = dict(id='GS_con')
    remove_tags = [dict(attrs={'class': ['bline', 'GS_vod']}),
                   dict(id=['GS_swf_poll', 'ad250']),
                   dict(name=['script', 'noscript', 'style', 'object'])]

    preprocess_regexps = [
        (re.compile(r'<\s+', re.DOTALL|re.IGNORECASE), lambda match: '< '),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*){3,}', re.DOTALL|re.IGNORECASE), lambda match: ''),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*</div>', re.DOTALL|re.IGNORECASE), lambda match: '</div>'),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*</p>', re.DOTALL|re.IGNORECASE), lambda match: '</p>'),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*</td>', re.DOTALL|re.IGNORECASE), lambda match: '</td>'),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*</strong>', re.DOTALL|re.IGNORECASE), lambda match: '</strong>'),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*</b>', re.DOTALL|re.IGNORECASE), lambda match: '</b>'),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*</em>', re.DOTALL|re.IGNORECASE), lambda match: '</em>'),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*</i>', re.DOTALL|re.IGNORECASE), lambda match: '</i>'),
        (re.compile(u'\(\uB05D\)[ \t\r\n]*<br[^>]*>.*</div>', re.DOTALL|re.IGNORECASE), lambda match: '</div>'),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*<div', re.DOTALL|re.IGNORECASE), lambda match: '<div'),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*<p', re.DOTALL|re.IGNORECASE), lambda match: '<p'),
        (re.compile(r'(<br[^>]*>[ \t\r\n]*)*<table', re.DOTALL|re.IGNORECASE), lambda match: '<table'),
        (re.compile(r'<strong>(<br[^>]*>[ \t\r\n]*)*', re.DOTALL|re.IGNORECASE), lambda match: '<strong>'),
        (re.compile(r'<b>(<br[^>]*>[ \t\r\n]*)*', re.DOTALL|re.IGNORECASE), lambda match: '<b>'),
        (re.compile(r'<em>(<br[^>]*>[ \t\r\n]*)*', re.DOTALL|re.IGNORECASE), lambda match: '<em>'),
        (re.compile(r'<i>(<br[^>]*>[ \t\r\n]*)*', re.DOTALL|re.IGNORECASE), lambda match: '<i>'),
        (re.compile(u'(<br[^>]*>[ \t\r\n]*)*(\u25B6|\u25CF|\u261E|\u24D2|\(c\))*\[[^\]]*(\u24D2|\(c\)|\uAE30\uC0AC|\uC778\uAE30[^\]]*\uB274\uC2A4)[^\]]*\].*</div>', re.DOTALL|re.IGNORECASE), lambda match: '</div>'),
    ]

    def parse_index(self):
        today = date.today()
        articles = []
        articles = self.parse_list_page(articles, today)
        articles = self.parse_list_page(articles, today - timedelta(1))
        return [('\uBBF8\uB514\uC5B4 \uB2E4\uC74C \uC624\uB298\uC758 \uC8FC\uC694 \uB274\uC2A4', articles)]

    def parse_list_page(self, articles, date):
        if len(articles) >= self.max_articles:
            return articles
        for page in range(1, 10):
            soup = self.index_to_soup('http://media.daum.net/primary/total/list.html?cateid=100044&date=%(date)s&page=%(page)d'
                                      % {'date': date.strftime('%Y%m%d'), 'page': page})
            done = True
            for item in soup.findAll('dl'):
                dt = item.find('dt', {'class': 'tit'})
                dd = item.find('dd', {'class': 'txt'})
                if dt is None:
                    break
                a = dt.find('a', href=True)
                url = 'http://media.daum.net/primary/total/' + a['href']
                title = self.tag_to_string(dt)
                if dd is None:
                    description = ''
                else:
                    description = self.tag_to_string(dd)
                articles.append(dict(title=title, description=description, url=url, content=''))
                done = len(articles) >= self.max_articles
                if done:
                    break
            if done:
                break
        return articles

    def preprocess_html(self, soup):
        return self.strip_anchors(soup)

    def strip_anchors(self, soup):
        for para in soup.findAll(True):
            aTags = para.findAll('a')
            for a in aTags:
                if a.img is None:
                    a.replaceWith(a.renderContents().decode('utf-8', 'replace'))
        return soup

As a backup, I also uploaded this recipe to http://pastebin.com/mEptXLsN

Last edited by trustin; 07-22-2010 at 12:41 PM.
Reason: Fixed more bugs
#2337
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Code:
reverse_article_order = True
#2338
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2010
Device: Nook
uncle uncle
rty,
thanks for your help, but I'm still at a loss. I added the print-page lines and now get less output. I don't think I set the split up right (copied and pasted from the Tech Review recipe and altered). Code:
class AdvancedUserRecipe1279635146(BasicNewsRecipe):
    title = u'EMS1'
    oldest_article = 7
    max_articles_per_feed = 100
    use_embedded_content = False

    feeds = [(u'columnist', u'http://www.ems1.com/ems-rss-feeds/columnists.xml'),
             (u'topics', u'http://www.ems1.com/ems-rss-feeds/topics.xml'),
             (u'most popular', u'http://www.ems1.com/ems-rss-feeds/most-popular-articles.xml'),
             (u'EMS Tips', u'http://www.ems1.com/ems-rss-feeds/tips.xml'),
             (u'Daily news', u'http://www.ems1.com/ems-rss-feeds/news.xml')]

    def print_version(self, url):
        baseurl = 'http://www.ems1.com/print.asp?act=print&vid='
        split1 = string.split(url, "/")
        xxx = split1[4]
        split2 = string.split(xxx, "-")
        s = baseurl + split2[0]
        return s
#2339
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
You can fix that with: import string from calibre.web.feeds.news import BasicNewsRecipe Next, your xxx=split1 [4] is wrong. Worse, it sometimes should be xxx=split1[5] and other times should be xxx=split1[6] You need to test the result of the split2 to see if it's an integer. There's lots of ways to do it. I used a try/except and integer conversion. I also changed the split, so the import of string is not needed, but I left it in, in case you want to use it. Note that this only works if the number you need is in position 5 or 6. I didn't test all the recipe to see if it's ever in another location in the URL Try this: Spoiler:
Last edited by Starson17; 07-22-2010 at 04:21 PM.
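The spoiler's contents aren't reproduced in this archive, but based on the description above (split on "/", then check positions 5 and 6 for a leading integer via try/except), a sketch of such a print_version might look like this; it is written as a plain function for illustration, whereas the real recipe would define it as a method, def print_version(self, url):.

```python
def print_version(url):
    # Split the article URL on '/'; the article id is the leading
    # '-'-separated token of the part at position 5 or 6. Use an
    # integer conversion inside try/except to find which one it is.
    parts = url.split('/')
    for idx in (5, 6):
        if idx >= len(parts):
            continue
        candidate = parts[idx].split('-')[0]
        try:
            int(candidate)
        except ValueError:
            continue
        return 'http://www.ems1.com/print.asp?act=print&vid=' + candidate
    return url  # fall back to the original URL if no id is found
```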
#2340
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2010
Device: Nook
Thank you, Starson17. This works fine, and I have more to fix. I'm sure I will post with more questions.