#1201
Little Fuzzy Soldier
Posts: 580
Karma: 5711
Join Date: Sep 2008
Location: Nowhere in particular.
Device: cybook gen3, htc hero, ipaq 214
Would it be possible to make a recipe for readitlaterlist.com, please? Thanks.
#1202
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
Wall Street Journal (free)
I have updated this recipe (thanks to kiklop74 and evanmaastrigt for their suggestions) to improve formatting and to limit article downloads according to oldest_article. I have also improved the tag filtering to remove extraneous content, and moved the customization area to the top of the recipe.
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
'''
online.wsj.com
'''
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, NavigableString
from datetime import timedelta, datetime, date

class WSJ(BasicNewsRecipe):
    # formatting adapted from original recipe by Kovid Goyal and Sujata Raman
    title = u'Wall Street Journal (free)'
    __author__ = 'Nick Redding'
    language = 'en'
    description = ('All the free content from the Wall Street Journal '
                   '(business, financial and political news)')
    no_stylesheets = True
    timefmt = ' [%b %d]'

    # customization notes: delete sections you are not interested in
    # set omit_paid_content to False if you want the paid content article snippets
    # set oldest_article to the maximum number of days back from today to include articles
    sectionlist = [
        ['/home-page','Front Page'],
        ['/public/page/news-opinion-commentary.html','Commentary'],
        ['/public/page/news-global-world.html','World News'],
        ['/public/page/news-world-business.html','US News'],
        ['/public/page/news-business-us.html','Business'],
        ['/public/page/news-financial-markets-stock.html','Markets'],
        ['/public/page/news-tech-technology.html','Technology'],
        ['/public/page/news-personal-finance.html','Personal Finance'],
        ['/public/page/news-lifestyle-arts-entertainment.html','Life & Style'],
        ['/public/page/news-real-estate-homes.html','Real Estate'],
        ['/public/page/news-career-jobs.html','Careers'],
        ['/public/page/news-small-business-marketing.html','Small Business']
    ]
    oldest_article = 2
    omit_paid_content = True

    extra_css = '''h1{font-size:large; font-family:Times,serif;}
                   h2{font-family:Times,serif; font-size:small; font-style:italic;}
                   .subhead{font-family:Times,serif; font-size:small; font-style:italic;}
                   .insettipUnit{font-family:Times,serif; font-size:xx-small;}
                   .targetCaption{font-size:x-small; font-family:Times,serif; font-style:italic; margin-top: 0.25em;}
                   .article{font-family:Times,serif; font-size:x-small;}
                   .tagline{font-size:xx-small;}
                   .dateStamp{font-family:Times,serif;}
                   h3{font-family:Times,serif; font-size:xx-small;}
                   .byline{font-family:Times,serif; font-size:xx-small; list-style-type: none;}
                   .metadataType-articleCredits{list-style-type: none;}
                   h6{font-family:Times,serif; font-size:small; font-style:italic;}
                   .paperLocation{font-size:xx-small;}'''

    remove_tags_before = dict({'class':re.compile('^articleHeadlineBox')})
    remove_tags = [
        dict({'id':re.compile('^articleTabs_tab_')}),
        #dict(id=["articleTabs_tab_article", "articleTabs_tab_comments",
        #         "articleTabs_tab_interactive","articleTabs_tab_video",
        #         "articleTabs_tab_map","articleTabs_tab_slideshow"]),
        {'class': ['footer_columns','network','insetCol3wide','interactive','video',
                   'slideshow','map','insettip','insetClose','more_in','insetContent',
                   # 'articleTools_bottom','articleTools_bottom mjArticleTools',
                   'aTools','tooltip','adSummary','nav-inline','insetFullBracket']},
        dict({'class':re.compile('^articleTools_bottom')}),
        dict(rel='shortcut icon')
    ]
    remove_tags_after = [dict(id="article_story_body"), {'class':"article story"}]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        return br

    def preprocess_html(self,soup):
        # check if article is too old
        datetag = soup.find('li',attrs={'class' : re.compile("^dateStamp")})
        if datetag:
            dateline_string = self.tag_to_string(datetag,False)
            date_items = dateline_string.split(',')
            datestring = date_items[0]+date_items[1]
            article_date = datetime.strptime(datestring.title(),"%B %d %Y")
            earliest_date = date.today() - timedelta(days=self.oldest_article)
            if article_date.date() < earliest_date:
                self.log("Skipping article dated %s" % datestring)
                return None
            datetag.parent.extract()

            # place dateline in article heading
            bylinetag = soup.find('h3','byline')
            if bylinetag:
                h3bylinetag = bylinetag
            else:
                bylinetag = soup.find('li','byline')
                if bylinetag:
                    h3bylinetag = bylinetag.h3
                    if not h3bylinetag:
                        h3bylinetag = bylinetag
                    bylinetag = bylinetag.parent
            if bylinetag:
                if h3bylinetag.a:
                    bylinetext = 'By '+self.tag_to_string(h3bylinetag.a,False)
                else:
                    bylinetext = self.tag_to_string(h3bylinetag,False)
                h3byline = Tag(soup,'h3',[('class','byline')])
                if bylinetext.isspace() or (bylinetext == ''):
                    h3byline.insert(0,NavigableString(date_items[0]+','+date_items[1]))
                else:
                    h3byline.insert(0,NavigableString(bylinetext+u'\u2014'+date_items[0]+','+date_items[1]))
                bylinetag.replaceWith(h3byline)
            else:
                headlinetag = soup.find('div',attrs={'class' : re.compile("^articleHeadlineBox")})
                if headlinetag:
                    dateline = Tag(soup,'h3',[('class','byline')])
                    dateline.insert(0,NavigableString(date_items[0]+','+date_items[1]))
                    headlinetag.insert(len(headlinetag),dateline)
        else:
            # if no date tag, don't process this page--it's not a news item
            return None

        # this gets rid of the annoying superfluous bullet symbol preceding columnist bylines
        ultag = soup.find('ul',attrs={'class' : 'cMetadata metadataType-articleCredits'})
        if ultag:
            a = ultag.h3
            if a:
                ultag.replaceWith(a)
        return soup

    def parse_index(self):
        articles = {}
        key = None
        ans = []

        def parse_index_page(page_name,page_title):

            def article_title(tag):
                atag = tag.find('h2')  # title is usually in an h2 tag
                if not atag:           # if not, get text from the a tag
                    atag = tag.find('a',href=True)
                    if not atag:
                        return ''
                    t = self.tag_to_string(atag,False)
                    if t == '':
                        # sometimes the title is in the second a tag
                        atag.extract()
                        atag = tag.find('a',href=True)
                        if not atag:
                            return ''
                        return self.tag_to_string(atag,False)
                    return t
                return self.tag_to_string(atag,False)

            def article_author(tag):
                atag = tag.find('strong')  # author is usually in a strong tag
                if not atag:
                    atag = tag.find('h4')  # if not, look for an h4 tag
                if not atag:
                    return ''
                return self.tag_to_string(atag,False)

            def article_summary(tag):
                atag = tag.find('p')
                if not atag:
                    return ''
                subtag = atag.strong
                if subtag:
                    subtag.extract()
                return self.tag_to_string(atag,False)

            def article_url(tag):
                atag = tag.find('a',href=True)
                if not atag:
                    return ''
                url = re.sub(r'\?.*', '', atag['href'])
                return url

            def handle_section_name(tag):
                # turns a tag into a section name with special processing
                # for What's News, U.S., World & U.S. and World
                s = self.tag_to_string(tag,False)
                if ("What" in s) and ("News" in s):
                    s = "What's News"
                elif (s == "U.S.") or (s == "World & U.S.") or (s == "World"):
                    s = s + " News"
                return s

            mainurl = 'http://online.wsj.com'
            pageurl = mainurl+page_name
            #self.log("Page url %s" % pageurl)
            soup = self.index_to_soup(pageurl)
            # find each instance of div with class including "headlineSummary"
            for divtag in soup.findAll('div',attrs={'class' : re.compile("^headlineSummary")}):
                # divtag contains all article data as ul's and li's
                # first, check if there is an h3 tag which provides a section name
                stag = divtag.find('h3')
                if stag:
                    if stag.parent['class'] == 'dynamic':
                        # a carousel of articles is too complex to extract a section name
                        # for each article, so we'll just call the section "Carousel"
                        section_name = 'Carousel'
                    else:
                        section_name = handle_section_name(stag)
                else:
                    section_name = "What's News"
                #self.log("div Section %s" % section_name)
                # find each top-level ul in the div
                # we don't restrict to class = newsItem because the section_name
                # sometimes changes via a ul tag inside the div
                for ultag in divtag.findAll('ul',recursive=False):
                    stag = ultag.find('h3')
                    if stag:
                        if stag.parent.name == 'ul':
                            # section name has changed
                            section_name = handle_section_name(stag)
                            #self.log("ul Section %s" % section_name)
                            # delete the h3 tag so it doesn't get in the way
                            stag.extract()
                    # find each top-level li in the ul
                    for litag in ultag.findAll('li',recursive=False):
                        stag = litag.find('h3')
                        if stag:
                            # section name has changed
                            section_name = handle_section_name(stag)
                            #self.log("li Section %s" % section_name)
                            # delete the h3 tag so it doesn't get in the way
                            stag.extract()
                        # if there is a ul tag inside the li it is superfluous;
                        # it is probably a list of related articles
                        utag = litag.find('ul')
                        if utag:
                            utag.extract()
                        # now skip paid subscriber articles if desired
                        subscriber_tag = litag.find(text="Subscriber Content")
                        if subscriber_tag:
                            if self.omit_paid_content:
                                continue
                        # delete the tip div so it doesn't get in the way
                        tiptag = litag.find("div", {"class" : "tipTargetBox"})
                        if tiptag:
                            tiptag.extract()
                        h1tag = litag.h1
                        # if there's an h1 tag, its parent is a div which should replace
                        # the li tag for the analysis
                        if h1tag:
                            litag = h1tag.parent
                        h5tag = litag.h5
                        if h5tag:
                            # section name has changed
                            section_name = self.tag_to_string(h5tag,False)
                            #self.log("h5 Section %s" % section_name)
                            # delete the h5 tag so it doesn't get in the way
                            h5tag.extract()
                        url = article_url(litag)
                        if url == '':
                            continue
                        if url.startswith("/article"):
                            url = mainurl+url
                        if not url.startswith("http://online.wsj.com"):
                            continue
                        if not url.endswith(".html"):
                            continue
                        if 'video' in url:
                            continue
                        title = article_title(litag)
                        if title == '':
                            continue
                        #self.log("URL %s" % url)
                        #self.log("Title %s" % title)
                        pubdate = ''
                        #self.log("Date %s" % pubdate)
                        author = article_author(litag)
                        if author == '':
                            author = section_name
                        elif author == section_name:
                            author = ''
                        else:
                            author = section_name+': '+author
                        #if not author == '':
                        #    self.log("Author %s" % author)
                        description = article_summary(litag)
                        #if not description == '':
                        #    self.log("Description %s" % description)
                        if not articles.has_key(page_title):
                            articles[page_title] = []
                        articles[page_title].append(
                            dict(title=title,url=url,date=pubdate,
                                 description=description,author=author,content=''))

        for page_name,page_title in self.sectionlist:
            parse_index_page(page_name,page_title)
            ans.append(page_title)

        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans
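For example, to keep only three sections, reach back three days, and keep the paid-content snippets, the customization block at the top of the recipe would be edited like this (the values here are just illustrative):

Code:
    # customization area: keep only the sections you want
    sectionlist = [
        ['/home-page','Front Page'],
        ['/public/page/news-financial-markets-stock.html','Markets'],
        ['/public/page/news-tech-technology.html','Technology']
    ]
    oldest_article = 3         # include articles up to 3 days old
    omit_paid_content = False  # keep snippets of subscriber-only articles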
#1203
Junior Member
Posts: 8
Karma: 10
Join Date: Jul 2009
Location: Massachusetts
Device: nook
Has the Wall Street Journal (US) [subscription] recipe stopped working for anyone else? Thanks.
#1204
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
Canadian Newspapers--CanWest chain
The CanWest chain of Canadian newspapers all use the same web format. Here is a recipe that will handle any of them--just un-comment the three lines in the header corresponding to the paper you want (see the example after the recipe).
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
'''
www.canada.com
'''
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag

class CanWestPaper(BasicNewsRecipe):

    # un-comment the following three lines for the Victoria Times Colonist
    #title = u'Victoria Times Colonist'
    #url_prefix = 'http://www.timescolonist.com'
    #description = u'News from Victoria, BC'

    # un-comment the following three lines for the Vancouver Province
    #title = u'Vancouver Province'
    #url_prefix = 'http://www.theprovince.com'
    #description = u'News from Vancouver, BC'

    # un-comment the following three lines for the Vancouver Sun
    #title = u'Vancouver Sun'
    #url_prefix = 'http://www.vancouversun.com'
    #description = u'News from Vancouver, BC'

    # un-comment the following three lines for the Edmonton Journal
    #title = u'Edmonton Journal'
    #url_prefix = 'http://www.edmontonjournal.com'
    #description = u'News from Edmonton, AB'

    # un-comment the following three lines for the Calgary Herald
    #title = u'Calgary Herald'
    #url_prefix = 'http://www.calgaryherald.com'
    #description = u'News from Calgary, AB'

    # un-comment the following three lines for the Regina Leader-Post
    #title = u'Regina Leader-Post'
    #url_prefix = 'http://www.leaderpost.com'
    #description = u'News from Regina, SK'

    # un-comment the following three lines for the Saskatoon Star-Phoenix
    #title = u'Saskatoon Star-Phoenix'
    #url_prefix = 'http://www.thestarphoenix.com'
    #description = u'News from Saskatoon, SK'

    # un-comment the following three lines for the Windsor Star
    #title = u'Windsor Star'
    #url_prefix = 'http://www.windsorstar.com'
    #description = u'News from Windsor, ON'

    # un-comment the following three lines for the Ottawa Citizen
    #title = u'Ottawa Citizen'
    #url_prefix = 'http://www.ottawacitizen.com'
    #description = u'News from Ottawa, ON'

    # un-comment the following three lines for the Montreal Gazette
    #title = u'Montreal Gazette'
    #url_prefix = 'http://www.montrealgazette.com'
    #description = u'News from Montreal, QC'

    language = 'en_CA'
    __author__ = 'Nick Redding'
    no_stylesheets = True
    timefmt = ' [%b %d]'
    extra_css = '''
        .timestamp { font-size:xx-small; display: block; }
        #storyheader { font-size: medium; }
        #storyheader h1 { font-size: x-large; }
        #storyheader h2 { font-size: large; font-style: italic; }
        .byline { font-size:xx-small; }
        #photocaption { font-size: small; font-style: italic; }
        #photocredit { font-size: xx-small; }'''

    keep_only_tags = [dict(name='div', attrs={'id':'storyheader'}),
                      dict(name='div', attrs={'id':'storycontent'})]
    remove_tags = [{'class':'comments'},
                   dict(name='div', attrs={'class':'navbar'}),
                   dict(name='div', attrs={'class':'morelinks'}),
                   dict(name='div', attrs={'class':'viewmore'}),
                   dict(name='li', attrs={'class':'email'}),
                   dict(name='div', attrs={'class':'story_tool_hr'}),
                   dict(name='div', attrs={'class':'clear'}),
                   dict(name='div', attrs={'class':'story_tool'}),
                   dict(name='div', attrs={'class':'copyright'}),
                   dict(name='div', attrs={'class':'rule_grey_solid'}),
                   dict(name='li', attrs={'class':'print'}),
                   dict(name='li', attrs={'class':'share'}),
                   dict(name='ul', attrs={'class':'bullet'})]

    def preprocess_html(self,soup):
        # delete empty id attributes--they screw up the TOC for unknown reasons
        divtags = soup.findAll('div',attrs={'id':''})
        if divtags:
            for div in divtags:
                del(div['id'])
        return soup

    def parse_index(self):
        soup = self.index_to_soup(self.url_prefix+'/news/todays-paper/index.html')

        articles = {}
        key = 'News'
        ans = ['News']

        # find each instance of class="section_title02" or class="featurecontent"
        for divtag in soup.findAll('div',attrs={'class' : ["section_title02","featurecontent"]}):
            #self.log(" div class = %s" % divtag['class'])
            if divtag['class'].startswith('section_title'):
                # div contains a section title
                if not divtag.h3:
                    continue
                key = self.tag_to_string(divtag.h3,False)
                ans.append(key)
                self.log("Section name %s" % key)
                continue
            # div contains article data
            h1tag = divtag.find('h1')
            if not h1tag:
                continue
            atag = h1tag.find('a',href=True)
            if not atag:
                continue
            url = self.url_prefix+'/news/todays-paper/'+atag['href']
            #self.log("Section %s" % key)
            #self.log("url %s" % url)
            title = self.tag_to_string(atag,False)
            #self.log("title %s" % title)
            pubdate = ''
            description = ''
            ptag = divtag.find('p')
            if ptag:
                description = self.tag_to_string(ptag,False)
                #self.log("description %s" % description)
            author = ''
            autag = divtag.find('h4')
            if autag:
                author = self.tag_to_string(autag,False)
                #self.log("author %s" % author)
            if not articles.has_key(key):
                articles[key] = []
            articles[key].append(dict(title=title,url=url,date=pubdate,
                                      description=description,author=author,content=''))

        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans
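For example, to fetch the Victoria Times Colonist, un-comment just that paper's three lines near the top of the class and leave every other paper's lines commented out:

Code:
class CanWestPaper(BasicNewsRecipe):

    # un-comment the following three lines for the Victoria Times Colonist
    title = u'Victoria Times Colonist'
    url_prefix = 'http://www.timescolonist.com'
    description = u'News from Victoria, BC'

    # (all the other papers' title/url_prefix/description lines stay commented out)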
#1205
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2010
Device: Sony touch edition
Can somebody make a recipe for www.ledevoir.com? It would be really appreciated.
#1206
Connoisseur
Posts: 78
Karma: 192
Join Date: Nov 2009
Device: Sony PRS-600
Problem with Wall Street Journal (free) recipe
There is a problem with the Wall Street Journal (free) recipe, on line 85:
Code:
article_date = datetime.strptime(datestring.title(),"%B %d %Y")
It fails with:
Code:
ValueError: time data 'January 21 2010' does not match format '%B %d %Y'
#1207
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
New recipe for Le Devoir:
#1208
Evangelist
Posts: 428
Karma: 2370
Join Date: Jun 2006
Location: Germany
Device: Nokia 770, Ilead, Cybook G3, Kindle DX, Kindle 2, iPad, Kindle 3, PW
Maybe I was a little too ambitious today. I tried to create my first recipe of my own and failed... royally.
So maybe someone could lend me a little help here and make a recipe for this, please? www.welt.de
#1209
Member
Posts: 16
Karma: 10
Join Date: Jan 2010
Device: kindle 2i
Could someone please make one for Popular Science at http://www.popsci.com/gadgets, http://www.popsci.com/technology, and http://www.popsci.com/diy? Thank you.
#1210
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
re: Problem with Wall Street Journal (free) recipe
Interesting point. I'm not sure how to fix this, since the WSJ date string being decoded is in US-locale format. The recipe would have to specify that locale, and I don't see any format options that would do it. Any suggestions?
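To illustrate the problem: Python's strptime matches %B against the current locale's month names, so the same call that succeeds on a US system fails elsewhere. A minimal sketch, assuming a German locale is installed on the machine running calibre:

Code:
import locale
from datetime import datetime

locale.setlocale(locale.LC_TIME, 'de_DE')  # assumption: de_DE is available

# %B now expects 'Januar', so the English month name no longer matches and this
# raises: ValueError: time data 'January 21 2010' does not match format '%B %d %Y'
datetime.strptime('January 21 2010', '%B %d %Y')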
#1211
creator of calibre
Posts: 45,373
Karma: 27230406
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can always write your own date parser, though what I did in the non-free version is simply to use the WSJ-provided string as the timefmt.
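A sketch of that idea (here datestring stands for the raw dateline already scraped from the page; literal text passes through strftime unchanged, though any literal '%' would need escaping as '%%'):

Code:
# display the scraped dateline verbatim instead of parsing it
self.timefmt = ' [%s]' % datestring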
#1212
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
re: Problem with Wall Street Journal (free) recipe
The offending strptime call can be enclosed in a try ... except statement so that a locale error doesn't halt the recipe method; on failure, the filtering against oldest_article would simply be skipped and all articles included (see the sketch below).
If I can't figure out a solution that handles the locale issue properly, I'll do that.
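Concretely, the guard might look like this (a sketch against the recipe's preprocess_html, using the names from the current code):

Code:
try:
    article_date = datetime.strptime(datestring.title(),"%B %d %Y")
except ValueError:
    # the locale could not match the English month name; skip the
    # oldest_article filtering rather than abort the whole recipe
    article_date = None
if article_date is not None:
    earliest_date = date.today() - timedelta(days=self.oldest_article)
    if article_date.date() < earliest_date:
        self.log("Skipping article dated %s" % datestring)
        return None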
#1213
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
Quote:
#1214
creator of calibre
Posts: 45,373
Karma: 27230406
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Quote:
Code:
date = date.split()
month = {'January': 1, 'February': 2, ...}[date[0]]
day = int(date[1])
year = int(date[2])
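Spelled out, that approach could look like the following locale-independent helper (a sketch; the month table and function name are illustrative):

Code:
from datetime import date

MONTHS = {'January': 1, 'February': 2, 'March': 3, 'April': 4,
          'May': 5, 'June': 6, 'July': 7, 'August': 8, 'September': 9,
          'October': 10, 'November': 11, 'December': 12}

def parse_us_date(datestring):
    # 'January 21 2010' -> date(2010, 1, 21), regardless of system locale
    name, day, year = datestring.split()
    return date(int(year), MONTHS[name], int(day))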
#1215
Connoisseur
Posts: 51
Karma: 10
Join Date: Dec 2008
Location: Germany
Device: SONY PRS-500
Recipe for Columbia Journalism Review (CJR)
Hi,
I'm attaching a file that contains a recipe for the Columbia Journalism Review. Enjoy... XG