Old 01-16-2010, 02:54 PM   #1171
nickredding
onlinenewsreader.net
Posts: 328
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
Wall Street Journal (free)

Wall Street Journal -- here is a recipe for the free parts of the Wall Street Journal, which are quite extensive. If you run this recipe for all sections, you'll get over 7 MB (Kindle/MOBI) and it will take about 30 minutes on a fast PC--that's a lot of material! If you don't want all of the sections, just delete the ones you aren't interested in from sectionlist (near the bottom of the recipe). If you want the snippets from paid content, set omit_paid_content to False (it defaults to True, which means paid content is skipped).

Comments on how to make the recipe run faster would be welcome--I think it's mainly a function of the quantity of material.
Code:
#!/usr/bin/env python

__license__   = 'GPL v3'

'''
online.wsj.com
'''
import re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class WSJ(BasicNewsRecipe):
    # formatting adapted from original recipe by Kovid Goyal and Sujata Raman
    title          = u'Wall Street Journal (free)'
    no_stylesheets = True
    timefmt = ' [%b %d]'
    extra_css   = '''h1{font-size:large; font-family:Georgia,"Century Schoolbook","Times New Roman",Times,serif;}
                    h2{font-family:Georgia,"Century Schoolbook","Times New Roman",Times,serif; font-size:small; font-style:italic;}
                    .subhead{font-family:Georgia,"Century Schoolbook","Times New Roman",Times,serif; font-size:small; font-style:italic;}
                    .insettipUnit{font-family:Arial,sans-serif; font-size:xx-small;}
                    .targetCaption{font-size:x-small; font-family:Arial,Helvetica,sans-serif;}
                    .article{font-family:Arial,Helvetica,sans-serif; font-size:x-small;}
                    .tagline{font-size:xx-small;}
                    .dateStamp{font-family:Arial,Helvetica,sans-serif;}
                    h3{font-family:Arial,Helvetica,sans-serif; font-size:xx-small;}
                    .byline{font-family:Arial,Helvetica,sans-serif; font-size:xx-small; list-style-type:none;}
                    .metadataType-articleCredits{list-style-type:none;}
                    h6{font-family:Georgia,"Century Schoolbook","Times New Roman",Times,serif; font-size:small; font-style:italic;}
                    .paperLocation{font-size:xx-small;}'''

    remove_tags_before = dict(name='h1')
    remove_tags = [dict(id=["articleTabs_tab_article", "articleTabs_tab_comments",
                            "articleTabs_tab_interactive", "articleTabs_tab_video",
                            "articleTabs_tab_map", "articleTabs_tab_slideshow"]),
                   {'class': ['footer_columns', 'network', 'insetCol3wide', 'interactive',
                              'video', 'slideshow', 'map', 'insettip', 'insetClose',
                              'more_in', 'insetContent', 'articleTools_bottom', 'aTools',
                              'tooltip', 'adSummary', 'nav-inline', 'insetFullBracket']},
                   dict(rel='shortcut icon'),
                   ]
    remove_tags_after = [dict(id="article_story_body"), {'class': "article story"}]
    
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        return br

    def preprocess_html(self,soup):
        # This gets rid of the annoying superfluous bullet symbol preceding columnist bylines
        ultag = soup.find('ul',attrs={'class' : 'cMetadata metadataType-articleCredits'})
        if ultag:
            a = ultag.h3
            if a:
                ultag.replaceWith(a)
        return soup

    def parse_index(self):

        articles = {}
        key = None
        ans = []

        def parse_index_page(page_name,page_title,omit_paid_content):

            def article_title(tag):
                atag = tag.find('h2') # title is usually in an h2 tag
                if not atag: # if not, get text from the a tag
                    atag = tag.find('a',href=True)
                    if not atag:
                        return ''
                    t = self.tag_to_string(atag,False)
                    if t == '':
                        # sometimes the title is in the second a tag
                        atag.extract()
                        atag = tag.find('a',href=True)
                        if not atag:
                            return ''
                        return self.tag_to_string(atag,False)
                    return t
                return self.tag_to_string(atag,False)

            def article_author(tag):
                atag = tag.find('strong') # author is usually in a strong tag
                if not atag:
                     atag = tag.find('h4') # if not, look for an h4 tag
                     if not atag:
                         return ''
                return self.tag_to_string(atag,False)

            def article_summary(tag):
                atag = tag.find('p')
                if not atag:
                    return ''
                subtag = atag.strong
                if subtag:
                    subtag.extract()
                return self.tag_to_string(atag,False)

            def article_url(tag):
                atag = tag.find('a',href=True)
                if not atag:
                    return ''
                url = re.sub(r'\?.*', '', atag['href'])
                return url

            def handle_section_name(tag):
                # turns a tag into a section name, with special processing
                # for What's News, U.S., World & U.S. and World
                s = self.tag_to_string(tag,False)
                if ("What" in s) and ("News" in s):
                    s = "What's News"
                elif (s == "U.S.") or (s == "World & U.S.") or (s == "World"):
                    s = s + " News"
                return s

                

            mainurl = 'http://online.wsj.com'
            pageurl = mainurl+page_name
            #self.log("Page url %s" % pageurl)
            soup = self.index_to_soup(pageurl)
            # Find each instance of div with class including "headlineSummary"
            for divtag in soup.findAll('div',attrs={'class' : re.compile("^headlineSummary")}):

                # divtag contains all article data as ul's and li's
                # first, check if there is an h3 tag which provides a section name
                stag = divtag.find('h3')
                if stag:
                    if stag.parent['class'] == 'dynamic':
                        # a carousel of articles is too complex to extract a section name
                        # for each article, so we'll just call the section "Carousel"
                        section_name = 'Carousel'
                    else:
                        section_name = handle_section_name(stag)
                else:
                    section_name = "What's News"
                #self.log("div Section %s" % section_name)
                # find each top-level ul in the div
                # we don't restrict to class = newsItem because the section_name
                # sometimes changes via a ul tag inside the div
                for ultag in divtag.findAll('ul',recursive=False):
                    stag = ultag.find('h3')
                    if stag:
                        if stag.parent.name == 'ul':
                            # section name has changed
                            section_name = handle_section_name(stag)
                            #self.log("ul Section %s" % section_name)
                            # delete the h3 tag so it doesn't get in the way
                            stag.extract()
                    # find each top level li in the ul
                    for litag in ultag.findAll('li',recursive=False):
                        stag = litag.find('h3')
                        if stag:
                            # section name has changed
                            section_name = handle_section_name(stag)
                            #self.log("li Section %s" % section_name)
                            # delete the h3 tag so it doesn't get in the way
                            stag.extract()
                        # if there is a ul tag inside the li it is superfluous;
                        # it is probably a list of related articles
                        utag = litag.find('ul')
                        if utag:
                            utag.extract()
                        # now skip paid subscriber articles if desired
                        subscriber_tag = litag.find(text="Subscriber Content")
                        if subscriber_tag:
                            if omit_paid_content:
                                continue
                            # delete the tip div so it doesn't get in the way
                            tiptag = litag.find("div", { "class" : "tipTargetBox" })
                            if tiptag:
                                tiptag.extract()
                        h1tag = litag.h1
                        # if there's an h1 tag, its parent is a div which should
                        # replace the li tag for the analysis
                        if h1tag:
                            litag = h1tag.parent                  
                        h5tag = litag.h5
                        if h5tag:
                            # section name has changed
                            section_name = self.tag_to_string(h5tag,False)
                            #self.log("h5 Section %s" % section_name)
                            # delete the h5 tag so it doesn't get in the way
                            h5tag.extract()
                        url = article_url(litag)
                        if url == '':
                            continue
                        if url.startswith("/article"):
                            url = mainurl+url
                        if not url.startswith("http"):
                            continue
                        if not url.endswith(".html"):
                            continue
                        if 'video' in url:
                            continue
                        title = article_title(litag)
                        if title == '':
                            continue
                        #self.log("URL %s" % url)
                        #self.log("Title %s" % title)
                        pubdate = ''
                        #self.log("Date %s" % pubdate)
                        author = article_author(litag)
                        if author == '':
                            author = section_name
                        elif author == section_name:
                            author = ''
                        else:
                            author = section_name+': '+author
                        #if not author == '':
                        #    self.log("Author %s" % author)
                        description = article_summary(litag)
                        #if not description == '':
                        #    self.log("Description %s" % description)
                        if page_title not in articles:
                            articles[page_title] = []
                        articles[page_title].append(dict(title=title,url=url,date=pubdate,description=description,author=author,content=''))

        # customization notes: delete sections you are not interested in
        # set omit_paid_content to False if you want the paid content article previews
        sectionlist = ['Front Page','Commentary','World News','US News','Business','Markets',
                       'Technology','Personal Finance','Life & Style','Real Estate','Careers','Small Business']
        omit_paid_content = True
    
        if 'Front Page' in sectionlist:
            parse_index_page('/home-page','Front Page',omit_paid_content)
            ans.append('Front Page')
        if 'Commentary' in sectionlist:
            parse_index_page('/public/page/news-opinion-commentary.html','Commentary',omit_paid_content)
            ans.append('Commentary')
        if 'World News' in sectionlist:
            parse_index_page('/public/page/news-global-world.html','World News',omit_paid_content)
            ans.append('World News')
        if 'US News' in sectionlist:
            parse_index_page('/public/page/news-world-business.html','US News',omit_paid_content)
            ans.append('US News')
        if 'Business' in sectionlist:
            parse_index_page('/public/page/news-business-us.html','Business',omit_paid_content)
            ans.append('Business')
        if 'Markets' in sectionlist:
            parse_index_page('/public/page/news-financial-markets-stock.html','Markets',omit_paid_content)
            ans.append('Markets')
        if 'Technology' in sectionlist:
            parse_index_page('/public/page/news-tech-technology.html','Technology',omit_paid_content)
            ans.append('Technology')
        if 'Personal Finance' in sectionlist:
            parse_index_page('/public/page/news-personal-finance.html','Personal Finance',omit_paid_content)
            ans.append('Personal Finance')
        if 'Life & Style' in sectionlist:
            parse_index_page('/public/page/news-lifestyle-arts-entertainment.html','Life & Style',omit_paid_content)
            ans.append('Life & Style')
        if 'Real Estate' in sectionlist:
            parse_index_page('/public/page/news-real-estate-homes.html','Real Estate',omit_paid_content)
            ans.append('Real Estate')
        if 'Careers' in sectionlist:
            parse_index_page('/public/page/news-career-jobs.html','Careers',omit_paid_content)
            ans.append('Careers')
        if 'Small Business' in sectionlist:
            parse_index_page('/public/page/news-small-business-marketing.html','Small Business',omit_paid_content)
            ans.append('Small Business')

        ans = [(key, articles[key]) for key in ans if key in articles]
        return ans
Old 01-16-2010, 06:28 PM   #1172
kiklop74
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Quote:
Originally Posted by nickredding
Wall Street Journal -- here is a recipe for the free parts of the Wall Street Journal, which are quite extensive.
What is the point of putting proprietary fonts in extra_css? None of the e-reader devices (other than the iPod touch or iPhone) has them. It is better to use device-independent font family names (serif, sans-serif, monospace, etc.).
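
For example, the font specs in that recipe's extra_css could be collapsed to generic families; a sketch that keeps the original sizes:
Code:
    extra_css = '''h1{font-size:large; font-family:serif;}
                   h2, .subhead, h6{font-size:small; font-style:italic; font-family:serif;}
                   .article, .targetCaption{font-size:x-small; font-family:sans-serif;}
                   h3, .byline, .insettipUnit, .tagline, .paperLocation{font-size:xx-small; font-family:sans-serif;}'''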
Old 01-16-2010, 06:58 PM   #1173
nickredding
onlinenewsreader.net
Posts: 328
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
Quote:
What is the point of putting proprietary fonts in extra_css? None of the e-reader devices (other than the iPod touch or iPhone) has them. It is better to use device-independent font family names (serif, sans-serif, monospace, etc.).
True. However, as stated in the recipe, the formatting is taken from the standard Calibre recipe because I don't believe in reinventing the wheel. It works fine on my K2, which shows that CSS font specs fall back gracefully to defaults.
Old 01-16-2010, 08:06 PM   #1174
evanmaastrigt
Connoisseur
Posts: 78
Karma: 192
Join Date: Nov 2009
Device: Sony PRS-600
New recipe for Joop, improved recipes for 'Fokke en Sukke' and nrcnext

Quote:
Originally Posted by lorenzov
Edwin, I must have been writing this whilst you were responding! It might save some time
That's why I said 'over the weekend'. And I can always find another way to waste my time :-)

Attached a Dutch package. In it are:
A new recipe for 'Joop', a Dutch political blog (and Huffington Post clone)
An improved recipe for 'Fokke en Sukke', the popular Dutch cartoons
An improved recipe for 'nrcnext', the blog of the Dutch daily nrc.next

dutchpackage01.zip
Old 01-16-2010, 10:52 PM   #1175
lorenzov
Member
Posts: 23
Karma: 12
Join Date: Jan 2010
Location: Edinburgh, UK
Device: SONY PRS600, Apple iPhone 3G
ledevoir recipe

Hi Nic, attached is the free version. I noticed that there are many 'authors' available in the feeds; customise the recipe to include any other feeds you want.

RE paid content, have a look at the wiki guide:
http://bugs.calibre-ebook.com/wiki/recipeGuide_advanced

You need to customise the fields, but the recipe will then be able to fetch the content as it is. (It looks like the feeds already contain paid content, but calibre fails to fetch those articles because they require a subscription.)
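
For reference, the pattern from that guide looks roughly like this. needs_subscription and get_browser are the real calibre hooks, but the login URL and the form/field names below are placeholders that would have to be matched to what ledevoir.com actually uses:
Code:
    needs_subscription = True  # makes calibre prompt for username/password

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://www.ledevoir.com/login')  # placeholder login page
            br.select_form(name='loginform')          # placeholder form name
            br['username'] = self.username            # placeholder field names
            br['password'] = self.password
            br.submit()
        return br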

lorenzo
Attached Files
File Type: zip ledevoir.zip (1.5 KB, 188 views)
Old 01-17-2010, 01:05 AM   #1176
Nic
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2010
Device: Sony touch edition
Wow, very kind of you, thank you very much.
Old 01-17-2010, 01:21 AM   #1177
Nic
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2010
Device: Sony touch edition
I did open the file you gave me, but I am not able to get anything from ledevoir.com except the names of the sections. Do you know what I should do?

Thank you very much
Old 01-17-2010, 01:34 AM   #1178
Nic
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2010
Device: Sony touch edition
I even get the beginning of the article: I click on it and then I just have the title and a few other things from the page, but not the text of the article. Any help is really appreciated.
Old 01-17-2010, 08:20 AM   #1179
lorenzov
Member
Posts: 23
Karma: 12
Join Date: Jan 2010
Location: Edinburgh, UK
Device: SONY PRS600, Apple iPhone 3G
Hi Nic,
I assumed that it was a daily source and specified 1 as oldest_article. However, it seems that they have not published anything today, so the sections come out empty because nothing fits the criteria.

Change the value to at least 2 and you should get your feeds.
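That is, near the top of the recipe class:
Code:
    oldest_article = 2  # look back two days instead of one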

By the way, what format are you using, and on which device?

lorenzo
Old 01-17-2010, 02:43 PM   #1180
nickredding
onlinenewsreader.net
Posts: 328
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
National Post (Canada)

The standard recipe for the National Post does not accommodate articles that are continued on a second page (URL). Replace the method preprocess_html with the following code to ensure the complete article is downloaded in these cases:

Code:
    def preprocess_html(self, soup):
        # note: this requires "from calibre.ebooks.BeautifulSoup import BeautifulSoup"
        # at the top of the recipe
        story = soup.find(name='div', attrs={'class':'triline'})
        page2_link = soup.find('p','pagenav')
        if page2_link:
            atag = page2_link.find('a',href=True)
            if atag:
                page2_url = atag['href']
                if page2_url.startswith('story'):
                    page2_url = 'http://www.nationalpost.com/todays-paper/'+page2_url
                elif page2_url.startswith('/todays-paper/story.html'):
                    page2_url = 'http://www.nationalpost.com'+page2_url
                page2_soup = self.index_to_soup(page2_url)
                if page2_soup:
                    page2_content = page2_soup.find('div','story-content')
                    if page2_content:
                        full_story = BeautifulSoup('<div></div>')
                        full_story.insert(0,story)
                        full_story.insert(1,page2_content)
                        story = full_story
        soup = BeautifulSoup('<html><head><title>t</title></head><body></body></html>')
        body = soup.find(name='body')
        body.insert(0, story)
        return soup
Old 01-17-2010, 09:33 PM   #1181
evanmaastrigt
Connoisseur
Posts: 78
Karma: 192
Join Date: Nov 2009
Device: Sony PRS-600
Quote:
Originally Posted by nickredding
Comments on how to make the recipe run faster would be welcome--I think it's mainly a function of the quantity of material.
Yes, it takes some time to download - 31 minutes resulting in a 7.1 MB ePub (and a very hot processor) on this end - but I agree, that has to do with the sheer amount of content.

But the WSJ is a daily, and this recipe includes articles from more than a month ago (and those have a lot of pictures too ;-). So it might be an idea to restrict the articles to those published in, say, the last two days. The article's webpage carries the publication date, so some clever date parsing will do the trick (see the sketch below).

There are other techniques for optimizing, but they tend to make a big mess of your code, and frankly I think they would turn out to be a waste of time.
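
A rough sketch of that date filter, to give the idea -- the 'dateStamp' class and the date format are guesses and would have to be matched to what the article pages actually use:
Code:
    # at the top of the recipe file:
    from datetime import datetime, timedelta

    # inside the recipe class:
    def article_is_recent(self, soup):
        # hypothetical helper: keep an article only if its date stamp
        # (when one can be found and parsed) is within the last two days
        dtag = soup.find('div', attrs={'class': 'dateStamp'})
        if not dtag:
            return True  # no date found -- keep the article
        try:
            pubdate = datetime.strptime(self.tag_to_string(dtag, False).strip(),
                                        '%B %d, %Y')
        except ValueError:
            return True  # unparseable date -- keep the article
        return datetime.now() - pubdate <= timedelta(days=2)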
Old 01-18-2010, 12:25 AM   #1182
Nic
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2010
Device: Sony touch edition
Hi Lorenzov,

oldest_article was already set to 7. I tried several things and I still get the same results. I have the Sony PRS600 and I was using the EPUB format for newspapers, but I don't think that has anything to do with the problem, because when I view the result in calibre I get the titles without the text as well.

Is this recipe working for you? Tell me if you see a solution.

Thank you again for your help!
Old 01-18-2010, 12:56 AM   #1183
martys5150
Junior Member
Posts: 5
Karma: 10
Join Date: Sep 2008
Device: Sony 505, ipod touch, ebookwise
Would anyone be able to make one for http://www.kitsapsun.com/ ?
Old 01-18-2010, 10:47 AM   #1184
kiklop74
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Here goes a recipe for the Kitsap Sun:
Attached Files
File Type: zip kitsapun.zip (3.4 KB, 185 views)
Old 01-18-2010, 11:44 AM   #1185
evanmaastrigt
Connoisseur
Posts: 78
Karma: 192
Join Date: Nov 2009
Device: Sony PRS-600
New recipe for the Yemen Times

Here is a new recipe for the Yemen Times
yementimes.zip