#1201
Little Fuzzy Soldier
Posts: 580
Karma: 5711
Join Date: Sep 2008
Location: Nowhere in particular.
Device: cybook gen3, htc hero, ipaq 214
Would it be possible to make a recipe for readitlaterlist.com, please? Thanks.
#1202
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
Wall Street Journal (free)
I have updated this recipe (thanks to kiklop74 and evanmaastrigt for their suggestions) to improve formatting and to limit article downloads according to oldest_article. I have also improved the tag filtering to remove extraneous content, and moved the customization area to the top of the recipe.
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
'''
online.wsj.com
'''
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, NavigableString
from datetime import timedelta, datetime, date

class WSJ(BasicNewsRecipe):
    # formatting adapted from original recipe by Kovid Goyal and Sujata Raman
    title = u'Wall Street Journal (free)'
    __author__ = 'Nick Redding'
    language = 'en'
    description = ('All the free content from the Wall Street Journal '
                   '(business, financial and political news)')
    no_stylesheets = True
    timefmt = ' [%b %d]'

    # customization notes: delete sections you are not interested in
    # set omit_paid_content to False if you want the paid content article snippets
    # set oldest_article to the maximum number of days back from today to include articles
    sectionlist = [
        ['/home-page','Front Page'],
        ['/public/page/news-opinion-commentary.html','Commentary'],
        ['/public/page/news-global-world.html','World News'],
        ['/public/page/news-world-business.html','US News'],
        ['/public/page/news-business-us.html','Business'],
        ['/public/page/news-financial-markets-stock.html','Markets'],
        ['/public/page/news-tech-technology.html','Technology'],
        ['/public/page/news-personal-finance.html','Personal Finance'],
        ['/public/page/news-lifestyle-arts-entertainment.html','Life & Style'],
        ['/public/page/news-real-estate-homes.html','Real Estate'],
        ['/public/page/news-career-jobs.html','Careers'],
        ['/public/page/news-small-business-marketing.html','Small Business']
    ]
    oldest_article = 2
    omit_paid_content = True

    extra_css = '''h1{font-size:large; font-family:Times,serif;}
                   h2{font-family:Times,serif; font-size:small; font-style:italic;}
                   .subhead{font-family:Times,serif; font-size:small; font-style:italic;}
                   .insettipUnit{font-family:Times,serif; font-size:xx-small;}
                   .targetCaption{font-size:x-small; font-family:Times,serif; font-style:italic; margin-top: 0.25em;}
                   .article{font-family:Times,serif; font-size:x-small;}
                   .tagline{font-size:xx-small;}
                   .dateStamp{font-family:Times,serif;}
                   h3{font-family:Times,serif; font-size:xx-small;}
                   .byline{font-family:Times,serif; font-size:xx-small; list-style-type: none;}
                   .metadataType-articleCredits{list-style-type: none;}
                   h6{font-family:Times,serif; font-size:small; font-style:italic;}
                   .paperLocation{font-size:xx-small;}'''

    remove_tags_before = dict({'class':re.compile('^articleHeadlineBox')})
    remove_tags = [
        dict({'id':re.compile('^articleTabs_tab_')}),
        #dict(id=["articleTabs_tab_article", "articleTabs_tab_comments",
        #         "articleTabs_tab_interactive","articleTabs_tab_video",
        #         "articleTabs_tab_map","articleTabs_tab_slideshow"]),
        {'class': ['footer_columns','network','insetCol3wide','interactive','video',
                   'slideshow','map','insettip','insetClose','more_in','insetContent',
                   # 'articleTools_bottom','articleTools_bottom mjArticleTools',
                   'aTools','tooltip','adSummary','nav-inline','insetFullBracket']},
        dict({'class':re.compile('^articleTools_bottom')}),
        dict(rel='shortcut icon')
    ]
    remove_tags_after = [dict(id="article_story_body"), {'class':"article story"}]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        return br

    def preprocess_html(self,soup):
        # check if article is too old
        datetag = soup.find('li',attrs={'class' : re.compile("^dateStamp")})
        if datetag:
            dateline_string = self.tag_to_string(datetag,False)
            date_items = dateline_string.split(',')
            datestring = date_items[0]+date_items[1]
            article_date = datetime.strptime(datestring.title(),"%B %d %Y")
            earliest_date = date.today() - timedelta(days=self.oldest_article)
            if article_date.date() < earliest_date:
                self.log("Skipping article dated %s" % datestring)
                return None
            datetag.parent.extract()

            # place dateline in article heading
            bylinetag = soup.find('h3','byline')
            if bylinetag:
                h3bylinetag = bylinetag
            else:
                bylinetag = soup.find('li','byline')
                if bylinetag:
                    h3bylinetag = bylinetag.h3
                    if not h3bylinetag:
                        h3bylinetag = bylinetag
                    bylinetag = bylinetag.parent
            if bylinetag:
                if h3bylinetag.a:
                    bylinetext = 'By '+self.tag_to_string(h3bylinetag.a,False)
                else:
                    bylinetext = self.tag_to_string(h3bylinetag,False)
                h3byline = Tag(soup,'h3',[('class','byline')])
                if bylinetext.isspace() or (bylinetext == ''):
                    h3byline.insert(0,NavigableString(date_items[0]+','+date_items[1]))
                else:
                    h3byline.insert(0,NavigableString(bylinetext+u'\u2014'+date_items[0]+','+date_items[1]))
                bylinetag.replaceWith(h3byline)
            else:
                headlinetag = soup.find('div',attrs={'class' : re.compile("^articleHeadlineBox")})
                if headlinetag:
                    dateline = Tag(soup,'h3',[('class','byline')])
                    dateline.insert(0,NavigableString(date_items[0]+','+date_items[1]))
                    headlinetag.insert(len(headlinetag),dateline)
        else:
            # if no date tag, don't process this page--it's not a news item
            return None

        # this gets rid of the annoying superfluous bullet symbol preceding columnist bylines
        ultag = soup.find('ul',attrs={'class' : 'cMetadata metadataType-articleCredits'})
        if ultag:
            a = ultag.h3
            if a:
                ultag.replaceWith(a)
        return soup

    def parse_index(self):
        articles = {}
        key = None
        ans = []

        def parse_index_page(page_name,page_title):

            def article_title(tag):
                atag = tag.find('h2')  # title is usually in an h2 tag
                if not atag:           # if not, get text from the a tag
                    atag = tag.find('a',href=True)
                    if not atag:
                        return ''
                    t = self.tag_to_string(atag,False)
                    if t == '':
                        # sometimes the title is in the second a tag
                        atag.extract()
                        atag = tag.find('a',href=True)
                        if not atag:
                            return ''
                        return self.tag_to_string(atag,False)
                    return t
                return self.tag_to_string(atag,False)

            def article_author(tag):
                atag = tag.find('strong')  # author is usually in a strong tag
                if not atag:
                    atag = tag.find('h4')  # if not, look for an h4 tag
                if not atag:
                    return ''
                return self.tag_to_string(atag,False)

            def article_summary(tag):
                atag = tag.find('p')
                if not atag:
                    return ''
                subtag = atag.strong
                if subtag:
                    subtag.extract()
                return self.tag_to_string(atag,False)

            def article_url(tag):
                atag = tag.find('a',href=True)
                if not atag:
                    return ''
                url = re.sub(r'\?.*', '', atag['href'])
                return url

            def handle_section_name(tag):
                # turns a tag into a section name with special processing
                # for What's News, U.S., World & U.S. and World
                s = self.tag_to_string(tag,False)
                if ("What" in s) and ("News" in s):
                    s = "What's News"
                elif (s == "U.S.") or (s == "World & U.S.") or (s == "World"):
                    s = s + " News"
                return s

            mainurl = 'http://online.wsj.com'
            pageurl = mainurl+page_name
            #self.log("Page url %s" % pageurl)
            soup = self.index_to_soup(pageurl)
            # find each instance of div with class including "headlineSummary"
            for divtag in soup.findAll('div',attrs={'class' : re.compile("^headlineSummary")}):
                # divtag contains all article data as ul's and li's
                # first, check if there is an h3 tag which provides a section name
                stag = divtag.find('h3')
                if stag:
                    if stag.parent['class'] == 'dynamic':
                        # a carousel of articles is too complex to extract a section name
                        # for each article, so we'll just call the section "Carousel"
                        section_name = 'Carousel'
                    else:
                        section_name = handle_section_name(stag)
                else:
                    section_name = "What's News"
                #self.log("div Section %s" % section_name)
                # find each top-level ul in the div
                # we don't restrict to class = newsItem because the section_name
                # sometimes changes via a ul tag inside the div
                for ultag in divtag.findAll('ul',recursive=False):
                    stag = ultag.find('h3')
                    if stag:
                        if stag.parent.name == 'ul':
                            # section name has changed
                            section_name = handle_section_name(stag)
                            #self.log("ul Section %s" % section_name)
                            # delete the h3 tag so it doesn't get in the way
                            stag.extract()
                    # find each top-level li in the ul
                    for litag in ultag.findAll('li',recursive=False):
                        stag = litag.find('h3')
                        if stag:
                            # section name has changed
                            section_name = handle_section_name(stag)
                            #self.log("li Section %s" % section_name)
                            # delete the h3 tag so it doesn't get in the way
                            stag.extract()
                        # if there is a ul tag inside the li it is superfluous;
                        # it is probably a list of related articles
                        utag = litag.find('ul')
                        if utag:
                            utag.extract()
                        # now skip paid subscriber articles if desired
                        subscriber_tag = litag.find(text="Subscriber Content")
                        if subscriber_tag:
                            if self.omit_paid_content:
                                continue
                        # delete the tip div so it doesn't get in the way
                        tiptag = litag.find("div", {"class" : "tipTargetBox"})
                        if tiptag:
                            tiptag.extract()
                        h1tag = litag.h1
                        # if there's an h1 tag, its parent is a div which should replace
                        # the li tag for the analysis
                        if h1tag:
                            litag = h1tag.parent
                        h5tag = litag.h5
                        if h5tag:
                            # section name has changed
                            section_name = self.tag_to_string(h5tag,False)
                            #self.log("h5 Section %s" % section_name)
                            # delete the h5 tag so it doesn't get in the way
                            h5tag.extract()
                        url = article_url(litag)
                        if url == '':
                            continue
                        if url.startswith("/article"):
                            url = mainurl+url
                        if not url.startswith("http://online.wsj.com"):
                            continue
                        if not url.endswith(".html"):
                            continue
                        if 'video' in url:
                            continue
                        title = article_title(litag)
                        if title == '':
                            continue
                        #self.log("URL %s" % url)
                        #self.log("Title %s" % title)
                        pubdate = ''
                        #self.log("Date %s" % pubdate)
                        author = article_author(litag)
                        if author == '':
                            author = section_name
                        elif author == section_name:
                            author = ''
                        else:
                            author = section_name+': '+author
                        #if not author == '':
                        #    self.log("Author %s" % author)
                        description = article_summary(litag)
                        #if not description == '':
                        #    self.log("Description %s" % description)
                        if not articles.has_key(page_title):
                            articles[page_title] = []
                        articles[page_title].append(
                            dict(title=title,url=url,date=pubdate,
                                 description=description,author=author,content=''))

        for page_name,page_title in self.sectionlist:
            parse_index_page(page_name,page_title)
            ans.append(page_title)

        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans
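For example, to keep only three sections, reach back three days, and keep the paid-content snippets, the customization block at the top of the recipe would be edited like this (the values here are just illustrative):

Code:
    # customization area: keep only the sections you want
    sectionlist = [
        ['/home-page','Front Page'],
        ['/public/page/news-financial-markets-stock.html','Markets'],
        ['/public/page/news-tech-technology.html','Technology']
    ]
    oldest_article = 3         # include articles up to 3 days old
    omit_paid_content = False  # keep snippets of subscriber-only articles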
#1203
Junior Member
Posts: 8
Karma: 10
Join Date: Jul 2009
Location: Massachusetts
Device: nook
Has the Wall Street Journal (US) [subscription] recipe stopped working for anyone else? Thanks.
#1204
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
Canadian Newspapers--CanWest chain
The CanWest chain of Canadian newspapers all use the same web format. Here is a recipe that will handle any of them--just un-comment the three lines in the header corresponding to the paper you want (see the example after the recipe).
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
'''
www.canada.com
'''
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag

class CanWestPaper(BasicNewsRecipe):

    # un-comment the following three lines for the Victoria Times Colonist
    #title = u'Victoria Times Colonist'
    #url_prefix = 'http://www.timescolonist.com'
    #description = u'News from Victoria, BC'

    # un-comment the following three lines for the Vancouver Province
    #title = u'Vancouver Province'
    #url_prefix = 'http://www.theprovince.com'
    #description = u'News from Vancouver, BC'

    # un-comment the following three lines for the Vancouver Sun
    #title = u'Vancouver Sun'
    #url_prefix = 'http://www.vancouversun.com'
    #description = u'News from Vancouver, BC'

    # un-comment the following three lines for the Edmonton Journal
    #title = u'Edmonton Journal'
    #url_prefix = 'http://www.edmontonjournal.com'
    #description = u'News from Edmonton, AB'

    # un-comment the following three lines for the Calgary Herald
    #title = u'Calgary Herald'
    #url_prefix = 'http://www.calgaryherald.com'
    #description = u'News from Calgary, AB'

    # un-comment the following three lines for the Regina Leader-Post
    #title = u'Regina Leader-Post'
    #url_prefix = 'http://www.leaderpost.com'
    #description = u'News from Regina, SK'

    # un-comment the following three lines for the Saskatoon Star-Phoenix
    #title = u'Saskatoon Star-Phoenix'
    #url_prefix = 'http://www.thestarphoenix.com'
    #description = u'News from Saskatoon, SK'

    # un-comment the following three lines for the Windsor Star
    #title = u'Windsor Star'
    #url_prefix = 'http://www.windsorstar.com'
    #description = u'News from Windsor, ON'

    # un-comment the following three lines for the Ottawa Citizen
    #title = u'Ottawa Citizen'
    #url_prefix = 'http://www.ottawacitizen.com'
    #description = u'News from Ottawa, ON'

    # un-comment the following three lines for the Montreal Gazette
    #title = u'Montreal Gazette'
    #url_prefix = 'http://www.montrealgazette.com'
    #description = u'News from Montreal, QC'

    language = 'en_CA'
    __author__ = 'Nick Redding'
    no_stylesheets = True
    timefmt = ' [%b %d]'
    extra_css = '''
        .timestamp { font-size:xx-small; display: block; }
        #storyheader { font-size: medium; }
        #storyheader h1 { font-size: x-large; }
        #storyheader h2 { font-size: large; font-style: italic; }
        .byline { font-size:xx-small; }
        #photocaption { font-size: small; font-style: italic; }
        #photocredit { font-size: xx-small; }'''

    keep_only_tags = [dict(name='div', attrs={'id':'storyheader'}),
                      dict(name='div', attrs={'id':'storycontent'})]
    remove_tags = [{'class':'comments'},
                   dict(name='div', attrs={'class':'navbar'}),
                   dict(name='div', attrs={'class':'morelinks'}),
                   dict(name='div', attrs={'class':'viewmore'}),
                   dict(name='li', attrs={'class':'email'}),
                   dict(name='div', attrs={'class':'story_tool_hr'}),
                   dict(name='div', attrs={'class':'clear'}),
                   dict(name='div', attrs={'class':'story_tool'}),
                   dict(name='div', attrs={'class':'copyright'}),
                   dict(name='div', attrs={'class':'rule_grey_solid'}),
                   dict(name='li', attrs={'class':'print'}),
                   dict(name='li', attrs={'class':'share'}),
                   dict(name='ul', attrs={'class':'bullet'})]

    def preprocess_html(self,soup):
        # delete empty id attributes--they screw up the TOC for unknown reasons
        divtags = soup.findAll('div',attrs={'id':''})
        if divtags:
            for div in divtags:
                del(div['id'])
        return soup

    def parse_index(self):
        soup = self.index_to_soup(self.url_prefix+'/news/todays-paper/index.html')

        articles = {}
        key = 'News'
        ans = ['News']

        # find each instance of class="section_title02" or class="featurecontent"
        for divtag in soup.findAll('div',attrs={'class' : ["section_title02","featurecontent"]}):
            #self.log(" div class = %s" % divtag['class'])
            if divtag['class'].startswith('section_title'):
                # div contains a section title
                if not divtag.h3:
                    continue
                key = self.tag_to_string(divtag.h3,False)
                ans.append(key)
                self.log("Section name %s" % key)
                continue
            # div contains article data
            h1tag = divtag.find('h1')
            if not h1tag:
                continue
            atag = h1tag.find('a',href=True)
            if not atag:
                continue
            url = self.url_prefix+'/news/todays-paper/'+atag['href']
            #self.log("Section %s" % key)
            #self.log("url %s" % url)
            title = self.tag_to_string(atag,False)
            #self.log("title %s" % title)
            pubdate = ''
            description = ''
            ptag = divtag.find('p')
            if ptag:
                description = self.tag_to_string(ptag,False)
                #self.log("description %s" % description)
            author = ''
            autag = divtag.find('h4')
            if autag:
                author = self.tag_to_string(autag,False)
                #self.log("author %s" % author)
            if not articles.has_key(key):
                articles[key] = []
            articles[key].append(dict(title=title,url=url,date=pubdate,
                                      description=description,author=author,content=''))

        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans
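For example, to fetch the Victoria Times Colonist, un-comment just that paper's three lines near the top of the class and leave every other paper's lines commented out:

Code:
class CanWestPaper(BasicNewsRecipe):

    # un-comment the following three lines for the Victoria Times Colonist
    title = u'Victoria Times Colonist'
    url_prefix = 'http://www.timescolonist.com'
    description = u'News from Victoria, BC'

    # (all the other papers' title/url_prefix/description lines stay commented out)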
#1205
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2010
Device: Sony touch edition
Can somebody make a recipe for www.ledevoir.com? It would be really appreciated.
#1206
Connoisseur
Posts: 78
Karma: 192
Join Date: Nov 2009
Device: Sony PRS-600
Problem with Wall Street Journal (free) recipe
There is a problem with the Wall Street Journal (free) recipe, on line 85:
Code:
article_date = datetime.strptime(datestring.title(),"%B %d %Y")
It fails with:
Code:
ValueError: time data 'January 21 2010' does not match format '%B %d %Y'
#1207
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
New recipe for Le Devoir:
#1208
Evangelist
Posts: 428
Karma: 2370
Join Date: Jun 2006
Location: Germany
Device: Nokia 770, Ilead, Cybook G3, Kindle DX, Kindle 2, iPad, Kindle 3, PW
Maybe I was a little too ambitious today. I tried to create my first recipe of my own and failed... royally.
So maybe someone could lend me a little help here and make a recipe for this, please? www.welt.de
#1209
Member
Posts: 16
Karma: 10
Join Date: Jan 2010
Device: kindle 2i
Could someone please make one for Popular Science at http://www.popsci.com/gadgets, http://www.popsci.com/technology, and http://www.popsci.com/diy? Thank you.
#1210
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
re: Problem with Wall Street Journal (free) recipe
Interesting point. I'm not sure how to fix this, since the WSJ date string being decoded is in US-locale format. The recipe would have to specify that locale, and I don't see any format options that would do it. Any suggestions?
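To illustrate the problem: Python's strptime matches %B against the current locale's month names, so the same call that succeeds on a US system fails elsewhere. A minimal sketch, assuming a German locale is installed on the machine running calibre:

Code:
import locale
from datetime import datetime

locale.setlocale(locale.LC_TIME, 'de_DE')  # assumption: de_DE is available

# %B now expects 'Januar', so the English month name no longer matches and this
# raises: ValueError: time data 'January 21 2010' does not match format '%B %d %Y'
datetime.strptime('January 21 2010', '%B %d %Y')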
#1211
creator of calibre
Posts: 45,373
Karma: 27230406
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can always write your own date parser, though what I did in the non-free version is simply to use the WSJ-provided string as the timefmt.
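A sketch of that idea (here datestring stands for the raw dateline already scraped from the page; literal text passes through strftime unchanged, though any literal '%' would need escaping as '%%'):

Code:
# display the scraped dateline verbatim instead of parsing it
self.timefmt = ' [%s]' % datestring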
#1212
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
re: Problem with Wall Street Journal (free) recipe
The offending strptime call can be enclosed in a try ... except statement so that a locale error doesn't halt the recipe method; on failure, the filtering against oldest_article would simply be skipped and all articles included (see the sketch below).
If I can't figure out a solution that handles the locale issue properly, I'll do that.
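Concretely, the guard might look like this (a sketch against the recipe's preprocess_html, using the names from the current code):

Code:
try:
    article_date = datetime.strptime(datestring.title(),"%B %d %Y")
except ValueError:
    # the locale could not match the English month name; skip the
    # oldest_article filtering rather than abort the whole recipe
    article_date = None
if article_date is not None:
    earliest_date = date.today() - timedelta(days=self.oldest_article)
    if article_date.date() < earliest_date:
        self.log("Skipping article dated %s" % datestring)
        return None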
#1213
onlinenewsreader.net
Posts: 327
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
Quote:
#1214
creator of calibre
Posts: 45,373
Karma: 27230406
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Quote:
Code:
date = date.split()
month = {'January': 1, 'February': 2, ...}[date[0]]
day = int(date[1])
year = int(date[2])
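Spelled out, that approach could look like the following locale-independent helper (a sketch; the month table and function name are illustrative):

Code:
from datetime import date

MONTHS = {'January': 1, 'February': 2, 'March': 3, 'April': 4,
          'May': 5, 'June': 6, 'July': 7, 'August': 8, 'September': 9,
          'October': 10, 'November': 11, 'December': 12}

def parse_us_date(datestring):
    # 'January 21 2010' -> date(2010, 1, 21), regardless of system locale
    name, day, year = datestring.split()
    return date(int(year), MONTHS[name], int(day))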
#1215
Connoisseur
Posts: 51
Karma: 10
Join Date: Dec 2008
Location: Germany
Device: SONY PRS-500
Recipe for Columbia Journalism Review (CJR)
Hi,
I'm attaching a file that contains a recipe for the Columbia Journalism Review. Enjoy... XG