#1216
Connoisseur
Posts: 51
Karma: 10
Join Date: Dec 2008
Location: Germany
Device: SONY PRS-500
Ordering of Recipes in Calibre's Add A Custom News Source
Kovid,
Is it possible to have the list of news sources in the "Add a custom news source" dialog sorted alphabetically? If that option is already built in, how do I enable it? My list is not currently alphabetized, and it would be so much nicer if those recipes were listed in alphabetical order. Thanks... XG
#1217
Connoisseur
Posts: 51
Karma: 10
Join Date: Dec 2008
Location: Germany
Device: SONY PRS-500
Recipes in Custom List Not Found in Main English List
Kovid,
I found two recipes in my Custom list that are not listed in the English list. They are NRC International and Politiken - English, two great recipes written by kwetal. They both work fine and I schedule them for download. But why don't they appear in the English list? Have I somehow moved them to the Custom list, or were they never in the English list to begin with? I ask because I was working on creating both recipes when I discovered that they had already been done. (Thanks, kwetal.) Any help or explanation would be greatly appreciated. Bye... XG
PS: Is it possible to extend the login time period? Quite often I'm experimenting with recipes, and when I come back here I have to log back in.
#1218
creator of calibre
Posts: 45,386
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Custom recipes are always listed only in the custom recipes section. Only built-in recipes are listed in the language sections.
#1219
Zealot
Posts: 118
Karma: 210
Join Date: Jan 2010
Location: Mid-Tennessee
Device: PRS-300
HUGE thanks for this one!!
#1220
Connoisseur
Posts: 51
Karma: 10
Join Date: Dec 2008
Location: Germany
Device: SONY PRS-500
NRC and Politiken Recipes from kwetal
Kovid,
Thanks for the clarification. Can you add kwetal's two recipes to the English list? The NRC English news source is from the Netherlands and the Politiken news source is from Denmark. See the zipped attachment. Thanks... XG
#1221
Zealot
Posts: 118
Karma: 210
Join Date: Jan 2010
Location: Mid-Tennessee
Device: PRS-300
Would it be possible to get a recipe from this location? http://www.hillsdale.edu/news/imprimis.asp
Thanks.
#1222
creator of calibre
Posts: 45,386
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
NRC is under Netherlands; use the search to find it.
#1223
onlinenewsreader.net
Posts: 328
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
re: Problem with Wall Street Journal (free) recipe
I should know better than to post in forums before I've finished my coffee. I was coding the solution to date locales when Kovid posted his suggestion. Here is the fixed recipe, which manually decodes the WSJ US-locale dateline for comparison. evanmaastrigt -- if you could test this in your locale, I'd appreciate it.
Code:
#!/usr/bin/env python

__license__ = 'GPL v3'
'''
online.wsj.com
'''
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, NavigableString
from datetime import timedelta, datetime, date

class WSJ(BasicNewsRecipe):
    # formatting adapted from original recipe by Kovid Goyal and Sujata Raman
    title = u'Wall Street Journal (free)'
    __author__ = 'Nick Redding'
    language = 'en'
    description = ('All the free content from the Wall Street Journal (business, financial and political news)')
    no_stylesheets = True
    timefmt = ' [%b %d]'

    # customization notes: delete sections you are not interested in
    # set omit_paid_content to False if you want the paid content article snippets
    # set oldest_article to the maximum number of days back from today to include articles
    sectionlist = [
        ['/home-page','Front Page'],
        ['/public/page/news-opinion-commentary.html','Commentary'],
        ['/public/page/news-global-world.html','World News'],
        ['/public/page/news-world-business.html','US News'],
        ['/public/page/news-business-us.html','Business'],
        ['/public/page/news-financial-markets-stock.html','Markets'],
        ['/public/page/news-tech-technology.html','Technology'],
        ['/public/page/news-personal-finance.html','Personal Finance'],
        ['/public/page/news-lifestyle-arts-entertainment.html','Life & Style'],
        ['/public/page/news-real-estate-homes.html','Real Estate'],
        ['/public/page/news-career-jobs.html','Careers'],
        ['/public/page/news-small-business-marketing.html','Small Business']
    ]
    oldest_article = 2
    omit_paid_content = True

    extra_css = '''h1{font-size:large; font-family:Times,serif;}
                   h2{font-family:Times,serif; font-size:small; font-style:italic;}
                   .subhead{font-family:Times,serif; font-size:small; font-style:italic;}
                   .insettipUnit{font-family:Times,serif; font-size:xx-small;}
                   .targetCaption{font-size:x-small; font-family:Times,serif; font-style:italic; margin-top: 0.25em;}
                   .article{font-family:Times,serif; font-size:x-small;}
                   .tagline{font-size:xx-small;}
                   .dateStamp{font-family:Times,serif;}
                   h3{font-family:Times,serif; font-size:xx-small;}
                   .byline{font-family:Times,serif; font-size:xx-small; list-style-type: none;}
                   .metadataType-articleCredits{list-style-type: none;}
                   h6{font-family:Times,serif; font-size:small; font-style:italic;}
                   .paperLocation{font-size:xx-small;}'''

    remove_tags_before = dict({'class':re.compile('^articleHeadlineBox')})
    remove_tags = [
        dict({'id':re.compile('^articleTabs_tab_')}),
        #dict(id=["articleTabs_tab_article", "articleTabs_tab_comments",
        #         "articleTabs_tab_interactive","articleTabs_tab_video",
        #         "articleTabs_tab_map","articleTabs_tab_slideshow"]),
        {'class': ['footer_columns','network','insetCol3wide','interactive','video','slideshow','map',
                   'insettip','insetClose','more_in', "insetContent",
                   #'articleTools_bottom','articleTools_bottom mjArticleTools', 'aTools',
                   'tooltip', 'adSummary', 'nav-inline','insetFullBracket']},
        dict({'class':re.compile('^articleTools_bottom')}),
        dict(rel='shortcut icon')
    ]
    remove_tags_after = [dict(id="article_story_body"), {'class':"article story"}]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        return br

    def preprocess_html(self,soup):

        def decode_us_date(datestr):
            udate = datestr.strip().lower().split()
            m = ['january','february','march','april','may','june','july',
                 'august','september','october','november','december'].index(udate[0])+1
            d = int(udate[1])
            y = int(udate[2])
            return date(y,m,d)

        # check if article is paid content
        if self.omit_paid_content:
            divtags = soup.findAll('div','tooltip')
            if divtags:
                for divtag in divtags:
                    if divtag.find(text="Subscriber Content"):
                        return None

        # check if article is too old
        datetag = soup.find('li',attrs={'class' : re.compile("^dateStamp")})
        if datetag:
            dateline_string = self.tag_to_string(datetag,False)
            date_items = dateline_string.split(',')
            datestring = date_items[0]+date_items[1]
            article_date = decode_us_date(datestring)
            earliest_date = date.today() - timedelta(days=self.oldest_article)
            if article_date < earliest_date:
                self.log("Skipping article dated %s" % datestring)
                return None
            datetag.parent.extract()

            # place dateline in article heading
            bylinetag = soup.find('h3','byline')
            if bylinetag:
                h3bylinetag = bylinetag
            else:
                bylinetag = soup.find('li','byline')
                if bylinetag:
                    h3bylinetag = bylinetag.h3
                    if not h3bylinetag:
                        h3bylinetag = bylinetag
                    bylinetag = bylinetag.parent
            if bylinetag:
                if h3bylinetag.a:
                    bylinetext = 'By '+self.tag_to_string(h3bylinetag.a,False)
                else:
                    bylinetext = self.tag_to_string(h3bylinetag,False)
                h3byline = Tag(soup,'h3',[('class','byline')])
                if bylinetext.isspace() or (bylinetext == ''):
                    h3byline.insert(0,NavigableString(date_items[0]+','+date_items[1]))
                else:
                    h3byline.insert(0,NavigableString(bylinetext+u'\u2014'+date_items[0]+','+date_items[1]))
                bylinetag.replaceWith(h3byline)
            else:
                headlinetag = soup.find('div',attrs={'class' : re.compile("^articleHeadlineBox")})
                if headlinetag:
                    dateline = Tag(soup,'h3',[('class','byline')])
                    dateline.insert(0,NavigableString(date_items[0]+','+date_items[1]))
                    headlinetag.insert(len(headlinetag),dateline)
        else:
            # if no date tag, don't process this page--it's not a news item
            return None

        # This gets rid of the annoying superfluous bullet symbol preceding columnist bylines
        ultag = soup.find('ul',attrs={'class' : 'cMetadata metadataType-articleCredits'})
        if ultag:
            a = ultag.h3
            if a:
                ultag.replaceWith(a)
        return soup

    def parse_index(self):

        articles = {}
        key = None
        ans = []

        def parse_index_page(page_name,page_title):

            def article_title(tag):
                atag = tag.find('h2') # title is usually in an h2 tag
                if not atag: # if not, get text from the a tag
                    atag = tag.find('a',href=True)
                    if not atag:
                        return ''
                    t = self.tag_to_string(atag,False)
                    if t == '': # sometimes the title is in the second a tag
                        atag.extract()
                        atag = tag.find('a',href=True)
                        if not atag:
                            return ''
                        return self.tag_to_string(atag,False)
                    return t
                return self.tag_to_string(atag,False)

            def article_author(tag):
                atag = tag.find('strong') # author is usually in a strong tag
                if not atag:
                    atag = tag.find('h4') # if not, look for an h4 tag
                    if not atag:
                        return ''
                return self.tag_to_string(atag,False)

            def article_summary(tag):
                atag = tag.find('p')
                if not atag:
                    return ''
                subtag = atag.strong
                if subtag:
                    subtag.extract()
                return self.tag_to_string(atag,False)

            def article_url(tag):
                atag = tag.find('a',href=True)
                if not atag:
                    return ''
                url = re.sub(r'\?.*', '', atag['href'])
                return url

            def handle_section_name(tag):
                # turns a tag into a section name with special processing
                # for What's News, U.S., World & U.S. and World
                s = self.tag_to_string(tag,False)
                if ("What" in s) and ("News" in s):
                    s = "What's News"
                elif (s == "U.S.") or (s == "World & U.S.") or (s == "World"):
                    s = s + " News"
                return s

            mainurl = 'http://online.wsj.com'
            pageurl = mainurl+page_name
            #self.log("Page url %s" % pageurl)
            soup = self.index_to_soup(pageurl)
            # Find each instance of div with class including "headlineSummary"
            for divtag in soup.findAll('div',attrs={'class' : re.compile("^headlineSummary")}):

                # divtag contains all article data as ul's and li's
                # first, check if there is an h3 tag which provides a section name
                stag = divtag.find('h3')
                if stag:
                    if stag.parent['class'] == 'dynamic':
                        # a carousel of articles is too complex to extract a section name
                        # for each article, so we'll just call the section "Carousel"
                        section_name = 'Carousel'
                    else:
                        section_name = handle_section_name(stag)
                else:
                    section_name = "What's News"
                #self.log("div Section %s" % section_name)

                # find each top-level ul in the div
                # we don't restrict to class = newsItem because the section_name
                # sometimes changes via a ul tag inside the div
                for ultag in divtag.findAll('ul',recursive=False):

                    stag = ultag.find('h3')
                    if stag:
                        if stag.parent.name == 'ul':
                            # section name has changed
                            section_name = handle_section_name(stag)
                            #self.log("ul Section %s" % section_name)
                            # delete the h3 tag so it doesn't get in the way
                            stag.extract()

                    # find each top level li in the ul
                    for litag in ultag.findAll('li',recursive=False):

                        stag = litag.find('h3')
                        if stag:
                            # section name has changed
                            section_name = handle_section_name(stag)
                            #self.log("li Section %s" % section_name)
                            # delete the h3 tag so it doesn't get in the way
                            stag.extract()

                        # if there is a ul tag inside the li it is superfluous;
                        # it is probably a list of related articles
                        utag = litag.find('ul')
                        if utag:
                            utag.extract()

                        # now skip paid subscriber articles if desired
                        subscriber_tag = litag.find(text="Subscriber Content")
                        if subscriber_tag:
                            if self.omit_paid_content:
                                continue

                        # delete the tip div so it doesn't get in the way
                        tiptag = litag.find("div", { "class" : "tipTargetBox" })
                        if tiptag:
                            tiptag.extract()

                        h1tag = litag.h1
                        # if there's an h1 tag, its parent is a div which should replace
                        # the li tag for the analysis
                        if h1tag:
                            litag = h1tag.parent

                        h5tag = litag.h5
                        if h5tag:
                            # section name has changed
                            section_name = self.tag_to_string(h5tag,False)
                            #self.log("h5 Section %s" % section_name)
                            # delete the h5 tag so it doesn't get in the way
                            h5tag.extract()

                        url = article_url(litag)
                        if url == '':
                            continue
                        if url.startswith("/article"):
                            url = mainurl+url
                        if not url.startswith("http://online.wsj.com"):
                            continue
                        if not url.endswith(".html"):
                            continue
                        if 'video' in url:
                            continue

                        title = article_title(litag)
                        if title == '':
                            continue
                        #self.log("URL %s" % url)
                        #self.log("Title %s" % title)
                        pubdate = ''
                        #self.log("Date %s" % pubdate)
                        author = article_author(litag)
                        if author == '':
                            author = section_name
                        elif author == section_name:
                            author = ''
                        else:
                            author = section_name+': '+author
                        #if not author == '':
                        #    self.log("Author %s" % author)
                        description = article_summary(litag)
                        #if not description == '':
                        #    self.log("Description %s" % description)
                        if not articles.has_key(page_title):
                            articles[page_title] = []
                        articles[page_title].append(
                            dict(title=title,url=url,date=pubdate,description=description,author=author,content=''))

        for page_name,page_title in self.sectionlist:
            parse_index_page(page_name,page_title)
            ans.append(page_title)

        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans
#1224
Connoisseur
Posts: 51
Karma: 10
Join Date: Dec 2008
Location: Germany
Device: SONY PRS-500
NRC next vs. NRC International
The nrc entry under the Dutch section is in the Dutch language. The NRC International service is in English, and I can't find it listed in any of the English lists in Calibre. XG
#1225
Connoisseur
Posts: 51
Karma: 10
Join Date: Dec 2008
Location: Germany
Device: SONY PRS-500
the nrc recipes
Kovid,
Sorry, I meant to reply to this post of yours; I'm not sure what happened. Anyway... the nrc entry under the Dutch section is in the Dutch language. The NRC International service is in English, and I can't find it listed in any of the English lists in Calibre. XG
#1226
Connoisseur
Posts: 78
Karma: 192
Join Date: Nov 2009
Device: Sony PRS-600
Well, I never noticed before, but it isn't. What is there is 'nrcnext', the news blog of the sister publication of 'NRC Handelsblad'. 'NRC International' offers the most interesting articles of the latter in an English translation.
I never made a recipe for 'NRC Handelsblad' because they offer a DRM-free subscription for an electronic version (ePub, Mobi or PDF) for 84 euros/year. A bargain for what is sort of the New York Times of the Netherlands. In addition, there is also the 'Fokke en Sukke' recipe that combines the cartoons published in both 'nrcnext' and 'NRC Handelsblad'. (And yes, we have 25 political parties as well :-) Edwin
#1227
creator of calibre
Posts: 45,386
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
@evanmasstrigt: I'm happy to support their efforts to provide an ebook version. I have been planning to write a subclass of BasicNewsRecipe that allows download of news published in EPUB format (via the subscription) and outputs the news in OPF+HTML as needed by the conversion system.
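To make the idea concrete, here is a rough sketch of what such a subclass could look like. This is not the actual calibre implementation: the class name EpubNewsRecipe, the epub_url attribute and the unpack-and-return-the-OPF flow are all assumptions made for illustration.
Code:
# Rough sketch only -- not the actual calibre implementation.
# EpubNewsRecipe and epub_url are invented names for illustration.
import os, zipfile
from calibre.web.feeds.news import BasicNewsRecipe

class EpubNewsRecipe(BasicNewsRecipe):
    # subclasses would point this at the publisher's EPUB download link
    epub_url = None

    def build_index(self):
        # fetch the EPUB with the recipe's browser (which may be logged in)
        br = self.get_browser()
        raw = br.open(self.epub_url).read()
        epub_path = os.path.join(self.output_dir, 'issue.epub')
        open(epub_path, 'wb').write(raw)
        # an EPUB is a zip archive that already contains OPF+HTML,
        # so unpacking it yields the layout the conversion system expects
        zf = zipfile.ZipFile(epub_path)
        zf.extractall(self.output_dir)
        # return the OPF as the entry point for conversion
        for name in zf.namelist():
            if name.endswith('.opf'):
                return os.path.join(self.output_dir, name)
        raise ValueError('No OPF found in downloaded EPUB')
Since the EPUB's internal HTML is already clean, most of the usual scraping machinery (parse_index, preprocess_html) could simply be bypassed.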
#1228
Member
Posts: 14
Karma: 10
Join Date: Aug 2009
Device: Kindle 2
You're welcome. Enjoy.
I've been meaning to come back to this recipe for some time. The text shows up as a lighter shade of grey (rather than black) on my Kindle 2. I imagine a quick recipe edit related to the style will fix it; I'll get around to it eventually.
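For reference, this is the kind of style fix meant here, assuming the grey comes from the site's stylesheet (an untested sketch; the selectors may need adjusting for this particular recipe):
Code:
# untested guess: force black text in the recipe, overriding the site's grey styling
extra_css = 'body, p, div { color: black; }'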
#1229
Junior Member
Posts: 1
Karma: 10
Join Date: Jan 2010
Device: Sony PRS-505
Recipe for fr-online.de
Hi,
I wrote a recipe for fr-online.de, the site of the German "Frankfurter Rundschau".
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe

__license__ = 'GPL v3'
__copyright__ = '2009, Justus Bisser <justus.bisser at gmail.com>'
'''
fr-online.de
'''

class Spiegel_ger(BasicNewsRecipe):
    title = 'Frankfurter Rundschau'
    __author__ = 'Justus Bisser'
    description = "Dies ist die Online-Ausgabe der Frankfurter Rundschau. Um die abgerufenen Feeds individuell einzustellen, bearbeiten Sie die Liste im erweiterten Modus. Die Feeds findet man auf http://www.fr-online.de/verlagsservice/fr_newsreader/?em_cnt=574255"
    publisher = 'Druck- und Verlagshaus Frankfurt am Main GmbH'
    category = 'FR Online, Frankfurter Rundschau, Nachrichten, News, Dienste, RSS, Feedreader, Newsfeed, iGoogle, Netvibes, Widget'
    oldest_article = 7
    max_articles_per_feed = 100
    language = 'de'
    lang = 'de-DE'
    no_stylesheets = True
    use_embedded_content = False
    #encoding = 'cp1252'
    conversion_options = {
        'comment'   : description,
        'tags'      : category,
        'publisher' : publisher,
        'language'  : lang
    }
    recursions = 0
    #keep_only_tags = [dict(name='div', attrs={'class':'text'})]
    #tags_remove = [dict(name='div', attrs={'style':'text-align: left; margin: 4px 0px 0px 4px; width: 200px; float: right;'})]
    remove_attributes = ['style']
    #remove_tags_before = [dict(name='div', attrs={'style':'padding-left: 0px;'})]
    #remove_tags_after = [dict(name='div', attrs={'class':'box_head_text'})]

    feeds = []
    # enable for all news
    allNews = 0
    if allNews:
        feeds = [(u'Frankfurter Rundschau', u'http://www.fr-online.de/rss/sport/index.xml')]
    else:
        # select the feeds you like
        feeds = [(u'Nachrichten', u'http://www.fr-online.de/rss/politik/index.xml')]
        feeds.append((u'Kommentare und Analysen', u'http://www.fr-online.de/rss/meinung/index.xml'))
        feeds.append((u'Dokumentationen', u'http://www.fr-online.de/rss/dokumentation/index.xml'))
        feeds.append((u'Deutschlandtrend', u'http://www.fr-online.de/rss/deutschlandtrend/index.xml'))
        feeds.append((u'Wirtschaft', u'http://www.fr-online.de/rss/wirtschaft/index.xml'))
        feeds.append((u'Sport', u'http://www.fr-online.de/rss/sport/index.xml'))
        feeds.append((u'Feuilleton', u'http://www.fr-online.de/rss/feuilleton/index.xml'))
        feeds.append((u'Panorama', u'http://www.fr-online.de/rss/panorama/index.xml'))
        feeds.append((u'Rhein Main und Hessen', u'http://www.fr-online.de/rss/hessen/index.xml'))
        feeds.append((u'Fitness und Gesundheit', u'http://www.fr-online.de/rss/fit/index.xml'))
        feeds.append((u'Multimedia', u'http://www.fr-online.de/rss/multimedia/index.xml'))
        feeds.append((u'Wissen und Bildung', u'http://www.fr-online.de/rss/wissen/index.xml'))

    def get_article_url(self, article):
        #string = article.link
        #string = string.replace('0C', '/')
        #string = string.replace('0I', '_')
        #string = string.replace('0E', '-')
        #string = string.replace('0B', '.')
        #string = string[string.find("fr-online.de"):]
        #string = "http://www." + string
        #return string
        url = article.link
        #url = url.replace('0A', '0')
        #url = url.replace('0I', '_')
        # extract the article id from the feed link and build the print-version URL
        regex = re.compile("0C[0-9]{6,8}0A?")
        liste = regex.findall(url)
        string = liste.pop(0)
        string = string[2:len(string)-1]
        return "http://www.fr-online.de/_em_cms/_globals/print.php?em_cnt=" + string
#1230
Connoisseur
Posts: 82
Karma: 118
Join Date: Dec 2005
Device: Kindle 2
I would like a recipe for The Week. The RSS feeds can be found at http://www.theweek.com/home/sitemap. Can anyone help?
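As a starting point for whoever picks this up, a bare-bones skeleton might look like the following. It is untested, and the feed name and URL below are placeholders that would have to be replaced with real entries from the sitemap page above.
Code:
# Untested skeleton, not a working recipe: the feed entry below is a
# placeholder -- substitute real RSS URLs from theweek.com's sitemap.
from calibre.web.feeds.news import BasicNewsRecipe

class TheWeek(BasicNewsRecipe):
    title = u'The Week'
    language = 'en'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    feeds = [
        (u'Placeholder section', u'http://www.theweek.com/rss/placeholder.xml'),
    ]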