12-28-2012, 11:20 PM | #1 |
Connoisseur
Posts: 55
Karma: 13316
Join Date: Jul 2012
Device: iPad
|
Harper's Print Edition recipe update
This is an update on Darko Miletic's great work. I updated the cover image processing and the detection of the current issue, and made a few minor changes.
I changed the title to Harper's Magazine - Print Edition, mostly to mark this as a fork, as I didn't get a chance to communicate with the original author.
Code:
__license__ = 'GPL v3'
__copyright__ = '2008-2012, Darko Miletic <darko.miletic at gmail.com>'
'''
harpers.org - paid subscription/ printed issue articles
This recipe only gets articles published in text format;
images and PDFs are ignored.
If you have an institutional subscription based on access IP, you do not
need to enter anything in the username/password fields.
'''

import re
import time
import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe


class Harpers_full(BasicNewsRecipe):
    title = "Harper's Magazine - Printed Edition"
    __author__ = 'Darko Miletic'
    description = ("Harper's Magazine, the oldest general-interest monthly in America, "
                   "explores the issues that drive our national conversation, through "
                   "long-form narrative journalism and essays, and such celebrated "
                   "features as the iconic Harper's Index.")
    publisher = "Harper's"
    category = 'news, politics, USA'
    oldest_article = 30
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    delay = 1
    language = 'en'
    encoding = 'utf8'
    needs_subscription = 'optional'
    masthead_url = 'http://harpers.org/wp-content/themes/harpers/images/pheader.gif'
    publication_type = 'magazine'
    INDEX = ''
    LOGIN = 'http://harpers.org/wp-content/themes/harpers/ajax_login.php'
    extra_css = """
        body{font-family: adobe-caslon-pro,serif}
        .category{font-size: small}
        .articlePost p:first-letter{display: inline; font-size: xx-large; font-weight: bold}
    """

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
    }

    keep_only_tags = [
        dict(name='div', attrs={'class': ['postdetailFull', 'articlePost']})
    ]
    remove_tags = [
        dict(name='div', attrs={'class': 'fRight rightDivPad'}),
        dict(name=['link', 'meta', 'object', 'embed', 'iframe']),
    ]
    remove_attributes = ['xmlns']

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        br.open('http://harpers.org/')
        # log in only when credentials were supplied (subscription is optional)
        if self.username is not None and self.password is not None:
            tt = int(time.time() * 1000)  # current time in milliseconds
            data = urllib.urlencode({
                'm': self.username,
                'p': self.password,
                'rt': 'http://harpers.org/',
                'tt': tt,
            })
            br.open(self.LOGIN, data)
        return br

    def parse_index(self):
        # find current issue
        soup = self.index_to_soup('http://harpers.org/')
        currentIssue = soup.find('div', attrs={'class': 'mainNavi'}).find('li', attrs={'class': 'curentIssue'})
        currentIssue_url = self.tag_to_string(currentIssue.a['href'])
        self.log(currentIssue_url)

        # go to the current issue
        soup1 = self.index_to_soup(currentIssue_url)
        date = re.split(r'\s\|\s', self.tag_to_string(soup1.head.title.string))[0]
        self.timefmt = u' [%s]' % date

        # get cover
        coverurl = ('http://harpers.org/wp-content/themes/harpers/ajax_microfiche.php?img=harpers-'
                    + re.split('harpers.org/', currentIssue_url)[1] + 'gif/0001.gif')
        soup2 = self.index_to_soup(coverurl)
        self.cover_url = self.tag_to_string(soup2.find('img')['src'])
        self.log(self.cover_url)

        articles = []
        count = 0
        for item in soup1.findAll('div', attrs={'class': 'articleData'}):
            text_links = item.findAll('h2')
            for text_link in text_links:
                if count == 0:
                    # skip the first heading on the page
                    count = 1
                else:
                    url = text_link.a['href']
                    title = text_link.a.contents[0]
                    date = strftime(' %B %Y')
                    articles.append({
                        'title': title,
                        'date': date,
                        'url': url,
                        'description': '',
                    })
        return [(soup1.head.title.string, articles)]

    def print_version(self, url):
        return url + '?single=1'

    def cleanup(self):
        # log out once the download has finished
        soup = self.index_to_soup('http://harpers.org/')
        signouturl = self.tag_to_string(soup.find('li', attrs={'class': 'subLogOut'}).findNext('li').a['href'])
        self.log(signouturl)
        self.browser.open(signouturl)
|
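For readers who want to try the current-issue lookup without calibre installed, the idea behind parse_index can be sketched with the standard library alone. This is only a rough sketch in Python 3, not the recipe's own code: the HTML snippet, the issue URL and the title string are invented samples, and CurrentIssueFinder and issue_date are hypothetical helpers. Only the curentIssue class name and the " | " title separator come from the recipe above.

```python
import re
from html.parser import HTMLParser


class CurrentIssueFinder(HTMLParser):
    """Hypothetical helper: grabs the first link inside <li class="curentIssue">."""

    def __init__(self):
        super().__init__()
        self.in_current = False  # True while inside the current-issue <li>
        self.url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li' and 'curentIssue' in (attrs.get('class') or '').split():
            self.in_current = True
        elif tag == 'a' and self.in_current and self.url is None:
            self.url = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_current = False


def issue_date(page_title):
    # Same split as the recipe: keep the text before the first " | ".
    return re.split(r'\s\|\s', page_title)[0]


# Invented sample markup; the real harpers.org page may differ.
SAMPLE_NAV = '''
<div class="mainNavi"><ul>
  <li class="archive"><a href="http://harpers.org/archive/">Archive</a></li>
  <li class="curentIssue"><a href="http://harpers.org/archive/2013/04/">Current Issue</a></li>
</ul></div>
'''

finder = CurrentIssueFinder()
finder.feed(SAMPLE_NAV)
current_issue_url = finder.url
date = issue_date("April 2013 | Harper's Magazine")
```

Here current_issue_url ends up holding the link from the curentIssue item, and date holds the part of the title before the separator.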
12-29-2012, 11:51 AM | #2 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
Dude, I already posted this update to the calibre bug tracker.
|
03-25-2013, 06:30 PM | #3 |
Connoisseur
Posts: 55
Karma: 13316
Join Date: Jul 2012
Device: iPad
|
Update: fixed cover image.
Code:
__license__ = 'GPL v3'
__copyright__ = '2008-2012, Darko Miletic <darko.miletic at gmail.com>'
'''
harpers.org - paid subscription/ printed issue articles
This recipe only gets articles published in text format;
images and PDFs are ignored.
If you have an institutional subscription based on access IP, you do not
need to enter anything in the username/password fields.
'''

import re
import time
import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe


class Harpers_full(BasicNewsRecipe):
    title = "Harper's Magazine - Printed Edition"
    __author__ = 'Darko Miletic'
    description = ("Harper's Magazine, the oldest general-interest monthly in America, "
                   "explores the issues that drive our national conversation, through "
                   "long-form narrative journalism and essays, and such celebrated "
                   "features as the iconic Harper's Index.")
    publisher = "Harper's"
    category = 'news, politics, USA'
    oldest_article = 30
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    delay = 1
    language = 'en'
    encoding = 'utf8'
    needs_subscription = 'optional'
    masthead_url = 'http://harpers.org/wp-content/themes/harpers/images/pheader.gif'
    publication_type = 'magazine'
    INDEX = ''
    LOGIN = 'http://harpers.org/wp-content/themes/harpers/ajax_login.php'
    extra_css = """
        body{font-family: adobe-caslon-pro,serif}
        .category{font-size: small}
        .articlePost p:first-letter{display: inline; font-size: xx-large; font-weight: bold}
    """

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
    }

    keep_only_tags = [
        dict(name='div', attrs={'class': ['postdetailFull', 'articlePost']})
    ]
    remove_tags = [
        dict(name='div', attrs={'class': 'fRight rightDivPad'}),
        dict(name=['link', 'meta', 'object', 'embed', 'iframe']),
    ]
    remove_attributes = ['xmlns']

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        br.open('http://harpers.org/')
        # log in only when credentials were supplied (subscription is optional)
        if self.username is not None and self.password is not None:
            tt = int(time.time() * 1000)  # current time in milliseconds
            data = urllib.urlencode({
                'm': self.username,
                'p': self.password,
                'rt': 'http://harpers.org/',
                'tt': tt,
            })
            br.open(self.LOGIN, data)
        return br

    def parse_index(self):
        # find current issue
        soup = self.index_to_soup('http://harpers.org/')
        currentIssue = soup.find('div', attrs={'class': 'mainNavi'}).find('li', attrs={'class': 'curentIssue'})
        currentIssue_url = self.tag_to_string(currentIssue.a['href'])

        # go to the current issue
        soup1 = self.index_to_soup(currentIssue_url)
        date = re.split(r'\s\|\s', self.tag_to_string(soup1.head.title.string))[0]
        self.timefmt = u' [%s]' % date

        # get cover: first image inside the picture_hp block of the issue page
        self.cover_url = soup1.find('div', attrs={'class': 'picture_hp'}).find('img', src=True)['src']

        articles = []
        count = 0
        for item in soup1.findAll('div', attrs={'class': 'articleData'}):
            text_links = item.findAll('h2')
            for text_link in text_links:
                if count == 0:
                    # skip the first heading on the page
                    count = 1
                else:
                    url = text_link.a['href']
                    title = text_link.a.contents[0]
                    date = strftime(' %B %Y')
                    articles.append({
                        'title': title,
                        'date': date,
                        'url': url,
                        'description': '',
                    })
        return [(soup1.head.title.string, articles)]

    def print_version(self, url):
        return url + '?single=1'

    def cleanup(self):
        # log out once the download has finished
        soup = self.index_to_soup('http://harpers.org/')
        signouturl = self.tag_to_string(soup.find('li', attrs={'class': 'subLogOut'}).findNext('li').a['href'])
        self.log(signouturl)
        self.browser.open(signouturl)
|
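The cover lookup in this version chains two find calls, so a page without a picture_hp block would raise an exception instead of returning nothing. A more defensive variant could look roughly like the sketch below, written in Python 3 with the standard library only so it runs without calibre. The sample markup and image URLs are invented, and CoverFinder and find_cover are hypothetical helpers; only the picture_hp class name comes from the recipe.

```python
from html.parser import HTMLParser


class CoverFinder(HTMLParser):
    """Hypothetical helper: first <img src=...> inside <div class="picture_hp">."""

    def __init__(self):
        super().__init__()
        self.depth = 0         # greater than 0 while inside the picture_hp block
        self.cover_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            classes = (attrs.get('class') or '').split()
            # enter the cover block, or track a nested <div> inside it
            if self.depth or 'picture_hp' in classes:
                self.depth += 1
        elif tag == 'img' and self.depth and self.cover_url is None:
            self.cover_url = attrs.get('src') or None

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


def find_cover(html):
    """Return the cover URL, or None when the block or the image is missing."""
    finder = CoverFinder()
    finder.feed(html)
    return finder.cover_url


# Invented sample markup; the real issue page may differ.
WITH_COVER = '<div class="picture_hp"><div><img src="http://harpers.org/cover.jpg"></div></div>'
WITHOUT_COVER = '<div class="other"><img src="http://harpers.org/logo.gif"></div>'

cover = find_cover(WITH_COVER)
missing = find_cover(WITHOUT_COVER)
```

Returning None leaves it to the caller to decide what to do when the page layout changes.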
03-29-2013, 09:17 PM | #4 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
Please stop posting code with the cleanup part. There is absolutely no need to log out; it is just a waste of resources.
|
03-30-2013, 06:44 AM | #5 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
Also, your update was, again, based on an older version of the code. In future, either make your edits on the version of the recipe shipped with calibre (not the custom one you have) or just submit the changes to me and I'll do it.
|
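One way to send only the changes, as suggested above, is a unified diff against the recipe shipped with calibre. The following is a minimal sketch using Python's standard difflib module. The two snippets are shortened stand-ins rather than the actual shipped and custom files, and the file labels are invented.

```python
import difflib

# Shortened stand-ins for the two recipe versions (not the full files).
shipped = [
    "        #get cover\n",
    "        self.cover_url = self.tag_to_string(soup2.find('img')['src'])\n",
]
custom = [
    "        #get cover\n",
    "        self.cover_url = soup1.find('div', attrs={'class': 'picture_hp'}).find('img', src=True)['src']\n",
]

diff_lines = list(difflib.unified_diff(
    shipped,
    custom,
    fromfile='shipped.recipe',  # invented label
    tofile='custom.recipe',     # invented label
))
patch = ''.join(diff_lines)
print(patch)
```

The resulting text marks removed lines with a leading "-" and added lines with a leading "+", which is usually easier to review than a full copy of the recipe.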
04-04-2013, 10:38 AM | #6 |
Connoisseur
Posts: 55
Karma: 13316
Join Date: Jul 2012
Device: iPad
|
Sure, I will. Sorry.
|