#1
Member
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
Custom Instapaper Recipe
UPDATE: Check below for the latest version of the recipe. Feel free to give it a try!

Hi! I'm new to the ebook scene, but I stumbled across Calibre and it is pretty amazing. Kudos to all the developers!

I've been trying to find a way to have my unread Instapaper articles downloaded, placed into an ebook, and then marked Archived on the Instapaper site. From what I can tell, this seems possible through a recipe and the Instapaper APIs. I have tried Darko's recipe, but right now it only fetches the most recent 40 articles (about the same number that appear on the first page of Instapaper). I was hoping to be able to download all my articles. That recipe also doesn't archive articles after they're downloaded. I also noticed that Calibre now has auto-clean using readability, which I would prefer over the Instapaper text-only feature.

If it's basic enough that someone could write it for me, that would be great. Otherwise, if anyone could give me clues as to where to start, or Python resources to read, that would be really appreciated too. I'd love to learn! I have a little experience with programming (C, HTML), but nothing with Python. Thanks in advance for any help!

Newest Recipe (01.09.2011)

Code:
import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title = u'Instapaper Recipe'
    __author__ = 'Darko Miletic'
    publisher = 'Instapaper.com'
    category = 'info, custom, Instapaper'
    oldest_article = 365
    max_articles_per_feed = 100
    auto_cleanup = True
    # Download the articles in reverse order so that the oldest appear first.
    reverse_article_order = True
    needs_subscription = True
    INDEX = u'http://www.instapaper.com'
    LOGIN = INDEX + u'/user/login'

    # Six pages of articles are downloaded to ensure none are missed
    # (6 pages = 240 articles). The page order is reversed so that the
    # oldest articles are downloaded first.
    feeds = [
        (u'Instapaper Unread - Pg. 6', u'http://www.instapaper.com/u/6'),
        (u'Instapaper Unread - Pg. 5', u'http://www.instapaper.com/u/5'),
        (u'Instapaper Unread - Pg. 4', u'http://www.instapaper.com/u/4'),
        (u'Instapaper Unread - Pg. 3', u'http://www.instapaper.com/u/3'),
        (u'Instapaper Unread - Pg. 2', u'http://www.instapaper.com/u/2'),
        (u'Instapaper Unread - Pg. 1', u'http://www.instapaper.com/u/1'),
        (u'Instapaper Starred', u'http://www.instapaper.com/starred')
    ]

    def get_browser(self):
        # Log in first, since the unread list requires a session.
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed') + ' %s...' %
                                 (feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            # The form_key is needed by the optional "Archive All" cleanup below.
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            for item in soup.findAll('div', attrs={'class': 'titleRow'}):
                atag = item.a
                if atag and atag.has_key('href'):
                    articles.append({'url': atag['href']})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    # Delete the "#"s below to have the recipe archive all your unread
    # articles after downloading.
    #def cleanup(self):
    #    params = urllib.urlencode(dict(form_key=self.myFormKey, submit="Archive All"))
    #    self.browser.open("http://www.instapaper.com/bulk-archive", params)

    def print_version(self, url):
        # The titleRow links point at the original pages, so use them as-is.
        return url

    def populate_article_metadata(self, article, soup, first):
        article.title = soup.find('title').contents[0].strip()

    def postprocess_html(self, soup, first_fetch):
        # Insert the page title as a heading at the top of each article.
        for link_tag in soup.findAll(attrs={"id": "story"}):
            link_tag.insert(0, '<h1>' + soup.find('title').contents[0].strip() + '</h1>')
        return soup

This is a modified version of Darko's original Instapaper recipe.

Changes:
- Multiple pages of articles are downloaded, not just the first 40.
- The article order is reversed so the oldest articles appear first. (Credit: Cendalc)
- Option to archive all articles after they are downloaded, which stops Calibre from downloading the same articles over and over. (Delete the "#"s to enable it.) (Credit: Cendalc, Banjopicker)
- The original web content is downloaded and simplified with readability rather than Instapaper's text-only feature. This works better in my experience: no more problems with some webpages not downloading.

Known Bugs:
- Fewer images download than before. (Some may prefer this, as it saves space.)
- All articles are archived, rather than just the ones downloaded.

Test it and give me feedback. I have no Python experience, so this might be messy. Thanks!

Last edited by haroldtreen; 09-01-2011 at 05:48 PM.
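(A note on the first known bug: if you have more than six pages of unread articles, the hand-written feeds list in the recipe can be replaced with a generated one. A sketch, where the page count N is a placeholder you would set for your own account:)

Code:
# Build feed entries for pages N down to 1 (oldest first), plus starred.
# Drop this in place of the hand-written feeds list in the class body.
N = 10  # placeholder: set this to your own number of unread pages
feeds = [(u'Instapaper Unread - Pg. %d' % i,
          u'http://www.instapaper.com/u/%d' % i)
         for i in range(N, 0, -1)]
feeds.append((u'Instapaper Starred', u'http://www.instapaper.com/starred'))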
#2
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
See the writing your own recipe section of https://www.mobileread.com/forums/sho...d.php?t=121439
To use readability, just add auto_cleanup = True to your recipe.
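For reference, a bare-bones recipe using that flag might look like this (the feed title and URL below are placeholders, not a real feed):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class MinimalCleanupRecipe(BasicNewsRecipe):
    title = u'Minimal Auto-Cleanup Example'
    # auto_cleanup runs calibre's readability-based cleaner on every
    # downloaded page, so no keep_only_tags/remove_tags are needed.
    auto_cleanup = True
    # Placeholder feed -- point this at a real RSS/Atom URL.
    feeds = [(u'Example Feed', u'http://example.com/feed.rss')]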
#3
Member
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
Thanks Kovid!
This is where I am so far... Code:
import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title = u'InstapaperAuto'
    __author__ = 'Darko Miletic'
    publisher = 'Instapaper.com'
    category = 'info, custom, Instapaper'
    oldest_article = 365
    max_articles_per_feed = 100
    auto_cleanup = True
    reverse_article_order = True
    needs_subscription = True
    INDEX = u'http://www.instapaper.com'
    LOGIN = INDEX + u'/user/login'

    feeds = [
        (u'Instapaper Unread - Pg. 6', u'http://www.instapaper.com/u/6'),
        (u'Instapaper Unread - Pg. 5', u'http://www.instapaper.com/u/5'),
        (u'Instapaper Unread - Pg. 4', u'http://www.instapaper.com/u/4'),
        (u'Instapaper Unread - Pg. 3', u'http://www.instapaper.com/u/3'),
        (u'Instapaper Unread - Pg. 2', u'http://www.instapaper.com/u/2'),
        (u'Instapaper Unread - Pg. 1', u'http://www.instapaper.com/u/1'),
        (u'Instapaper Starred', u'http://www.instapaper.com/starred')
    ]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed') + ' %s...' %
                                 (feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            # Grab the form_key so cleanup() can submit the archive form.
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            for item in soup.findAll('div', attrs={'class': 'cornerControls'}):
                description = self.tag_to_string(item.div)
                atag = item.a
                if atag and atag.has_key('href'):
                    articles.append({'url': atag['href']})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    def cleanup(self):
        # Archive everything once the download has finished.
        params = urllib.urlencode(dict(form_key=self.myFormKey, submit="Archive All"))
        self.browser.open("http://www.instapaper.com/bulk-archive", params)

    def print_version(self, url):
        # The cornerControls links are site-relative, so prefix the domain.
        return 'http://www.instapaper.com' + url

    def populate_article_metadata(self, article, soup, first):
        article.title = soup.find('title').contents[0].strip()

    def postprocess_html(self, soup, first_fetch):
        for link_tag in soup.findAll(attrs={"id": "story"}):
            link_tag.insert(0, '<h1>' + soup.find('title').contents[0].strip() + '</h1>')
        return soup

Changes:
- I added feeds for 6 unread pages instead of 1. I only have 5 pages, but adding 6 leaves room in case I get more. When I open the file on my Kindle, only 5 sections are displayed, so it omits empty ones. I like the 5 sections of 40 articles rather than 1 section of 200 articles.
- Added auto_cleanup = True. This decreases the size of the download from 4 MB to 2.7 MB; there are a lot fewer useless photos.
- Implemented the "Archive All" modification that cendalc/banjopicker created (https://www.mobileread.com/forums/sho...8&postcount=13)

Update:
- Added "reverse_article_order = True" (Credit: Cendalc) and switched the order of the feeds so that older articles appear first. That way reading can be done in chronological order.

Comments:
I sort of patched this together with trial and error. The parts from def parse_index to the end still confuse me.

I believe the auto-cleanup feature is cleaning the already-created text version that Instapaper makes. Is that true? If so, how would I go about making the program open the links in the feed and then apply the auto-cleanup directly to the webpages themselves? I find that Instapaper's text feature gives a few too many "Page not available"s and that readability is a bit better.
Lastly, the archive-all feature is fine, but is there a way to archive pages as they are opened and packaged? That way, if someone wanted to download only a few articles, their entire collection wouldn't be archived.

Thanks for any feedback! (This recipe stuff is cool!)

Last edited by haroldtreen; 09-01-2011 at 05:33 PM. Reason: Added article reverse
#4
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
#5
Member
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
It's a lot of the variables, such as "lfeeds", "feedobj", and "self" (what's the importance of it being enclosed in parse_index()?).

I think that because I only know C, I'm confused about why these things that look like variables aren't defined anywhere. It would be useful to have a more complex recipe like this with # comments beside things explaining what they are doing. And is def how you call a defined function? I really need to pick up a good book on Python. :P
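For other C programmers following along, here is a minimal sketch of what def and self mean (the class and names below are made up for illustration, not part of the recipe):

Code:
class Example(object):
    pages = 6  # a class attribute, shared by all instances

    # 'def' defines a function; written inside a class, it becomes a
    # method. 'self' is the instance the method is called on -- Python
    # passes it in automatically, a bit like an explicit 'this' in C++.
    def describe(self):
        return 'Example with %d pages' % self.pages

e = Example()       # create an instance; no declarations needed --
print e.describe()  # names like 'e', 'lfeeds', 'feedobj' simply spring
                    # into existence when they are first assigned to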
#6
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
autoclean works on whatever is downloaded. If your recipe is downloading plain text, then it will run on that. If you want to run it on the original, you will need to modify the recipe to download the original HTML. Not being an Instapaper user, I can't tell you how to do that.
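For this particular recipe, the switch Kovid describes comes down to print_version, which maps each scraped link to the URL that actually gets downloaded. A sketch comparing the two behaviours, based on the recipes posted earlier in this thread:

Code:
# Old behaviour: the links scraped from the cornerControls divs are
# site-relative pointers to Instapaper's text-only copy, so the domain
# is prefixed on and the plain-text version is what gets cleaned:
#
#   def print_version(self, url):
#       return 'http://www.instapaper.com' + url

# New behaviour: scrape the titleRow links instead, which point straight
# at the original article, and return them untouched so auto_cleanup
# runs on the original HTML:
def print_version(self, url):
    return url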
#7
Member
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
Thanks again Kovid!
So I realized that the code is looking for the HTML tags that hold the info you want to clean. Even though I don't know what a lot of the Python means, I can see somewhat what is going on. I changed Darko's code to pull the URL from the website, which is then cleaned with the auto-cleanup feature. With that, I believe this recipe does exactly what I want now:

1) Pull all unread articles from Instapaper
2) Download a readability version of each article
3) Archive all the articles

As of now the only problems are:

1) Anyone with more than 6 pages of unread articles won't get ALL their articles.
2) All articles are archived as part of the cleanup. There should be a way to trigger the archive option after each URL is fetched... but that sort of Python is beyond me. If any developer knows how, I would love to see it.
3) Articles downloaded with this recipe seem to have fewer images than before. I looked at one webpage in three ways to see what might be up:
- When downloaded with the recipe, it has no images.
- When taken from the "text only" feature of Instapaper, it contains multiple images (although many weren't meant to be part of the article).
- When taken with readability inside Chrome, it shows correctly with 1 image.
This is me being a perfectionist, though. As long as all the content gets downloaded, I'm happy.

I'm going to post the new recipe in my post above. I'll include # comments so others with no coding background can modify it to their liking.

Cheers!
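On problem 2, a per-article version might look something like the sketch below, but be warned: only the /bulk-archive call shown earlier in the thread is confirmed. The /archive endpoint and the article_id field here are guesses that would have to be checked against the forms on Instapaper's own pages.

Code:
import urllib

def archive_article(self, article_id):
    # HYPOTHETICAL: the endpoint and field names are assumptions --
    # inspect the archive button's form on instapaper.com and match
    # whatever you actually find there.
    params = urllib.urlencode(dict(form_key=self.myFormKey,
                                   article_id=article_id))
    self.browser.open('http://www.instapaper.com/archive', params)

You would also have to scrape each article's id in parse_index, alongside its URL, so there is something to pass in.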
#8
Junior Member
Posts: 7
Karma: 10
Join Date: Oct 2011
Device: Nook Touch
TOC
Hi! This is my first post... I hope my English is good enough, as I'm Italian!

First of all, I'd like to thank haroldtreen for his recipe. It works smoothly! I just have one question. Since I switched from a Kindle to a Nook Touch, I've had a simple but annoying problem: the EPUB made with his recipe has a two-level TOC, but my Nook only reads the first level. How can I get my articles as the first level of the EPUB TOC?

Last edited by mojofleur; 06-23-2012 at 07:21 AM.
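One possible fix (an untested sketch): the second TOC level comes from the recipe having several feeds, one per Instapaper page, so collapsing everything into a single feed in parse_index should leave the articles at the top level of the TOC:

Code:
def parse_index(self):
    # Gather articles from every Instapaper page, but return them as a
    # single combined feed so the EPUB TOC only has one level.
    allarticles = []
    for feedtitle, feedurl in self.get_feeds():
        soup = self.index_to_soup(feedurl)
        self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
        for item in soup.findAll('div', attrs={'class': 'titleRow'}):
            atag = item.a
            if atag and atag.has_key('href'):
                allarticles.append({'url': atag['href']})
    return [(u'Instapaper', allarticles)]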
#9
Member Retired
Posts: 23
Karma: 40
Join Date: Sep 2011
Device: Android
Very cool recipe! It doesn't seem to auto-archive the downloaded articles on my Instapaper account, though. Oh well, I can always do it manually from the site.