#1
Member
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
Custom Instapaper Recipe
UPDATE: Check below for the latest version of the recipe. Feel free to give it a try!

Hi! I'm new to the ebook scene, but I stumbled across Calibre and it is pretty amazing. Kudos to all the developers!

I've been trying to find a way to have my unread Instapaper articles downloaded, placed into an ebook, and then marked Archived on the Instapaper site. From what I can tell, this seems possible through a recipe and the Instapaper APIs. I have tried Darko's recipe, but right now it only fetches the most recent 40 articles (about the same number that appear on the first page of Instapaper). I was hoping to be able to download all my articles. That recipe also doesn't archive articles after they're downloaded. I also noticed that Calibre now has auto-clean using readability, which I would prefer over the Instapaper text-only feature.

If it's basic enough that someone could write it for me, that would be great. Otherwise, if anyone could give me clues as to where to start, or Python resources to read, that would be really appreciated too. I'd love to learn! I have a little experience with programming (C, HTML), but nothing with Python. Thanks in advance for any help!

Newest Recipe (01.09.2011)

Code:
import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title = u'Instapaper Recipe'
    __author__ = 'Darko Miletic'
    publisher = 'Instapaper.com'
    category = 'info, custom, Instapaper'
    oldest_article = 365
    max_articles_per_feed = 100
    auto_cleanup = True
    # Download the articles in reverse order so that the oldest appear first.
    reverse_article_order = True
    needs_subscription = True
    INDEX = u'http://www.instapaper.com'
    LOGIN = INDEX + u'/user/login'

    # Six pages of articles are downloaded to ensure none are missed
    # (6 pages = 240 articles). The page order is reversed so that the
    # oldest articles are downloaded first.
    feeds = [
        (u'Instapaper Unread - Pg. 6', u'http://www.instapaper.com/u/6'),
        (u'Instapaper Unread - Pg. 5', u'http://www.instapaper.com/u/5'),
        (u'Instapaper Unread - Pg. 4', u'http://www.instapaper.com/u/4'),
        (u'Instapaper Unread - Pg. 3', u'http://www.instapaper.com/u/3'),
        (u'Instapaper Unread - Pg. 2', u'http://www.instapaper.com/u/2'),
        (u'Instapaper Unread - Pg. 1', u'http://www.instapaper.com/u/1'),
        (u'Instapaper Starred', u'http://www.instapaper.com/starred')
    ]

    def get_browser(self):
        # Log in first, since the unread list requires a session.
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed') + ' %s...' %
                                 (feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            # The form_key is needed by the optional "Archive All" cleanup below.
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            for item in soup.findAll('div', attrs={'class': 'titleRow'}):
                atag = item.a
                if atag and atag.has_key('href'):
                    articles.append({'url': atag['href']})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    # Delete the "#"s below to have the recipe archive all your unread
    # articles after downloading.
    #def cleanup(self):
    #    params = urllib.urlencode(dict(form_key=self.myFormKey, submit="Archive All"))
    #    self.browser.open("http://www.instapaper.com/bulk-archive", params)

    def print_version(self, url):
        # The titleRow links point at the original pages, so use them as-is.
        return url

    def populate_article_metadata(self, article, soup, first):
        article.title = soup.find('title').contents[0].strip()

    def postprocess_html(self, soup, first_fetch):
        # Insert the page title as a heading at the top of each article.
        for link_tag in soup.findAll(attrs={"id": "story"}):
            link_tag.insert(0, '<h1>' + soup.find('title').contents[0].strip() + '</h1>')
        return soup

This is a modified version of Darko's original Instapaper recipe.

Changes:
- Multiple pages of articles are downloaded, not just the first 40.
- The article order is reversed so the oldest articles appear first. (Credit: Cendalc)
- Option to archive all articles after they are downloaded, which stops Calibre from downloading the same articles over and over. (Delete the "#"s to enable it.) (Credit: Cendalc, Banjopicker)
- The original web content is downloaded and simplified with readability rather than Instapaper's text-only feature. This works better in my experience: no more problems with some webpages not downloading.

Known Bugs:
- Fewer images download than before. (Some may prefer this, as it saves space.)
- All articles are archived, rather than just the ones downloaded.

Test it and give me feedback. I have no Python experience, so this might be messy. Thanks!

Last edited by haroldtreen; 09-01-2011 at 05:48 PM.
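(A note on the first known bug: if you have more than six pages of unread articles, the hand-written feeds list in the recipe can be replaced with a generated one. A sketch, where the page count N is a placeholder you would set for your own account:)

Code:
# Build feed entries for pages N down to 1 (oldest first), plus starred.
# Drop this in place of the hand-written feeds list in the class body.
N = 10  # placeholder: set this to your own number of unread pages
feeds = [(u'Instapaper Unread - Pg. %d' % i,
          u'http://www.instapaper.com/u/%d' % i)
         for i in range(N, 0, -1)]
feeds.append((u'Instapaper Starred', u'http://www.instapaper.com/starred'))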
#2
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
See the writing your own recipe section of https://www.mobileread.com/forums/sho...d.php?t=121439
To use readability, just add auto_cleanup = True to your recipe.
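For reference, a bare-bones recipe using that flag might look like this (the feed title and URL below are placeholders, not a real feed):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class MinimalCleanupRecipe(BasicNewsRecipe):
    title = u'Minimal Auto-Cleanup Example'
    # auto_cleanup runs calibre's readability-based cleaner on every
    # downloaded page, so no keep_only_tags/remove_tags are needed.
    auto_cleanup = True
    # Placeholder feed -- point this at a real RSS/Atom URL.
    feeds = [(u'Example Feed', u'http://example.com/feed.rss')]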
#3
Member
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
Thanks Kovid!
This is where I am so far... Code:
import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title = u'InstapaperAuto'
    __author__ = 'Darko Miletic'
    publisher = 'Instapaper.com'
    category = 'info, custom, Instapaper'
    oldest_article = 365
    max_articles_per_feed = 100
    auto_cleanup = True
    reverse_article_order = True
    needs_subscription = True
    INDEX = u'http://www.instapaper.com'
    LOGIN = INDEX + u'/user/login'

    feeds = [
        (u'Instapaper Unread - Pg. 6', u'http://www.instapaper.com/u/6'),
        (u'Instapaper Unread - Pg. 5', u'http://www.instapaper.com/u/5'),
        (u'Instapaper Unread - Pg. 4', u'http://www.instapaper.com/u/4'),
        (u'Instapaper Unread - Pg. 3', u'http://www.instapaper.com/u/3'),
        (u'Instapaper Unread - Pg. 2', u'http://www.instapaper.com/u/2'),
        (u'Instapaper Unread - Pg. 1', u'http://www.instapaper.com/u/1'),
        (u'Instapaper Starred', u'http://www.instapaper.com/starred')
    ]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed') + ' %s...' %
                                 (feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            # Grab the form_key so cleanup() can submit the archive form.
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            for item in soup.findAll('div', attrs={'class': 'cornerControls'}):
                description = self.tag_to_string(item.div)
                atag = item.a
                if atag and atag.has_key('href'):
                    articles.append({'url': atag['href']})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    def cleanup(self):
        # Archive everything once the download has finished.
        params = urllib.urlencode(dict(form_key=self.myFormKey, submit="Archive All"))
        self.browser.open("http://www.instapaper.com/bulk-archive", params)

    def print_version(self, url):
        # The cornerControls links are site-relative, so prefix the domain.
        return 'http://www.instapaper.com' + url

    def populate_article_metadata(self, article, soup, first):
        article.title = soup.find('title').contents[0].strip()

    def postprocess_html(self, soup, first_fetch):
        for link_tag in soup.findAll(attrs={"id": "story"}):
            link_tag.insert(0, '<h1>' + soup.find('title').contents[0].strip() + '</h1>')
        return soup

Changes:
- I added feeds for 6 unread pages instead of 1. I only have 5 pages, but adding 6 leaves room in case I get more. When I open the file on my Kindle, only 5 sections are displayed, so it omits empty ones. I like the 5 sections of 40 articles rather than 1 section of 200 articles.
- Added auto_cleanup = True. This decreases the size of the download from 4 MB to 2.7 MB; there are a lot fewer useless photos.
- Implemented the "Archive All" modification that cendalc/banjopicker created (https://www.mobileread.com/forums/sho...8&postcount=13)

Update:
- Added "reverse_article_order = True" (Credit: Cendalc) and switched the order of the feeds so that older articles appear first. That way reading can be done in chronological order.

Comments:
I sort of patched this together with trial and error. The parts from def parse_index to the end still confuse me.

I believe the auto-cleanup feature is cleaning the already-created text version that Instapaper makes. Is that true? If so, how would I go about making the program open the links in the feed and then apply the auto-cleanup directly to the webpages themselves? I find that Instapaper's text feature gives a few too many "Page not available"s and that readability is a bit better.
Lastly, the archive-all feature is fine, but is there a way to archive pages as they are opened and packaged? That way, if someone wanted to download only a few articles, their entire collection wouldn't be archived.

Thanks for any feedback! (This recipe stuff is cool!)

Last edited by haroldtreen; 09-01-2011 at 05:33 PM. Reason: Added article reverse
#4
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
#5
Member
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
It's a lot of the variables, such as "lfeeds", "feedobj", and "self" (what's the importance of it being enclosed in parse_index()?).

I think that because I only know C, I'm confused about why these things that look like variables aren't defined anywhere. It would be useful to have a more complex recipe like this with # comments beside things explaining what they are doing. And is def how you call a defined function? I really need to pick up a good book on Python. :P
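For other C programmers following along, here is a minimal sketch of what def and self mean (the class and names below are made up for illustration, not part of the recipe):

Code:
class Example(object):
    pages = 6  # a class attribute, shared by all instances

    # 'def' defines a function; written inside a class, it becomes a
    # method. 'self' is the instance the method is called on -- Python
    # passes it in automatically, a bit like an explicit 'this' in C++.
    def describe(self):
        return 'Example with %d pages' % self.pages

e = Example()       # create an instance; no declarations needed --
print e.describe()  # names like 'e', 'lfeeds', 'feedobj' simply spring
                    # into existence when they are first assigned to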
#6
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
autoclean works on whatever is downloaded. If your recipe is downloading plain text, then it will run on that. If you want to run it on the original, you will need to modify the recipe to download the original HTML. Not being an Instapaper user, I can't tell you how to do that.
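For this particular recipe, the switch Kovid describes comes down to print_version, which maps each scraped link to the URL that actually gets downloaded. A sketch comparing the two behaviours, based on the recipes posted earlier in this thread:

Code:
# Old behaviour: the links scraped from the cornerControls divs are
# site-relative pointers to Instapaper's text-only copy, so the domain
# is prefixed on and the plain-text version is what gets cleaned:
#
#   def print_version(self, url):
#       return 'http://www.instapaper.com' + url

# New behaviour: scrape the titleRow links instead, which point straight
# at the original article, and return them untouched so auto_cleanup
# runs on the original HTML:
def print_version(self, url):
    return url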
#7
Member
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
Thanks again Kovid!
So I realized that the code is looking for the HTML tags that hold the info you want to clean. Even though I don't know what a lot of the Python means, I can see somewhat what is going on. I changed Darko's code to pull the URL from the website, which is then cleaned with the auto-cleanup feature. With that, I believe this recipe does exactly what I want now:

1) Pull all unread articles from Instapaper
2) Download a readability version of each article
3) Archive all the articles

As of now the only problems are:

1) Anyone with more than 6 pages of unread articles won't get ALL their articles.
2) All articles are archived as part of the cleanup. There should be a way to trigger the archive option after each URL is fetched... but that sort of Python is beyond me. If any developer knows how, I would love to see it.
3) Articles downloaded with this recipe seem to have fewer images than before. I looked at one webpage in three ways to see what might be up:
- When downloaded with the recipe, it has no images.
- When taken from the "text only" feature of Instapaper, it contains multiple images (although many weren't meant to be part of the article).
- When taken with readability inside Chrome, it shows correctly with 1 image.
This is me being a perfectionist, though. As long as all the content gets downloaded, I'm happy.

I'm going to post the new recipe in my post above. I'll include # comments so others with no coding background can modify it to their liking.

Cheers!
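On problem 2, a per-article version might look something like the sketch below, but be warned: only the /bulk-archive call shown earlier in the thread is confirmed. The /archive endpoint and the article_id field here are guesses that would have to be checked against the forms on Instapaper's own pages.

Code:
import urllib

def archive_article(self, article_id):
    # HYPOTHETICAL: the endpoint and field names are assumptions --
    # inspect the archive button's form on instapaper.com and match
    # whatever you actually find there.
    params = urllib.urlencode(dict(form_key=self.myFormKey,
                                   article_id=article_id))
    self.browser.open('http://www.instapaper.com/archive', params)

You would also have to scrape each article's id in parse_index, alongside its URL, so there is something to pass in.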
#8
Junior Member
Posts: 7
Karma: 10
Join Date: Oct 2011
Device: Nook Touch
TOC
Hi! This is my first post... I hope my English is good enough, as I'm Italian!

First of all, I'd like to thank haroldtreen for his recipe. It works smoothly! I just have one question. Since I switched from a Kindle to a Nook Touch, I've had a simple but annoying problem: the EPUB made with his recipe has a two-level TOC, but my Nook only reads the first level. How can I get my articles as the first level of the EPUB TOC?

Last edited by mojofleur; 06-23-2012 at 07:21 AM.
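One possible fix (an untested sketch): the second TOC level comes from the recipe having several feeds, one per Instapaper page, so collapsing everything into a single feed in parse_index should leave the articles at the top level of the TOC:

Code:
def parse_index(self):
    # Gather articles from every Instapaper page, but return them as a
    # single combined feed so the EPUB TOC only has one level.
    allarticles = []
    for feedtitle, feedurl in self.get_feeds():
        soup = self.index_to_soup(feedurl)
        self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
        for item in soup.findAll('div', attrs={'class': 'titleRow'}):
            atag = item.a
            if atag and atag.has_key('href'):
                allarticles.append({'url': atag['href']})
    return [(u'Instapaper', allarticles)]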
#9
Member Retired
Posts: 23
Karma: 40
Join Date: Sep 2011
Device: Android
Very cool recipe! It doesn't seem to auto-archive the downloaded articles on my Instapaper account, though. Oh well, I can always do it manually from the site.