Old 08-30-2011, 01:18 PM   #1
haroldtreen
Member
 
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
Custom Instapaper Recipe

UPDATE: Check below for the latest version of the recipe. Feel free to give it a try!

Hi,

I'm new to the ebook scene, but I have stumbled across Calibre and it is pretty amazing. Kudos to all the developers!

I've been trying to find a way to have my unread Instapaper articles downloaded, packaged into an ebook, and then marked as Archived on the Instapaper site. From what I can tell, this seems possible through a recipe and the Instapaper APIs.

I have tried Darko's recipe, but right now it only fetches the most recent 40 articles (about the number that appear on the first page of Instapaper). I was hoping to be able to download all my articles. That recipe also doesn't archive articles after they're downloaded.

I also noticed that Calibre now has auto-cleanup using readability, which I would prefer over Instapaper's text-only feature.

If it's basic enough that someone could write it for me, that would be great. Otherwise, if anyone could give me clues as to where to start, or Python resources to read, that would be really appreciated too. I'd love to learn.

I have a little experience with programming (C, HTML), but nothing with Python.

Thanks in advance for any help!



Newest Recipe (01.09.2011)

Code:
import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title                 = u'Instapaper Recipe'
    __author__            = 'Darko Miletic'
    publisher             = 'Instapaper.com'
    category              = 'info, custom, Instapaper'
    oldest_article        = 365
    max_articles_per_feed = 100
    auto_cleanup          = True

    # Download articles in reverse order so that the oldest appear first.
    reverse_article_order = True

    needs_subscription    = True
    INDEX                 = u'http://www.instapaper.com'
    LOGIN                 = INDEX + u'/user/login'

    # Six pages of articles are downloaded to ensure none are missed
    # (6 pages = 240 articles). The page order is reversed so that the
    # oldest articles are downloaded first.

    feeds          = [
            (u'Instapaper Unread - Pg. 6', u'http://www.instapaper.com/u/6'),
            (u'Instapaper Unread - Pg. 5', u'http://www.instapaper.com/u/5'),
            (u'Instapaper Unread - Pg. 4', u'http://www.instapaper.com/u/4'),
            (u'Instapaper Unread - Pg. 3', u'http://www.instapaper.com/u/3'),
            (u'Instapaper Unread - Pg. 2', u'http://www.instapaper.com/u/2'),
            (u'Instapaper Unread - Pg. 1', u'http://www.instapaper.com/u/1'),
            (u'Instapaper Starred', u'http://www.instapaper.com/starred')
            ]

    def get_browser(self):
        # Log in to Instapaper with the credentials configured in calibre.
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            # Grab the form_key so the (optional) archive call below can work.
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            # Each titleRow div links to the original article, not the
            # Instapaper text version.
            for item in soup.findAll('div', attrs={'class': 'titleRow'}):
                atag = item.a
                if atag and atag.has_key('href'):
                    articles.append({'url': atag['href']})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

#### Delete "#" to have the recipe archive all your unread articles after downloading.
    #def cleanup(self):
      #  params = urllib.urlencode(dict(form_key=self.myFormKey, submit="Archive All"))
      #  self.browser.open("http://www.instapaper.com/bulk-archive", params)

    def print_version(self, url):
         return url

    def populate_article_metadata(self, article, soup, first):
        # Use the page's own <title> as the article title.
        article.title = soup.find('title').contents[0].strip()

    def postprocess_html(self, soup, first_fetch):
        # Insert the title as an <h1> at the top of the story body.
        for link_tag in soup.findAll(attrs={"id": "story"}):
            link_tag.insert(0, '<h1>' + soup.find('title').contents[0].strip() + '</h1>')
        return soup

This is a modified version of Darko's original Instapaper recipe.

Changes:

- Multiple pages of articles are downloaded, not just the first 40.
- Article order is reversed so the oldest articles appear first. (Cred. Cendalc)
- Ability to have all articles archived after they are downloaded, which stops Calibre from downloading the same articles over and over. (Delete the "#"s to enable; the uncommented block is shown below.) (Cred. Cendalc, Banjopicker)
- The original web content is downloaded and simplified with readability rather than Instapaper's text-only feature. This works better in my experience: no more problems with some webpages not downloading.
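
For reference, the uncommented archive block looks like this (identical to the commented lines in the recipe, just with the "#"s removed; urllib is already imported at the top):

Code:
    def cleanup(self):
        # POST the "Archive All" form using the form_key captured in parse_index.
        params = urllib.urlencode(dict(form_key=self.myFormKey, submit="Archive All"))
        self.browser.open("http://www.instapaper.com/bulk-archive", params)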


Known Bugs:

- Fewer images are downloaded than before. (Some may prefer this, as it saves space...)
- All articles are archived, rather than just the ones downloaded.

Test it and give me feedback. I have no Python experience, so this might be messy :P.

Thanks!

Last edited by haroldtreen; 09-01-2011 at 05:48 PM.
Old 08-30-2011, 01:33 PM   #2
kovidgoyal
creator of calibre
Posts: 45,196
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
See the "writing your own recipe" section of https://www.mobileread.com/forums/sho...d.php?t=121439

To use readability, just add

auto_cleanup = True

to your recipe.
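
A minimal sketch of where that line goes (the class name and feed here are just placeholders):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class ExampleRecipe(BasicNewsRecipe):
    title        = u'Example'
    auto_cleanup = True  # run calibre's readability-based cleanup on each page
    feeds        = [(u'Example feed', u'http://example.com/rss')]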
Old 08-30-2011, 03:57 PM   #3
haroldtreen
Member
 
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
Thanks Kovid!

This is where I am so far...

Code:
import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title                 = u'InstapaperAuto'
    __author__            = 'Darko Miletic'
    publisher             = 'Instapaper.com'
    category              = 'info, custom, Instapaper'
    oldest_article        = 365
    max_articles_per_feed = 100
    auto_cleanup          = True
    reverse_article_order = True

    needs_subscription    = True
    INDEX                 = u'http://www.instapaper.com'
    LOGIN                 = INDEX + u'/user/login'


    feeds          = [
            (u'Instapaper Unread - Pg. 6', u'http://www.instapaper.com/u/6'),
            (u'Instapaper Unread - Pg. 5', u'http://www.instapaper.com/u/5'),
            (u'Instapaper Unread - Pg. 4', u'http://www.instapaper.com/u/4'),
            (u'Instapaper Unread - Pg. 3', u'http://www.instapaper.com/u/3'),
            (u'Instapaper Unread - Pg. 2', u'http://www.instapaper.com/u/2'),
            (u'Instapaper Unread - Pg. 1', u'http://www.instapaper.com/u/1'),
            (u'Instapaper Starred', u'http://www.instapaper.com/starred')
            ]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            for item in soup.findAll('div', attrs={'class':'cornerControls'}):
                description = self.tag_to_string(item.div)
                atag = item.a
                if atag and atag.has_key('href'):
                    url = atag['href']
                    articles.append({'url': url})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    def cleanup(self):
        params = urllib.urlencode(dict(form_key=self.myFormKey, submit="Archive All"))
        self.browser.open("http://www.instapaper.com/bulk-archive", params)

    def print_version(self, url):
        return 'http://www.instapaper.com' + url

    def populate_article_metadata(self, article, soup, first):
        article.title  = soup.find('title').contents[0].strip()

    def postprocess_html(self, soup, first_fetch):
        for link_tag in soup.findAll(attrs={"id" : "story"}):
            link_tag.insert(0,'<h1>'+soup.find('title').contents[0].strip()+'</h1>')

        return soup
This is Darko's recipe that I modified.

Changes:

- I added feeds for 6 unread pages instead of 1. I only have 5 pages, but adding 6 leaves room in case I get more. When I open the file on my Kindle, only 5 sections are displayed, so it omits empty ones. I like 5 sections of 40 articles rather than 1 section of 200 articles.

- Added auto_cleanup = True. This decreased the size of the download from 4 MB to 2.7 MB. There are a lot fewer useless photos.

- Implemented the "Archive All" modification that cendalc/banjopicker created (https://www.mobileread.com/forums/sho...8&postcount=13)

Update - Added "reverse_article_order = True" (Cred: Cendalc) and switched the order of the feeds so that older articles appear first. That way reading can be done in chronological order.


Comments:
I sort of patched this together with trial and error. The parts from def parse_index to the end still confuse me.

I believe the auto-cleanup feature is cleaning the already-created text version that Instapaper generates. Is that true?

If so, how would I go about making the program open the links in the feed and then apply the auto-cleanup directly to the webpages themselves? I find that Instapaper's text feature gives a few too many "Page not available" errors, and that readability is a bit better.

Lastly, the archive-all feature is fine, but is there a way to archive pages as they are opened and packaged? That way, if someone wanted to download only a few articles, their entire collection wouldn't be archived. (A rough, untested idea is sketched below.)
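
Something like this is what I imagine (completely untested; populate_article_metadata does run once per downloaded article, but the '/archive' endpoint and the field names below are pure guesses that would have to be checked against the site's HTML):

Code:
    def populate_article_metadata(self, article, soup, first):
        article.title = soup.find('title').contents[0].strip()
        # Hypothetical per-article archive call: the endpoint and field names
        # are guesses, not confirmed against Instapaper.
        params = urllib.urlencode(dict(form_key=self.myFormKey,
                                       url=article.url))
        self.browser.open('http://www.instapaper.com/archive', params)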

Thanks for any feedback!

(This recipe stuff is cool!)

Last edited by haroldtreen; 09-01-2011 at 05:33 PM. Reason: Added article reverse
Old 08-30-2011, 04:23 PM   #4
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by haroldtreen View Post
The parts from def parse_index to the end still confuse me.
Do you have any specific questions? Is it the content of def parse_index that confuses you or the def statements after that?
Old 08-30-2011, 04:58 PM   #5
haroldtreen
Member
 
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
It's a lot of the variables, such as "lfeeds", "feedobj", and "self" (what's the importance of "self" being enclosed in parse_index()?).

I think that because I only know C, I'm confused about why these things that look like variables aren't declared anywhere. I think it would be useful to have a more complex recipe like this one, with # comments beside things explaining what they do.

And is def used to call a defined function?

I really need to pick up a good book on Python :P.
Old 08-31-2011, 12:34 AM   #6
kovidgoyal
creator of calibre
Posts: 45,196
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Auto-cleanup works on whatever is downloaded. If your recipe is downloading plain text, then it will run on that. If you want to run it on the original, you will need to modify the recipe to download the original HTML. Not being an Instapaper user, I can't tell you exactly how to do that.
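
In general terms, that means having the recipe hand calibre the article's original URL rather than the site's text version, e.g.:

Code:
    def print_version(self, url):
        # Return the original URL unchanged so the raw page is downloaded
        # and auto_cleanup runs on the real HTML.
        return url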
Old 09-01-2011, 05:31 PM   #7
haroldtreen
Member
 
Posts: 10
Karma: 10
Join Date: Aug 2011
Location: Toronto, Canada
Device: Kindle 3
Thanks again Kovid!

So I realized that the code is looking for the HTML tags that hold the info you want to clean. So even though I don't know what lots of the Python means, I can roughly see what is going on.

I changed Darko's code to pull the original URL from the website, which is then cleaned with the auto_cleanup feature.
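
Concretely, the only two changes from the earlier version are which divs parse_index pulls the links from, and print_version no longer prefixing the Instapaper domain:

Code:
    # parse_index now reads links from the 'titleRow' divs (which point at the
    # original article) instead of the 'cornerControls' divs (internal paths),
    # so print_version can pass the URL through untouched:
    def print_version(self, url):
        return url  # previously: return 'http://www.instapaper.com' + url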

With that, I believe this recipe does exactly what I want now.

1) Pull all unread articles from Instapaper
2) Download a readability version of each article
3) Archive all the articles

As of now, the only problems are:

1) Anyone with more than 6 pages of unread articles won't get ALL their articles (though the feeds list can simply be extended; see the sketch at the end of this post).

2) All articles are archived as part of the cleanup. There should be a way to archive each article after its URL is fetched... but that sort of Python is beyond me. If any developer knows how, I would love to see it.

3) Articles downloaded with this recipe seem to have fewer images than before...

I looked at one webpage in three ways to see what might be up.

- When downloaded with the recipe, it has no images.
- When taken from the "text only" feature of Instapaper, it contains multiple images (although many of them weren't meant to be part of the article).
- When taken with readability inside Chrome, it shows correctly with one image.

This is me being a perfectionist, though. As long as all the content gets downloaded, I'm happy.
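
As for problem 1: the feeds list at the top of the recipe can simply be extended with more unread pages, keeping the countdown order so the oldest articles still come first, e.g.:

Code:
    feeds = [
            (u'Instapaper Unread - Pg. 8', u'http://www.instapaper.com/u/8'),
            (u'Instapaper Unread - Pg. 7', u'http://www.instapaper.com/u/7'),
            # ...then Pg. 6 down to Pg. 1 and Starred, exactly as in the recipe above
            ]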

I'm going to post the new recipe in my post above. I'll include # comments so others with no coding background can modify it to their liking.

Cheers!
Old 10-08-2011, 06:06 AM   #8
mojofleur
Junior Member
 
Posts: 7
Karma: 10
Join Date: Oct 2011
Device: Nook Touch
TOC

Hi! This is my first post... I hope my English is good enough, as I'm Italian!

First of all, I'd like to thank haroldtreen for his recipe. It works smoothly!

I just have one question. Since I switched from a Kindle to a Nook Touch, I've run into a simple but annoying problem: the ePub made with this recipe has a two-level TOC, but my Nook only reads the first level.


How can I get my articles as first-level entries in the ePub TOC?
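
(I wondered whether merging everything into a single feed in parse_index would leave the articles at the first level, something like this untested sketch based on the recipe above, but I don't know enough Python to be sure:)

Code:
    def parse_index(self):
        # Merge every Instapaper page into one feed so the TOC has one level.
        all_articles = []
        for feedtitle, feedurl in self.get_feeds():
            soup = self.index_to_soup(feedurl)
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            for item in soup.findAll('div', attrs={'class': 'titleRow'}):
                atag = item.a
                if atag and atag.has_key('href'):
                    all_articles.append({'url': atag['href']})
        return [(u'Instapaper', all_articles)]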

Last edited by mojofleur; 06-23-2012 at 07:21 AM.
Old 03-25-2012, 03:48 AM   #9
bosun120
Member Retired
 
Posts: 23
Karma: 40
Join Date: Sep 2011
Device: Android
Very cool recipe! It doesn't seem to auto-archive the downloaded articles on my Instapaper account, though. Oh well, I can always do it manually from the site.