UPDATE: Check below for the latest version of the recipe. Feel free to give it a try
Hi,
I'm new to the ebook scene, but I have stumbled across Calibre and it is pretty amazing. Kudos to all the developers!
I've been trying to find a way to have my unread Instapaper articles downloaded, placed into an ebook, and then marked Archived on the Instapaper site. From what I can tell this seems possible through a recipe and the Instapaper APIs.
I have tried Darko's recipe, but right now it only fetches the most recent 40 articles (about the same number that appear on the first page of Instapaper). I was hoping to be able to download all my articles. This recipe also doesn't archive articles after they're downloaded.
I also noticed that Calibre now has auto-clean using readability, which I would prefer over the Instapaper text-only feature.
If it's basic enough that someone could write it for me, that would be great. Otherwise, if anyone could give me clues as to where to start, or Python resources to read, that would be really appreciated too. I'd love to learn.
I have a little experience with programming (C, HTML), but nothing with Python.
Thanks in advance for any help!
Newest Recipe (01.09.2011)
Code:
import urllib

from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title                 = u'Instapaper Recipe'
    __author__            = 'Darko Miletic'
    publisher             = 'Instapaper.com'
    category              = 'info, custom, Instapaper'
    oldest_article        = 365
    max_articles_per_feed = 100
    auto_cleanup          = True
    # Download the articles in reverse order so that the oldest appear first.
    reverse_article_order = True
    needs_subscription    = True
    INDEX                 = u'http://www.instapaper.com'
    LOGIN                 = INDEX + u'/user/login'

    # Six pages of articles are fetched to ensure none are missed
    # (6 pages = 240 articles). Page order is reversed so that the
    # oldest articles are downloaded first.
    feeds = [
        (u'Instapaper Unread - Pg. 6', u'http://www.instapaper.com/u/6'),
        (u'Instapaper Unread - Pg. 5', u'http://www.instapaper.com/u/5'),
        (u'Instapaper Unread - Pg. 4', u'http://www.instapaper.com/u/4'),
        (u'Instapaper Unread - Pg. 3', u'http://www.instapaper.com/u/3'),
        (u'Instapaper Unread - Pg. 2', u'http://www.instapaper.com/u/2'),
        (u'Instapaper Unread - Pg. 1', u'http://www.instapaper.com/u/1'),
        (u'Instapaper Starred', u'http://www.instapaper.com/starred')
    ]

    def get_browser(self):
        # Log in to Instapaper so the unread/starred pages are accessible.
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed') + ' %s...' % (feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            # Save the page's form_key so the optional cleanup() below
            # can submit the bulk-archive form.
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            for item in soup.findAll('div', attrs={'class': 'titleRow'}):
                atag = item.a
                if atag and atag.has_key('href'):
                    articles.append({'url': atag['href']})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    # Delete the "#"s to have the recipe archive all your unread
    # articles after downloading.
    #def cleanup(self):
    #    params = urllib.urlencode(dict(form_key=self.myFormKey, submit="Archive All"))
    #    self.browser.open("http://www.instapaper.com/bulk-archive", params)

    def print_version(self, url):
        return url

    def populate_article_metadata(self, article, soup, first):
        article.title = soup.find('title').contents[0].strip()

    def postprocess_html(self, soup, first_fetch):
        # Insert the article title as an <h1> at the top of the story body.
        for link_tag in soup.findAll(attrs={"id": "story"}):
            link_tag.insert(0, '<h1>' + soup.find('title').contents[0].strip() + '</h1>')
        return soup
This is a modified version of Darko's original Instapaper recipe.
Changes:
- Multiple pages of articles downloaded, not just the first 40.
- Article order is reversed so oldest articles appear first. (Cred. Cendalc)
- Ability to have all articles archived after they are downloaded. This stops Calibre from downloading the same articles over and over. (Delete "#"s to enable) (Cred. Cendalc, Banjopicker)
- Original web content is downloaded and simplified with readability, rather than using Instapaper's text-only feature. This works better in my experience: no more problems with some webpages failing to download.
Known Bugs:
- Fewer images are downloaded than before. (Some may prefer this, as it saves space...)
- All unread articles are archived, rather than just the ones that were actually downloaded.
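One possible direction for the second bug, in case anyone wants to experiment: instead of the bulk-archive POST, parse_index() could record the numeric ID of each article it actually queues, and a cleanup() could then archive only those. The helper below is just a sketch of the ID-extraction half; it assumes unread links look like /read/&lt;number&gt; (which is what I see on my unread pages), and the per-article archive endpoint itself would still need to be worked out, since I haven't confirmed it.

```python
import re

def extract_article_id(href):
    """Pull the numeric article ID out of an Instapaper 'read' link,
    e.g. '/read/123456789' -> '123456789'. Returns None for links
    that are not read links (e.g. '/starred')."""
    m = re.search(r'/read/(\d+)', href)
    return m.group(1) if m else None

# Example: collect the IDs of the articles actually queued, skipping
# any hrefs that don't carry an ID.
hrefs = ['/read/111', '/starred', '/read/222']
downloaded_ids = [i for i in (extract_article_id(h) for h in hrefs) if i]
```

A cleanup() could then loop over downloaded_ids and POST each one (together with self.myFormKey) to whatever per-article archive URL Instapaper exposes; that endpoint is the part I haven't verified.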
Test it and give me feedback. I have no Python experience, so this might be messy :P.
Thanks!