MobileRead Forums - View Single Post - Calibre + Instapaper not downloading all articles!

Dereks · 04-01-2011, 04:30 PM

Ok. I played around a bit and created a recipe that fetches the plain-text versions of all the articles, straight out of Instapaper. Here is the code:

Code:

import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title          = u'Instapaper'
    __author__            = 'Darko Miletic'
    publisher             = 'Instapaper.com'
    category              = 'info, custom, Instapaper'
    oldest_article = 365
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    needs_subscription    = True
    INDEX                 = u'http://www.instapaper.com'
    LOGIN                 = INDEX + u'/user/login'


    feeds          = [(u'Instapaper Unread', u'http://www.instapaper.com/u'), (u'Instapaper Starred', u'http://www.instapaper.com/starred')]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            for item in soup.findAll('div', attrs={'class':'cornerControls'}):
                description = self.tag_to_string(item.div)
                atag = item.a
                if atag and atag.has_key('href'):
                    url         = atag['href']
                    articles.append({
                                     'url'        :url
                                    })
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    def print_version(self, url): 
        return 'http://www.instapaper.com' + url

The only thing that has changed is basically the div tag that wraps the link to the article.
The problem is that this particular tag contains no information about the title, date or description. The latter two are not important to me personally, but the first one is definitely useful.
So if you use the recipe like this, all items in the TOC will be marked as Unknown Article. Even the link itself can't be reused as a title, since Instapaper's links are all numerical values.
Maybe there is a possibility to fetch the title out of the article itself?
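
A minimal sketch of that idea, for illustration only: it pulls the text of the `<title>` element out of an already downloaded article page using just the Python standard library. The names `TitleParser` and `extract_title` are made up for this example, and it assumes the Instapaper text page actually carries the article title in its `<title>` tag, which I have not verified.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text of the first <title> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False  # True while we are inside <title>...</title>
        self.done = False      # True once the first <title> has been closed
        self.parts = []        # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html, default='Unknown Article'):
    """Return the page title with whitespace collapsed, or `default` if none."""
    parser = TitleParser()
    parser.feed(html)
    title = ' '.join(''.join(parser.parts).split())
    return title or default


# Made-up sample page, not real Instapaper markup.
page = '<html><head><title>  My  Saved\n Article </title></head><body><p>x</p></body></html>'
title = extract_title(page)
```

Inside a recipe, the same thing could probably be done with the soup calibre already hands you: as far as I can tell, `BasicNewsRecipe` has a `populate_article_metadata(self, article, soup, first)` hook that is called for each downloaded page, so setting `article.title` there from `soup.find('title')` might be enough. I have not tested that against Instapaper.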

Again, I have next to no knowledge of Python and only a pretty basic understanding of the recipe API. I'm trying my best, but without direction from somebody more experienced it's just random wandering in the woods.