08-30-2011, 03:57 PM   #3
haroldtreen
Thanks Kovid!

This is where I am so far...

Code:
import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title                 = u'InstapaperAuto'
    __author__            = 'Darko Miletic'
    publisher             = 'Instapaper.com'
    category              = 'info, custom, Instapaper'
    oldest_article        = 365
    max_articles_per_feed = 100
    auto_cleanup          = True
    reverse_article_order = True

    needs_subscription    = True
    INDEX                 = u'http://www.instapaper.com'
    LOGIN                 = INDEX + u'/user/login'


    feeds          = [
            (u'Instapaper Unread - Pg. 6', u'http://www.instapaper.com/u/6'),
            (u'Instapaper Unread - Pg. 5', u'http://www.instapaper.com/u/5'),
            (u'Instapaper Unread - Pg. 4', u'http://www.instapaper.com/u/4'),
            (u'Instapaper Unread - Pg. 3', u'http://www.instapaper.com/u/3'),
            (u'Instapaper Unread - Pg. 2', u'http://www.instapaper.com/u/2'),
            (u'Instapaper Unread - Pg. 1', u'http://www.instapaper.com/u/1'),
            (u'Instapaper Starred', u'http://www.instapaper.com/starred')
            ]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
               br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            for item in soup.findAll('div', attrs={'class':'cornerControls'}):
                atag = item.a
                if atag and atag.has_key('href'):
                    url = atag['href']
                    # the title gets filled in later by populate_article_metadata
                    articles.append({'url': url})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    def cleanup(self):
        # runs once after the whole download finishes: bulk-archives
        # everything in the account via Instapaper's "Archive All" form
        params = urllib.urlencode(dict(form_key=self.myFormKey, submit="Archive All"))
        self.browser.open("http://www.instapaper.com/bulk-archive", params)

    def print_version(self, url):
        return 'http://www.instapaper.com' + url

    def populate_article_metadata(self, article, soup, first):
        article.title  = soup.find('title').contents[0].strip()

    def postprocess_html(self, soup, first_fetch):
        # prepend the page title as a heading on each article
        for link_tag in soup.findAll(attrs={"id" : "story"}):
            link_tag.insert(0,'<h1>'+soup.find('title').contents[0].strip()+'</h1>')

        return soup
This is Darko's recipe that I modified.

Changes:

- I added feeds for six unread pages instead of one. I only have five pages, but adding a sixth leaves room in case I get more. When I open the file on my Kindle, only five sections are displayed, so empty feeds are omitted. I like five sections of 40 articles rather than one section of 200.

- Added auto_cleanup = True. This decreased the size of the download from 4 MB to 2.7 MB; there are far fewer useless photos.

- Implemented the "Archive All" modification that cendalc/banjopicker created (https://www.mobileread.com/forums/sho...8&postcount=13)

Update - Added "reverse_article_order = True" (credit: cendalc) and switched the order of the feeds so that older articles appear first. That way everything can be read in chronological order.


Comments:
I sort of patched this together through trial and error. Everything from parse_index to the end still confuses me.

I believe the auto_cleanup feature is cleaning the text versions that Instapaper has already created. Is that true?

If so, how would I go about making the program open the original links in the feed and apply the cleanup directly to the web pages themselves? I find that Instapaper's text feature gives a few too many "Page not available" errors, and that readability is a bit better.
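
My first guess at what that would look like is below. It's completely untested, and the 'titleRow' class plus the assumption that its link points at the original page are guesses about Instapaper's markup:

Code:
    def parse_index(self):
        totalfeeds = []
        for feedtitle, feedurl in self.get_feeds():
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            self.myFormKey = soup.find('input', attrs={'name': 'form_key'})['value']
            # grab the outbound link instead of Instapaper's text view,
            # so auto_cleanup runs on the source page itself
            for item in soup.findAll('div', attrs={'class': 'titleRow'}):
                atag = item.find('a')
                if atag and atag.has_key('href'):
                    articles.append({'url': atag['href']})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    def print_version(self, url):
        # the collected URLs are now absolute external links, so no prefix is needed
        return url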

Lastly, the archive-all feature is fine, but is there a way to archive articles as they are opened and packaged? That way, if someone wanted to download only a few articles, their entire collection wouldn't be archived.
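
One idea I might try (again untested): populate_article_metadata runs once per downloaded article, so archiving from there would only touch what actually got fetched, and the bulk-archive cleanup() could be dropped. The '/archive' endpoint and 'article_id' parameter are pure guesses on my part; I don't know what Instapaper's per-article archive form actually posts:

Code:
    def populate_article_metadata(self, article, soup, first):
        article.title = soup.find('title').contents[0].strip()
        # guessed endpoint/params: archive just this one article after download
        params = urllib.urlencode(dict(form_key=self.myFormKey,
                                       article_id=article.url.rstrip('/').split('/')[-1]))
        try:
            self.browser.open('http://www.instapaper.com/archive', params)
        except Exception:
            self.log.warn('Could not archive ' + article.url)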

Thanks for any feedback!

(This recipe stuff is cool!)
