Calibre + Instapaper not downloading all articles! - Page 2

matznet · 01-10-2011, 05:46 PM

I am not sure, actually.
With Leopard, the Instapaper recipe seems to work better: it syncs more than 10 articles, and it also syncs the starred ones. However, it still does not sync all of them.
On the other hand, with Snow Leopard it only syncs 10 articles, no more, no less.

Does the recipe remember which articles it synced last time? If an article remains in the Instapaper unread section, will it be downloaded every day?

Kilgore3K · 03-09-2011, 01:49 PM

To try and narrow this down, I created a custom news source and then piece by piece cut and pasted sections of the script back in (I'm sure there has to be an easier way to debug).

Anyway, long story short: with the last section omitted, it now works perfectly for me, grabbing both unread and starred articles. This is the section I left out:
    def print_version(self, url):
        return self.INDEX + '/text?u=' + urllib.quote(url)
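
For reference, here is a minimal standalone sketch of what that omitted method computes. It assumes Python 3, where `urllib.quote` is available as `urllib.parse.quote`; the function name `text_version_url` and the example address are made up for illustration and are not part of the recipe.

```python
from urllib.parse import quote

# Same site root the recipe stores in INDEX.
INDEX = 'http://www.instapaper.com'


def text_version_url(url):
    """Build the text-view address the omitted print_version pointed at.

    quote() percent-encodes the saved article's address so it can be
    passed as the value of the u= query parameter.
    """
    return INDEX + '/text?u=' + quote(url)


if __name__ == '__main__':
    # Hypothetical saved article address, only for demonstration.
    print(text_version_url('http://example.com/a b'))
```

Without this method the recipe downloads each saved page from its original site rather than asking Instapaper for a text view of it.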

Here is the full script. All credit goes to the original creator of the script, as this is essentially a cut and paste of his work.

Code:

import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title          = u'Instapaper'
    __author__            = 'Darko Miletic'
    publisher             = 'Instapaper.com'
    category              = 'info, custom, Instapaper'
    oldest_article = 365
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    needs_subscription    = True
    INDEX                 = u'http://www.instapaper.com'
    LOGIN                 = INDEX + u'/user/login'



    feeds          = [(u'Instapaper Unread', u'http://www.instapaper.com/u'), (u'Instapaper Starred', u'http://www.instapaper.com/starred')]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            for item in soup.findAll('div', attrs={'class':'titleRow'}):
                description = self.tag_to_string(item.div)
                atag = item.a
                if atag and atag.has_key('href'):
                    url         = atag['href']
                    title       = self.tag_to_string(atag)
                    date        = strftime(self.timefmt)
                    articles.append({
                                      'title'      :title
                                     ,'date'       :date
                                     ,'url'        :url
                                     ,'description':description
                                    })
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

Moderator Notice
Code tags added for readability.

matznet · 03-11-2011, 03:25 AM

Kilgore3K, you might have actually solved it! I've tested it once, and it works. Let's see over the next few days if it keeps working.

zach382 · 03-11-2011, 10:55 PM

You sir are a gentleman and a scholar. Thank you so much.

Gomez · 03-18-2011, 04:03 AM

Cool, works for me too! THX

Kilgore3K · 03-18-2011, 01:39 PM

Glad to be of help, now if I can just find the time to read all the articles I keep saving

Dereks · 03-19-2011, 10:23 AM

great recipe! something that was long needed!
many, many thanks!

abracadabra · 03-28-2011, 07:16 AM

Download works great so far, but is there a way to fetch the text-only-version instead of the saved page in total?

Dereks · 03-30-2011, 10:32 AM

Quote:

Originally Posted by abracadabra

Download works great so far, but is there a way to fetch the text-only-version instead of the saved page in total?

+1. I didn't notice at first that it only fetches the content of the source page directly; I had assumed the recipe accessed the processed text. This greatly diminishes the value of the recipe.

I'm not sure, but I think it's done through the get_article_url function.
Links to the processed text are pretty straightforward: instapaper.com/go/article_id/go
You can see those links in the HTML code of the page; no script is used. So I think it shouldn't be very difficult to amend the recipe's code.

kiklop74 · 03-31-2011, 07:16 AM

You intentionally removed a piece of code that handled text versions of the articles and now complain that it does not work?

kiklop74 · 03-31-2011, 07:19 AM

The real reason the recipe stopped working is that the structure of the site has changed. I'll see to that this week.

Dereks · 03-31-2011, 09:02 AM

I personally didn't remove anything. I've only started using instapaper recently and that recipe was the only option available.

Dereks · 04-01-2011, 03:30 PM

Ok. I played around a bit and created a recipe that fetches the plain-text versions of all the articles, straight out of Instapaper. Here is the code:

Code:

import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title          = u'Instapaper'
    __author__            = 'Darko Miletic'
    publisher             = 'Instapaper.com'
    category              = 'info, custom, Instapaper'
    oldest_article = 365
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    needs_subscription    = True
    INDEX                 = u'http://www.instapaper.com'
    LOGIN                 = INDEX + u'/user/login'


    feeds          = [(u'Instapaper Unread', u'http://www.instapaper.com/u'), (u'Instapaper Starred', u'http://www.instapaper.com/starred')]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            for item in soup.findAll('div', attrs={'class':'cornerControls'}):
                description = self.tag_to_string(item.div)
                atag = item.a
                if atag and atag.has_key('href'):
                    url         = atag['href']
                    articles.append({
                                     'url'        :url
                                    })
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    def print_version(self, url): 
        return 'http://www.instapaper.com' + url

The only thing that has changed is basically the div tag that wraps the link to the article.
The problem is that this particular tag contains no information about the title, date or description. The latter two are not important to me personally, but the first one is definitely useful.
So if you use the recipe like this, all items in the TOC will be marked as Unknown Article. Even the link itself can't be reused as a title, since Instapaper's links are all numerical.
Maybe it is possible to fetch the title out of the article itself?
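
To make the link-collection step easier to experiment with outside calibre, here is a rough standalone sketch of the same idea: take the first link inside each `cornerControls` div and prefix it with the site root, as `parse_index` and `print_version` do together. It assumes Python 3 and uses only the standard library instead of calibre's soup. The sample markup, the `/go/101/text` style paths and the names `CornerLinkParser` and `text_links` are invented for illustration; the real page may look different.

```python
from html.parser import HTMLParser

INDEX = 'http://www.instapaper.com'


class CornerLinkParser(HTMLParser):
    """Collect the first <a href> inside each <div class="cornerControls">."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # div nesting depth inside a cornerControls div
        self.taken = False   # True once the current div has yielded a link
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth:
                self.depth += 1
            elif 'cornerControls' in (attrs.get('class') or '').split():
                self.depth = 1
                self.taken = False
        elif tag == 'a' and self.depth and not self.taken and attrs.get('href'):
            # Relative link, so prefix the site root like print_version does.
            self.links.append(INDEX + attrs['href'])
            self.taken = True

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


def text_links(html):
    parser = CornerLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical markup, only to exercise the parser.
SAMPLE = '''
<div class="cornerControls"><a href="/go/101/text">Text</a><a href="/skip/101">Archive</a></div>
<div class="titleRow"><a href="http://example.com/story">Story</a></div>
<div class="cornerControls"><a href="/go/102/text">Text</a></div>
'''

if __name__ == '__main__':
    for link in text_links(SAMPLE):
        print(link)
```

Note that nothing in these divs carries a title, which is exactly why the TOC entries come out as Unknown Article.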

Again, I have next to no knowledge of Python and only a pretty basic understanding of the recipe API. I'm trying my best, but without direction from somebody more experienced it's just random wandering in the woods.

kovidgoyal · 04-01-2011, 03:32 PM

You can use the populate_article_metadata method to fill in the title from the actual article contents.

Dereks · 04-01-2011, 06:14 PM

Ok. Here is a pretty much usable recipe. It creates newspapers right out of instapaper-processed texts. No omissions of articles should happen (unless the processing changes again).
It's pretty minimalistic: only title in the TOC, no date or article summary, since I do not use them. But I do encourage you to add this metadata or make it better in some other way.

Code:

import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1299694372(BasicNewsRecipe):
    title                 = u'Instapaper'
    __author__            = 'Darko Miletic'
    publisher             = 'Instapaper.com'
    category              = 'info, custom, Instapaper'
    oldest_article        = 365
    max_articles_per_feed = 100
    no_stylesheets        = True
    remove_javascript     = True
    remove_tags           = [
                              dict(name='div', attrs={'id':'text_controls_toggle'})
                             ,dict(name='script')
                             ,dict(name='div', attrs={'id':'text_controls'})
                             ,dict(name='div', attrs={'id':'editing_controls'})
                            ]
    use_embedded_content  = False
    needs_subscription    = True
    INDEX                 = u'http://www.instapaper.com'
    LOGIN                 = INDEX + u'/user/login'


    feeds          = [(u'Instapaper Unread', u'http://www.instapaper.com/u'), (u'Instapaper Starred', u'http://www.instapaper.com/starred')]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0)
            br['username'] = self.username
            if self.password is not None:
                br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            for item in soup.findAll('div', attrs={'class':'cornerControls'}):
                description = self.tag_to_string(item.div)
                atag = item.a
                if atag and atag.has_key('href'):
                    url         = atag['href']
                    articles.append({
                                     'url'        :url
                                    })
            totalfeeds.append((feedtitle, articles))
        return totalfeeds

    def print_version(self, url): 
        return 'http://www.instapaper.com' + url

    def populate_article_metadata(self, article, soup, first):
        article.title  = soup.find('h1').contents[0].strip()
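
One caveat with the title lookup above: if a page has no h1, or the heading begins with a nested tag, that line would raise an error or pick up only part of the text. Below is a hedged standalone sketch of a more defensive lookup, assuming Python 3 and the standard library rather than calibre's soup object. The names `FirstH1` and `article_title` are invented for illustration; inside a recipe the same guard would have to be written against the soup passed to populate_article_metadata.

```python
from html.parser import HTMLParser


class FirstH1(HTMLParser):
    """Gather all text inside the first <h1>, including nested tags."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False    # True once the first h1 has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1' and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1' and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)


def article_title(html, fallback='Unknown Article'):
    """Return the first heading's text, or the fallback if none is found."""
    parser = FirstH1()
    parser.feed(html)
    # Collapse runs of whitespace left over from the page layout.
    title = ' '.join(''.join(parser.parts).split())
    return title or fallback


if __name__ == '__main__':
    print(article_title('<h1> My <em>Great</em> Story </h1><p>text</p>'))
    print(article_title('<p>no heading here</p>'))
```

Falling back to a placeholder keeps one oddly formatted article from aborting the whole download.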
