Old 05-16-2012, 11:32 AM   #1
lydgate
My first recipe (osnews.com)

Hi everyone,

I got a Kindle about a month ago and downloaded Calibre, and so far I really love it. I've been playing with the News recipes and think they're great, but I wanted to be able to make my own or modify the built-in ones. I figured the best way to learn was just to make one, so I made one for OSNews, a site I read pretty regularly.

It took a little while to get working, because the print version of OSNews requires a Referer header. I managed to hunt down some code in one of the built-in recipes that shows how to forge this.
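
To show the idea outside calibre, here is a minimal standard-library sketch of attaching a forged Referer to a request. The helper name `build_print_request` and the article URL are made up for illustration, and nothing is actually sent over the network.

```python
import urllib.request

REFERER = 'http://www.osnews.com/'


def build_print_request(url, referer=REFERER):
    """Return a Request for `url` that carries a forged Referer header."""
    req = urllib.request.Request(url)
    # The print pages reject requests that do not appear to come from the site
    req.add_header('Referer', referer)
    return req


# Hypothetical article URL, used only to build (not send) a request
req = build_print_request('http://www.osnews.com/print/12345/example')
print(req.get_header('Referer'))
```

Inside a recipe the same header is set on the browser object returned by get_browser instead.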

I figured out how to get rid of the annoying bracketed URLs that the site inserts into the article text, using preprocess_regexps.
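
One thing worth checking with this kind of pattern is greediness. The sketch below assumes the print version renders links as "text [http://...]" (the sample string is invented) and compares a greedy `.*` with a character class that stops at the first closing bracket.

```python
import re

# Greedy: .* runs to the LAST ']' on the line
greedy = re.compile(r' \[http.*\]', re.IGNORECASE)
# Bounded: [^\]]* stops at the FIRST ']' after the URL
bounded = re.compile(r' \[http[^\]]*\]', re.IGNORECASE)

# Invented sample with two links on one line
sample = ('Read the <a href="#">review</a> [http://example.com/a] and the '
          '<a href="#">FAQ</a> [http://example.com/b] too.')

greedy_cleaned = greedy.sub('', sample)
cleaned = bounded.sub('', sample)

print(greedy_cleaned)  # Read the <a href="#">review</a> too.
print(cleaned)         # Read the <a href="#">review</a> and the <a href="#">FAQ</a> too.
```

With two links on the same line, the greedy form also deletes everything between them, so the bounded form is probably the safer choice.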

I then had a problem because auto_cleanup annoyingly inserted </p> before <a>, causing unwanted paragraph breaks whenever there was a link. I turned off auto_cleanup and used keep_only_tags instead, and this seemed to work better (I don't know why).

There are still a few issues, though. At first I had it downloading just the most recent RSS feed, and this worked fine, but now I'm trying to download three sections, and I'm not sure how to get them divided up the way many mobis are (e.g. NY Times or Ars Technica).

Also, although auto_cleanup caused problems when I had it on, it did remove the <a> tags in the title, which I think is better. I'm not sure how to do this without it, though.

Also, the byline seems to be a bit close to the text. Ideally I'd like it formatted differently, the way it is in the NYT.

Here's the code:
Code:
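import re

import mechanize
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1336752090(BasicNewsRecipe):
    title = u'OSNews'
    oldest_article = 7  # days
    max_articles_per_feed = 100
    # auto_cleanup inserted </p> before <a>, so the content is selected by hand
    auto_cleanup = False

    feeds = [(u'Editorials', u'http://www.osnews.com/feed/kind/Editorial'),
             (u'Features', u'http://www.osnews.com/feed/kind/Feature'),
             (u'Interviews', u'http://www.osnews.com/feed/kind/Interview')]

    # Strip the " [http...]" text the site inserts into the article body.
    # [^\]]* stops at the first ']' so text between two links is not removed.
    preprocess_regexps = [(re.compile(r' \[http[^\]]*\]', re.IGNORECASE), lambda m: '')]

    keep_only_tags = [dict(name='div', attrs={'class': 'printitem'}),
                      dict(name='div', attrs={'class': 'printtitle'}),
                      dict(name='div', attrs={'class': 'printcontent'})]

    def get_browser(self):
        # The print version requires a Referer, so use a bare mechanize opener
        # (in place of calibre's default browser) that sends one with each request
        cookies = mechanize.CookieJar()
        br = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
        br.addheaders = [('Referer', 'http://www.osnews.com/')]
        return br

    def print_version(self, url):
        # Swap only the path segment, so 'story' inside a headline is left alone
        return url.replace('/story/', '/print/', 1)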
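
A quick check of the story-to-print rewrite, as a standalone sketch. The article URL is invented and the function name `to_print_url` is just for illustration. It assumes article links contain a /story/ path segment.

```python
def to_print_url(url):
    """Map an article URL to its print version by swapping the path segment."""
    return url.replace('/story/', '/print/', 1)


# Invented URL whose headline also contains the word "story"
url = 'http://www.osnews.com/story/12345/A_story_about_kernels'

print(to_print_url(url))
# A plain replace of every 'story' would also rewrite the headline part
print(url.replace('story', 'print'))
```

Matching on the slashes keeps the headline part of the URL intact when it happens to contain the word "story".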


Go easy on me, it's my first one!

Last edited by lydgate; 05-16-2012 at 11:34 AM.