MobileRead Forums - My first recipe (osnews.com)

lydgate · 05-16-2012, 12:32 PM

Hi everyone,

I got a Kindle about a month ago and downloaded Calibre, and so far I really love it. I've been playing with the News recipes and think they're great, but I wanted to be able to start making my own or modifying the built-in ones. I figured the best way to learn was just to make one, so I made one for OSNews, a site I read pretty regularly.

It took a little while to get working, because the print version of OSNews requires a Referer header. I managed to hunt down some code in one of the built-in recipes that shows how to forge one.
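A minimal standalone sketch of the two ideas here, outside Calibre: rewriting a story URL to its print version and attaching a forged Referer header to the request. The URL layout is inferred from the recipe's print_version method, the story URL is made up, and the helper names are hypothetical. It uses the standard library's urllib rather than the mechanize opener from the recipe, and it only builds the request without sending it.

```python
import urllib.request

SITE = "http://www.osnews.com/"


def print_version(url):
    # Swap only the first "/story/" path segment, so a slug that happens
    # to contain the word "story" is left alone.
    return url.replace("/story/", "/print/", 1)


def build_print_request(story_url):
    # Build (but do not send) a request for the print page that carries
    # a Referer header pointing at the site's front page.
    return urllib.request.Request(
        print_version(story_url),
        headers={"Referer": SITE},
    )


# Hypothetical story URL, for illustration only.
req = build_print_request("http://www.osnews.com/story/12345/A_story_about_kernels")
print(req.full_url)
print(req.get_header("Referer"))
```

Limiting the replacement to the "/story/" segment is a small variation on the recipe's url.replace('story','print'), which would also rewrite the word "story" if it appeared later in the URL.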

I figured out how to get rid of the annoying URL text that the site inserts into the article text, using preprocess_regexps.
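To see what such a rule does, here is a rough sketch that applies a preprocess_regexps-style list to a sample string by hand. Each entry is a (compiled pattern, replacement function) pair, which is the shape the recipe uses. The sample HTML and the assumption that the inserted text looks like " [http://...]" are made up for illustration, and apply_rules is a hypothetical helper, not part of Calibre.

```python
import re

# Same shape as a recipe's preprocess_regexps: (compiled regex, callable).
# The pattern assumes the site inserts " [http://...]" after a link.
RULES = [
    (re.compile(r' \[http[^\]]*\]', re.IGNORECASE), lambda m: ''),
]


def apply_rules(html, rules):
    # Run every substitution rule over the raw HTML, in order.
    for pattern, repl in rules:
        html = pattern.sub(repl, html)
    return html


# Made-up sample of what a print page might contain.
sample = ('<p>See <a href="http://example.com/">this page</a>'
          ' [http://example.com/] for details.</p>')
cleaned = apply_rules(sample, RULES)
print(cleaned)
```

Using [^\]]* instead of .* keeps each match inside a single pair of brackets, so two bracketed URLs on one line are removed separately.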

I then had a problem because auto_cleanup annoyingly inserted </p> before <a>, causing unwanted paragraph breaks wherever there was a link. I turned off auto_cleanup and used keep_only_tags instead, and this seemed to work better (I don't know why).

There are still a few issues, though. At first I had it downloading just the most recent RSS feed, and this worked fine, but now I'm trying to download three sections, and I'm not sure how to get them divided up the way many mobis are (e.g. NY Times or Ars Technica).

Also, when I had auto_cleanup on, it caused problems, but it also removed the <a> tags in the title, which I think is better. I'm not sure how to do this myself, though.
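One possible way to drop the link from the title with auto_cleanup off is another regex rule that unwraps <a> tags only inside the title div. This is a sketch under assumptions: the class name printtitle comes from the recipe's keep_only_tags, but the exact markup (a plain <div class="printtitle"> with no nested divs) is a guess, and the sample HTML is made up. The (TITLE_DIV, unlink_title) pair has the same shape as a preprocess_regexps entry.

```python
import re

# Assumed markup: <div class="printtitle"> ... </div> with no nested divs.
TITLE_DIV = re.compile(r'(<div class="printtitle">)(.*?)(</div>)',
                       re.IGNORECASE | re.DOTALL)
# Matches opening and closing <a> tags, but not tags like <abbr>.
ANCHOR_TAGS = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)


def unlink_title(match):
    # Keep the div and the link text; drop only the <a ...> and </a> tags.
    inner = ANCHOR_TAGS.sub('', match.group(2))
    return match.group(1) + inner + match.group(3)


# Made-up sample page: a linked title followed by content with a link.
sample = ('<div class="printtitle">'
          '<a href="http://www.osnews.com/story/1/x">A Title</a></div>'
          '<div class="printcontent">'
          '<p>Body with <a href="http://example.com/">a link</a>.</p></div>')
result = TITLE_DIV.sub(unlink_title, sample)
print(result)
```

Links in the article body are left untouched because the substitution only runs on text captured inside the title div.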

Also, the byline seems to be a bit close to the text. Ideally I'd like it formatted differently, the way it is in the NYT.


Here's the code:

Code:

    import re

    import mechanize
    from calibre.web.feeds.news import BasicNewsRecipe


    class AdvancedUserRecipe1336752090(BasicNewsRecipe):
        title = u'OSNews'
        oldest_article = 7
        max_articles_per_feed = 100
        auto_cleanup = False

        feeds = [(u'Editorials', u'http://www.osnews.com/feed/kind/Editorial'),
                 (u'Features', u'http://www.osnews.com/feed/kind/Feature'),
                 (u'Interviews', u'http://www.osnews.com/feed/kind/Interview')]

        # Strip the bracketed URL text that the site inserts into the article text.
        preprocess_regexps = [(re.compile(r' \[http[^\]]*\]', re.IGNORECASE), lambda m: '')]

        # Keep only the print page's item, title, and content blocks.
        keep_only_tags = [
            dict(name='div', attrs={'class': 'printitem'}),
            dict(name='div', attrs={'class': 'printtitle'}),
            dict(name='div', attrs={'class': 'printcontent'})]

        def get_browser(self):
            # The print version requires a Referer header, so return an opener that sends one.
            br = BasicNewsRecipe.get_browser(self)
            cookies = mechanize.CookieJar()
            br = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
            br.addheaders = [('Referer', 'http://www.osnews.com/')]
            return br

        def print_version(self, url):
            # Fetch the print page instead of the regular story page.
            return url.replace('story', 'print')

Go easy on me, it's my first one!

Last edited by lydgate; 05-16-2012 at 12:34 PM.