MobileRead Forums - View Single Post

thoraxe · 10-11-2011, 05:12 PM

Trying to play with Calibre instead of fighting with the browser on the Kindle, just for giggles.

I'm working through the blogs I follow, starting with http://www.robbwolf.com

Here's the recipe so far:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe

class RobbWolf(BasicNewsRecipe):
    title          = u'Robb Wolf - Paleo Solution'
    __author__     = 'Erik M Jacobs'
    oldest_article = 7                 # days
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False       # follow each item's link instead of using the feed text
    feeds          = [(u'Robb Wolf - Paleo Solution', u'http://feeds.feedburner.com/RobbWolfThePaleoSolution?format=xml')]

    # keep_only_tags takes a list of tag specs, not a bare dict
    keep_only_tags = [dict(id='content')]
    # drop everything after the end-of-post marker
    remove_tags_after = [dict(name='div', attrs={'class':['endpost']})]
    remove_tags = [dict(name='div', attrs={'align':['center']}),
                   dict(name='div', attrs={'class':['postinfo']})]

Main issue I'm having is that the h2 is a link and falls inside of the content, which seems to confuse Calibre. I end up with a single page on the Kindle with just the article title, and then the real article begins on the next page.

Is it possible to use regular expressions in the keep_only_tags / remove_tags / etc. lines?
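For reference, here is a rough sketch of what I mean by regexps in those lines. It assumes the attrs values get handed through to Beautiful Soup's search, which accepts compiled patterns in place of plain strings. I haven't confirmed that it behaves this way for this site. The class names come from the recipe above, and POST_CHROME is just a name I made up for illustration.

```python
import re

# Hypothetical helper name; matches class="postinfo" or class="endpost" exactly.
POST_CHROME = re.compile(r'^(postinfo|endpost)$')

# Tag specs in the same shape as the recipe's remove_tags list,
# with compiled patterns where the recipe uses lists of strings.
remove_tags = [
    dict(name='div', attrs={'class': POST_CHROME}),
    dict(name='div', attrs={'align': re.compile(r'^center$', re.IGNORECASE)}),
]
```

Inside the recipe this list would sit in the class body in place of the current remove_tags assignment. The anchors (^ and $) are there so that a class like "postinfo-wide" would not be caught by accident.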

This is a standard WordPress blog, but only the abstracts are presented. I tried adapting the recipe for Mish's Global Economic Analysis, but I still end up with only the abstracts and no full articles.

Any suggestions here?

Thread: Recipe for "Robb Wolf"