How To Geek - Recipe Update

JoxX · 05-10-2013, 10:53 AM

Today I updated my first recipe, so I'd appreciate any suggestions.

Improvements

Instead of fetching only the first lines of every article,
this fetches the whole articles.
Fetching is now much faster because only the needed content is downloaded:
mine takes 0:28 minutes vs. 2:33 minutes for the old one (roughly 5.5 times faster).

Bugs
There is a page break after each converted <h2> tag in the created EPUB:
<div class="mbp_pagebreak"></div>
How do I get rid of it? (I tried changing Calibre's common conversion options,
but they don't seem to affect the news fetch, or do they?)
The break comes after each article heading, so the heading
sits alone on the first page and the content starts on the next page.

And Calibre can't fetch 'lazy load' images, I guess?
Images in the articles aren't fetched; there is only
a gray circle, the placeholder of the images' 'lazy load' feature.
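
A possible direction for the images, untested against the site: lazy-load scripts usually keep the real image URL in a separate attribute and swap it into src in the browser. Assuming How-To Geek stores it in an attribute called data-src (a hypothetical name, I haven't confirmed it), a helper like the one below could be applied to every img tag from the recipe's preprocess_html(soup) hook. The function name promote_lazy_src is made up for this sketch.

```python
def promote_lazy_src(attrs, lazy_attr='data-src'):
    """Return a copy of an <img> attribute dict with the deferred URL in src.

    lazy_attr is an assumption: the attribute the site's lazy-load script
    reads the real image URL from. Check the page source for the real name.
    """
    fixed = dict(attrs)                    # leave the caller's dict untouched
    real_url = fixed.pop(lazy_attr, None)  # drop the lazy attribute if present
    if real_url:
        fixed['src'] = real_url            # replace the placeholder image URL
    return fixed


# Example with a made-up tag: the gray placeholder is replaced by the real URL.
example = promote_lazy_src({
    'class': 'lazyLoad',
    'src': 'placeholder.png',
    'data-src': 'http://example.com/real-image.jpg',
})
```

For this to have any effect, the remove_tags rule that strips img tags with the lazyLoad class would have to be dropped from the recipe.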

Code:

# Based on TonytheBookworm's original recipe
__license__   = 'GPL v3'
__copyright__ = '2013, Johannes Kopf'

import re

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = u'How To Geek'
    language = 'en'
    __author__ = 'Johannes Kopf'
    description = 'Daily Computer Tips and Tricks'
    publisher = 'Howtogeek'
    category = 'PC,tips,tricks'
    oldest_article = 2
    max_articles_per_feed = 50
    no_stylesheets = True
    remove_javascript = True
    masthead_url = 'http://blog.stackoverflow.com/wp-content/uploads/how-to-geek-logo.png'
    cover_url = 'http://www.howtogeek.com/geekers/up/sshot4ebc09559ecbf.jpg'
    recursions = 1
    # Follow only links of the form howtogeek.com/<number>.
    # \d+ requires at least one digit; \d* would also match zero digits.
    match_regexps = [r'http://www\.howtogeek\.com/\d+']
    # Drop the "read more" button and the lazy-load placeholder images
    remove_tags = [
        dict(name='img', attrs={'src': re.compile(r'.*readmore-button\.png.*', re.IGNORECASE)}),
        dict(name='img', attrs={'class': re.compile(r'.*lazyLoad.*', re.IGNORECASE)})]
    remove_tags_before = dict(name='div', attrs={'class':['thecontent']})
    remove_tags_after = dict(name='div', attrs={'class':['thecontent']})
    # Keep the article body, the headings and the links to full articles
    keep_only_tags = [
        dict(name='div', attrs={'class': ['thecontent']}),
        dict(name=['h2', 'h3']),
        dict(name='a', attrs={'href': re.compile(r'.*http://www\.howtogeek\.com/\d+.*', re.IGNORECASE)})]
    feeds = [(u'Tips', u'http://feeds.howtogeek.com/howtogeek')]
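
A small sketch of how the article-link pattern behaves, assuming Calibre tests match_regexps with a regex search rather than a full match. \d* also matches zero digits, so it accepts any URL on the site, while \d+ needs at least one digit after the slash. The helper name is_article_link and the sample URLs are made up for illustration.

```python
import re

# Loose: the digit run may be empty, so every howtogeek.com URL matches.
LOOSE = re.compile(r'http://www\.howtogeek\.com/\d*')
# Strict: at least one digit must follow the slash.
STRICT = re.compile(r'http://www\.howtogeek\.com/\d+')


def is_article_link(url, pattern=STRICT):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Made-up sample URLs
article = 'http://www.howtogeek.com/12345/some-article/'
other = 'http://www.howtogeek.com/about/'

strict_results = (is_article_link(article), is_article_link(other))
loose_results = (is_article_link(article, LOOSE), is_article_link(other, LOOSE))
```

With the strict pattern only the numbered URL is accepted; with the loose one both are.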
