Today I updated my first recipe, so I'd appreciate any suggestions.
Improvements
- Instead of fetching only the first lines of every article,
this fetches the whole article
- Fetch time is now much shorter, since only the needed content is fetched
(my recipe: 0:28 minutes, old recipe: 2:33 minutes)
Bugs
A page break is inserted after each converted <h2> tag in the created EPUB:
<div class="mbp_pagebreak"></div>
This puts a page break after each article heading, so the heading
sits alone on the first page and the content starts on the next page.
How can I get rid of it? (I tried changing Calibre's common conversion
options, but they don't seem to affect the news fetch. Is that right?)
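
One thing that might be worth trying (untested, so treat it as a sketch): a recipe can carry its own `conversion_options` dictionary, which overrides the conversion settings for that recipe. The two option names below are Calibre's regular page-break and chapter-mark conversion options; whether they actually remove the `mbp_pagebreak` div produced by the news fetch is an assumption I have not verified.

```python
# Sketch only: in the recipe, this dict would be a class attribute of the
# BasicNewsRecipe subclass. Whether it removes the mbp_pagebreak div is unverified.
conversion_options = {
    'page_breaks_before': '/',  # the XPath '/' disables automatic page breaks
    'chapter_mark': 'none',     # do not mark detected chapters with a page break
}
```

If this has no effect, the div is probably added by the news download itself rather than by the conversion step.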
Also, I guess Calibre can't fetch 'lazy load' images?
Images in the articles are not fetched; only a gray circle shows up,
which is the placeholder used by the lazy-load feature for these images.
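
A possible workaround, as a rough sketch: lazy-load scripts usually keep the real image URL in a separate attribute and put a placeholder in `src`. The helper below copies that URL into `src` using only the standard `re` module. The attribute name `data-original` is a guess (check the page source for the real one), and `unlazy_images` is a made-up helper name, not part of Calibre.

```python
import re

# Hypothetical attribute name; look at the page source to find the real one.
LAZY_ATTR = 'data-original'


def unlazy_images(raw_html, lazy_attr=LAZY_ATTR):
    """Copy the lazy-load URL of each <img> tag into its src attribute."""
    lazy_re = re.compile(r'\b%s\s*=\s*(["\'])(.*?)\1' % re.escape(lazy_attr))

    def fix_tag(match):
        tag = match.group(0)
        lazy = lazy_re.search(tag)
        if lazy is None:
            return tag  # not a lazy-loaded image, leave it untouched
        url = lazy.group(2)
        # Drop the placeholder src (if any), then add the real URL as src.
        tag = re.sub(r'\ssrc\s*=\s*(["\']).*?\1', '', tag)
        return tag[:4] + ' src="%s"' % url + tag[4:]

    return re.sub(r'<img\b[^>]*>', fix_tag, raw_html, flags=re.IGNORECASE)
```

In a recipe this could be called from `preprocess_raw_html(self, raw_html, url)`, returning `unlazy_images(raw_html)`. The `remove_tags` entry that strips images with the `lazyLoad` class would then have to go, otherwise the repaired images are removed again.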
Code:
# Based on TonytheBookworm's original recipe
__license__ = 'GPL v3'
__copyright__ = '2013, Johannes Kopf'
import re
from calibre.web.feeds.news import BasicNewsRecipe
class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = u'How To Geek'
    language = 'en'
    __author__ = 'Johannes Kopf'
    description = 'Daily Computer Tips and Tricks'
    publisher = 'Howtogeek'
    category = 'PC,tips,tricks'
    oldest_article = 2
    max_articles_per_feed = 50
    no_stylesheets = True
    remove_javascript = True
    masthead_url = 'http://blog.stackoverflow.com/wp-content/uploads/how-to-geek-logo.png'
    cover_url = 'http://www.howtogeek.com/geekers/up/sshot4ebc09559ecbf.jpg'
    recursions = 1
    # Fetch only links from howtogeek.com/number
    match_regexps = [r'http://www.howtogeek.com/\d*']
    remove_tags = [
        dict(name='img', attrs={'src': re.compile(r'.*readmore-button.png.*', re.IGNORECASE)}),
        dict(name='img', attrs={'class': re.compile(r'.*lazyLoad.*', re.IGNORECASE)}),
    ]
    remove_tags_before = dict(name='div', attrs={'class': ['thecontent']})
    remove_tags_after = dict(name='div', attrs={'class': ['thecontent']})
    keep_only_tags = [
        dict(name='div', attrs={'class': ['thecontent']}),
        dict(name=['h2', 'h3']),
        dict(name='a', attrs={'href': re.compile(r'.*http://www.howtogeek.com/\d*.*', re.IGNORECASE)}),
    ]
    feeds = [(u'Tips', u'http://feeds.howtogeek.com/howtogeek')]
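
A note on `match_regexps`: `\d*` also matches zero digits, so the pattern probably accepts every link on howtogeek.com, not only the numbered article links. The small check below compares it with a stricter variant that uses `\d+`. The two URLs are made up for illustration, and the check assumes the pattern is applied as a search against each link, which I have not confirmed in Calibre's source.

```python
import re

# Pattern as used in the recipe: \d* also matches an empty run of digits.
loose = re.compile(r'http://www.howtogeek.com/\d*')
# Stricter variant (a suggestion): at least one digit, dots escaped.
strict = re.compile(r'http://www\.howtogeek\.com/\d+')

# Made-up URLs for illustration only.
urls = [
    'http://www.howtogeek.com/12345/some-article/',
    'http://www.howtogeek.com/about/',
]


def matching(pattern, candidates):
    """Return the candidate URLs in which the pattern is found."""
    return [u for u in candidates if pattern.search(u)]
```

Here `matching(loose, urls)` keeps both URLs, while `matching(strict, urls)` keeps only the numbered one.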