MobileRead Forums - View Single Post - How To Geek

JoxX · 05-10-2013, 09:53 AM

Today I updated my first recipe, so I would appreciate any suggestions.

Improvements

Instead of fetching only the first lines of every article,
this fetches the whole article.
Fetching is now much faster because only the needed content is downloaded:
0:28 minutes with my recipe vs. 2:33 minutes with the old one.

Bugs

A page break appears after each converted <h2> tag in the created EPUB:
<div class="mbp_pagebreak"></div>
How can I get rid of it? (I tried changing Calibre's common conversion
options, but they don't seem to affect the news fetch, or do they?)
As a result, each article heading sits alone on its first page,
and the content starts on the next page.
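
One thing that might be worth trying, as an untested sketch: a recipe can carry its own conversion settings in a conversion_options dictionary, which may apply to news downloads even when the GUI conversion settings do not. The fragment below assumes that the unwanted breaks come from the converter's default rule of breaking pages at h1/h2 headings; if they come from somewhere else, it will not help.

```python
# Hypothetical fragment: in a real recipe this assignment would sit inside
# the recipe class body, next to title, feeds, and the other attributes.
# Assumption: the breaks are produced by the default page-break rule.
conversion_options = {
    # '/' is the XPath value that calibre documents for disabling
    # automatic page breaks (the --page-breaks-before option).
    'page_breaks_before': '/',
}
```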

Also, I guess Calibre can't fetch 'lazy load' images?
The images in an article aren't fetched; only a gray circle shows up,
the placeholder used by the 'lazy load' feature for these images.
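
A common workaround for lazy-loaded images is to copy the real image URL from the lazy-load attribute into src before the page is processed. The sketch below does this on a raw HTML string with the standard re module. The attribute name data-original and the function name promote_lazy_src are assumptions for illustration; the attribute the site actually uses would have to be checked in its page source.

```python
import re

# Assumed attribute name; the real site may store the URL elsewhere.
LAZY_ATTR = 'data-original'


def promote_lazy_src(html, lazy_attr=LAZY_ATTR):
    """Copy the lazy-load URL of each <img> tag into its src attribute."""
    img_tag = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
    # The lookbehind keeps "src" from matching inside names like "data-src".
    lazy = re.compile(r'(?<![\w-])%s\s*=\s*"([^"]*)"' % re.escape(lazy_attr),
                      re.IGNORECASE)
    src = re.compile(r'(?<![\w-])src\s*=\s*"[^"]*"', re.IGNORECASE)

    def fix(match):
        tag = match.group(0)
        found = lazy.search(tag)
        if not found:
            return tag  # not a lazy-loaded image, leave it untouched
        real = 'src="%s"' % found.group(1)
        if src.search(tag):
            # Replace the placeholder src with the real URL.
            return src.sub(lambda m: real, tag, count=1)
        # No src at all: add one right after "<img".
        return tag[:4] + ' ' + real + tag[4:]

    return img_tag.sub(fix, html)
```

In a recipe, a function like this could probably be called from a hook that receives the raw page source, such as preprocess_raw_html, and the remove_tags rule that drops images with the lazyLoad class would have to be removed so the images survive.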

Code:

# Based on TonytheBookworm's original recipe
__license__   = 'GPL v3'
__copyright__ = '2013, Johannes Kopf'

import re

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = u'How To Geek'
    language = 'en'
    __author__ = 'Johannes Kopf'
    description = 'Daily Computer Tips and Tricks'
    publisher = 'Howtogeek'
    category = 'PC,tips,tricks'
    oldest_article = 2
    max_articles_per_feed = 50
    no_stylesheets = True
    remove_javascript = True
    masthead_url = 'http://blog.stackoverflow.com/wp-content/uploads/how-to-geek-logo.png'
    cover_url = 'http://www.howtogeek.com/geekers/up/sshot4ebc09559ecbf.jpg'
    recursions = 1
    # Follow only links of the form howtogeek.com/<number>
    # (\d+ requires at least one digit; the dots are escaped to match literally)
    match_regexps = [r'http://www\.howtogeek\.com/\d+']
    remove_tags = [
        dict(name='img', attrs={'src': re.compile(r'.*readmore-button.png.*', re.IGNORECASE)}),
        dict(name='img', attrs={'class': re.compile(r'.*lazyLoad.*', re.IGNORECASE)}),
    ]
    remove_tags_before = dict(name='div', attrs={'class':['thecontent']})
    remove_tags_after = dict(name='div', attrs={'class':['thecontent']})
    keep_only_tags = [
        dict(name='div', attrs={'class': ['thecontent']}),
        dict(name=['h2', 'h3']),
        # Raw string so that \d is not treated as a string escape
        dict(name='a', attrs={'href': re.compile(r'.*http://www.howtogeek.com/\d*.*', re.IGNORECASE)}),
    ]
    feeds = [(u'Tips', u'http://feeds.howtogeek.com/howtogeek')]
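
A quick way to see which links a match_regexps pattern would accept is to try it on a few sample URLs outside of Calibre. The sketch below assumes the pattern is applied with a search rather than a full match; the two URLs are made up for illustration. It compares a pattern ending in \d* (which also accepts zero digits) with one ending in \d+.

```python
import re

# \d* accepts zero digits, so any howtogeek.com link matches.
LOOSE = re.compile(r'http://www.howtogeek.com/\d*')
# \d+ needs at least one digit after the slash.
STRICT = re.compile(r'http://www\.howtogeek\.com/\d+')

# Hypothetical sample links, not taken from the real feed.
ARTICLE_URL = 'http://www.howtogeek.com/12345/some-article/'
OTHER_URL = 'http://www.howtogeek.com/about/'


def wanted(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


for url in (ARTICLE_URL, OTHER_URL):
    print(url, wanted(LOOSE, url), wanted(STRICT, url))
```

With these samples the loose pattern accepts both links, while the strict one accepts only the numbered article link.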