Here is my first attempt at a custom recipe. It is for the German language course feeds at DW-World.de. I will reuse this same recipe to access the DW-World news feeds, but this is the one I completed first.
I do have one small problem. At the top and bottom of every article is a set of (unwanted) links. The HTML source is:
Code:
<p class="actionFooter"><a href="/dw/article/0,,4529629,00.html">DW-WORLD.DE</a><span>*|*</span><a href="javascript:window.print()">Drucken</a>
</p>
This code occurs at both the top and bottom of the page. Of course the URL number varies from page to page. Note there is a CR between </a> and </p>.
Tips on the best way to eliminate this would be much appreciated. I tried both "remove_tags" and "preprocess_regexps", but in both cases I eliminated not only the offending code but the entire content of the page. Oops.
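For reference, one possible cause (an assumption, since the failing patterns are not shown here) is a greedy ".*" combined with DOTALL: it runs from the first actionFooter paragraph to the last </p> on the page and so swallows the article in between. The sketch below compares a greedy and a non-greedy pattern using only Python's re module. The sample article text and the names FOOTER, SAMPLE, GREEDY, LAZY and strip_footer are made up for illustration, and nothing here has been tested against the live site.

```python
import re

# The footer block as it appears in the page source, including the line
# break between </a> and </p>.
FOOTER = (
    '<p class="actionFooter"><a href="/dw/article/0,,4529629,00.html">DW-WORLD.DE</a>'
    '<span>*|*</span><a href="javascript:window.print()">Drucken</a>\n</p>'
)

# Hypothetical page: footer block, one paragraph of article text, footer block.
SAMPLE = FOOTER + '\n<p>Hier steht der Artikeltext.</p>\n' + FOOTER

# Greedy: ".*" runs on to the LAST </p> in the page.
GREEDY = re.compile(r'<p class="actionFooter">.*</p>', re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST </p> after each opening tag.
# DOTALL is still needed so that "." can cross the line break before </p>.
LAZY = re.compile(r'<p class="actionFooter">.*?</p>', re.DOTALL | re.IGNORECASE)


def strip_footer(html, pattern=LAZY):
    """Return html with every match of pattern removed."""
    return pattern.sub('', html)


print(repr(strip_footer(SAMPLE, GREEDY)))  # whole page is gone
print(repr(strip_footer(SAMPLE)))          # only the two footer blocks are gone

# Inside the recipe class, either of these attributes would be the
# equivalent (calibre itself is not imported in this sketch):
#   preprocess_regexps = [(LAZY, lambda match: '')]
#   remove_tags = [dict(name='p', attrs={'class': 'actionFooter'})]
```

If the remove_tags form also wipes the page, it may be worth checking whether the print version wraps the article body in an element that carries the same class.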
Thanks much.. Paul
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2009, Less Paul <LessPaul at gmail.com>'
'''
dw-world.de
'''
from calibre.web.feeds.news import BasicNewsRecipe


class DW_World_courses(BasicNewsRecipe):
    title = 'DW-World - German Courses'
    __author__ = 'LessPaul'
    description = "German language courses and lesson feeds from the multi-language German news site DW-World.de"
    publisher = 'Deutsche Welle'
    category = 'German, Language, Education'
    oldest_article = 30
    max_articles_per_feed = 100
    language = 'de'
    lang = 'de-DE'
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    conversion_options = {
        'tags': category,
        'publisher': publisher,
        'language': lang
    }
    feeds = [
        (u'Deutsch als Fremdsprache', u'http://rss.dw-world.de/rdf/DKfeed_dkmix_de'),
        (u'Deutsch im Fokus', u'http://rss.dw-world.de/rdf/DKfeed_dif_de'),
        (u'Alltagsdeutsch', u'http://rss.dw-world.de/rdf/DKfeed_alltagsdeutsch_de'),
        (u'Wort der Woche', u'http://rss.dw-world.de/rdf/DKfeed_wortderwoche_de'),
        (u'Sprachbar', u'http://rss.dw-world.de/rdf/DKfeed_sprachbar_de'),
        (u'Stichwort', u'http://rss.dw-world.de/rdf/DKfeed_stichwort_de'),
        (u'Top-Thema mit Vokabeln', u'http://rss.dw-world.de/rdf/DKfeed_topthemamitvokabeln_de'),
        (u'Langsam gesprochene Nachrichten', u'http://rss.dw-world.de/rdf/DKfeed_lgn_de'),
    ]

    def print_version(self, url):
        target = url.rpartition('/')[2]
        print_url = 'http://www.dw-world.de/popups/popup_printcontent/' + target
        return print_url
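
To make the print_version step concrete, here is a small standalone sketch of the same URL rewrite, written as a plain function so it runs without calibre. The full article URL is an assumption: the page source above only shows the relative path /dw/article/0,,4529629,00.html, so the www.dw-world.de host is filled in for illustration.

```python
def print_version(url):
    # Keep only the part after the last slash, e.g. '0,,4529629,00.html'.
    target = url.rpartition('/')[2]
    # Append it to the print-popup path used in the recipe.
    return 'http://www.dw-world.de/popups/popup_printcontent/' + target


# Assumed absolute form of the article link seen in the page source.
example = 'http://www.dw-world.de/dw/article/0,,4529629,00.html'
print(print_version(example))
# prints http://www.dw-world.de/popups/popup_printcontent/0,,4529629,00.html
```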