01-11-2010, 11:19 AM   #1126
LessPaul
Connoisseur
Posts: 50
Karma: 160
Join Date: Jan 2008
Location: Dewitt, MI
Device: Kindle Paperwhite 2021 / PC / iPad
Here is my first attempt at a custom recipe. It is for the German language course feeds at DW-World.de. I will reuse this same recipe to access the DW-World news feeds, but this is the one I completed first.

I do have one small problem. At the top and bottom of every article is a set of unwanted links. The HTML source is:
Code:
<p class="actionFooter"><a href="/dw/article/0,,4529629,00.html">DW-WORLD.DE</a><span>*|*</span><a href="javascript:window.print()">Drucken</a>
</p>
This code occurs at both the top and bottom of the page. Of course the URL number varies from page to page. Note there is a CR between </a> and </p>.

Tips on the best way to eliminate this would be much appreciated. I tried both "remove_tags" and "preprocess_regexps", but in both cases I managed to eliminate not only the offending code but also the entire content of the page. Oops.
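
One possible cause of the whole page vanishing is a greedy pattern such as `.*` combined with `re.DOTALL`, which would match from the first footer to the last `</p>` in the document. Below is a minimal sketch of a narrower, non-greedy pattern, run outside calibre with plain `re`. The sample page (the `<div>` wrapper and the article sentence) and the names `SAMPLE`, `ACTION_FOOTER` and `strip_action_footer` are made up for illustration; only the footer markup comes from the snippet above.

```python
import re

# Footer markup copied from the post. The surrounding <div> and the
# article sentence are invented so there is something to preserve.
FOOTER = (
    '<p class="actionFooter"><a href="/dw/article/0,,4529629,00.html">DW-WORLD.DE</a>'
    '<span>*|*</span><a href="javascript:window.print()">Drucken</a>\n'
    '</p>\n'
)
SAMPLE = '<div>\n' + FOOTER + '<p>Artikeltext bleibt erhalten.</p>\n' + FOOTER + '</div>'

# Non-greedy ".*?" stops at the FIRST closing </p>; DOTALL lets "." cross
# the line break that sits between </a> and </p>.
ACTION_FOOTER = re.compile(r'<p class="actionFooter">.*?</p>', re.DOTALL | re.IGNORECASE)


def strip_action_footer(html):
    """Remove every actionFooter paragraph and leave the rest untouched."""
    return ACTION_FOOTER.sub('', html)


cleaned = strip_action_footer(SAMPLE)
print(cleaned)
```

In a recipe, the same compiled pattern could presumably be used as `preprocess_regexps = [(ACTION_FOOTER, lambda match: '')]`, or the paragraph could be targeted with `remove_tags = [dict(name='p', attrs={'class':'actionFooter'})]`. Neither has been checked against the live site here, so treat both as starting points.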

Thanks much.. Paul

Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2009, Less Paul <LessPaul at gmail.com>'
'''
dw-world.de
'''

from calibre.web.feeds.news import BasicNewsRecipe

class DW_World_courses(BasicNewsRecipe):
    title                 = 'DW-World - German Courses'
    __author__            = 'LessPaul'
    description           = "German language courses and lesson feeds from the multi-language German news site DW-World.de"
    publisher             = 'Deutsche Welle'
    category              = 'German, Language, Education'
    oldest_article        = 30
    max_articles_per_feed = 100
    language              = 'de'
    lang                  = 'de-DE'
    no_stylesheets        = True
    use_embedded_content  = False
    remove_javascript     = True

    conversion_options = { 'tags'             : category,
                           'publisher'        : publisher,
                           'language'         : lang
                         }

    feeds          = [
        (u'Deutsch als Fremdsprache', u'http://rss.dw-world.de/rdf/DKfeed_dkmix_de'),
        (u'Deutsch im Fokus', u'http://rss.dw-world.de/rdf/DKfeed_dif_de'),
        (u'Alltagsdeutsch', u'http://rss.dw-world.de/rdf/DKfeed_alltagsdeutsch_de'),
        (u'Wort der Woche', u'http://rss.dw-world.de/rdf/DKfeed_wortderwoche_de'),
        (u'Sprachbar', u'http://rss.dw-world.de/rdf/DKfeed_sprachbar_de'),
        (u'Stichwort', u'http://rss.dw-world.de/rdf/DKfeed_stichwort_de'),
        (u'Top-Thema mit Vokabeln', u'http://rss.dw-world.de/rdf/DKfeed_topthemamitvokabeln_de'),
        (u'Langsam gesprochene Nachrichten', u'http://rss.dw-world.de/rdf/DKfeed_lgn_de'),
    ]

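    def print_version(self, url):
        # Keep only the last path segment of the article URL
        # (everything after the final '/').
        target = url.rpartition('/')[2]
        # Append that segment to the print-popup path to get the printer-friendly page.
        print_url = 'http://www.dw-world.de/popups/popup_printcontent/' + target
        return print_url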
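
To illustrate what `print_version` does with a URL, here is the same logic as a standalone function that runs without calibre. The article URL is hypothetical: it is the site host joined with the `href` from the footer snippet, and the real feed links may look different.

```python
PRINT_BASE = 'http://www.dw-world.de/popups/popup_printcontent/'


def print_version(url):
    """Standalone copy of the recipe method, for experimenting outside calibre."""
    # rpartition('/') splits at the LAST slash; index 2 is what follows it.
    target = url.rpartition('/')[2]
    return PRINT_BASE + target


# Hypothetical article URL (host + href from the footer snippet).
article_url = 'http://www.dw-world.de/dw/article/0,,4529629,00.html'
print(print_version(article_url))
# http://www.dw-world.de/popups/popup_printcontent/0,,4529629,00.html
```

Note that a URL ending in a trailing slash would leave `target` empty, so the function would return the bare popup path.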