I'm still having trouble getting a recipe for
http://p.yimg.com/bw/rss/nachrichten/bundeswehr.xml
that is cleared of unnecessary clutter; I am still getting artifacts.
The modified basic news recipe works in principle and removes much of the clutter, but it still includes, among other things, a "ghost" of an ad:
Quote:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1264591440(BasicNewsRecipe):
    title = u'Bundeswehr'
    oldest_article = 7  # days
    max_articles_per_feed = 100
    # Keep only the div with id "content": drop everything before and after it
    remove_tags_after = dict(name='div', attrs={'id':'content'})
    remove_tags_before = dict(name='div', attrs={'id':'content'})
    feeds = [(u'Bundeswehr in AFP und AP', u'http://p.yimg.com/bw/rss/nachrichten/bundeswehr.xml')]
Could anyone jump in with advice?
I want to build a "filtered" recipe that scans several RSS feeds and drops all articles that don't contain certain keywords, so that only news items that do contain those keywords end up in the created e-book. The result would be an instant press review on a certain theme, person, event, etc. Kovidgoyal has confirmed that this is possible with calibre:
Quote:
Originally Posted by kovidgoyal
If you've seen http://bazaar.launchpad.net/~kovid/c.../feeds/news.py
there's not much more I can tell you. Basically, you can completely customize the news download process by overriding the methods of that class. So if you want to create a composite recipe, you would create a parse_index method that lists all the current articles in your various news sources. Then you would override postprocess_html to check for the required keywords and, if they are absent, return None.
but I'm afraid that this is currently beyond my programming/scripting skills. As this would be a rather extensive recipe, I'm hesitant to simply request it in this forum, but could someone post a recipe with a keyword filter so I can learn from the example?
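To make the request concrete, here is a minimal sketch of just the keyword test, in plain Python and not wired into calibre. The names KEYWORDS, contains_keyword and filter_articles are made up for illustration, and the keyword list is only an example. It assumes articles are dicts with 'title', 'url' and 'description' keys, which is the shape a parse_index method is expected to return for each article.

```python
# Hypothetical keyword list; replace with the theme of the press review.
KEYWORDS = ['Bundeswehr', 'Afghanistan']


def contains_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)


def filter_articles(articles, keywords=KEYWORDS):
    """Keep only the article dicts whose title or description mentions a keyword."""
    kept = []
    for art in articles:
        haystack = art.get('title', '') + ' ' + art.get('description', '')
        if contains_keyword(haystack, keywords):
            kept.append(art)
    return kept


# Made-up sample data, only to show the behaviour.
sample = [
    {'title': 'Bundeswehr verlegt Truppen', 'url': 'http://example.com/1', 'description': ''},
    {'title': 'Wetterbericht', 'url': 'http://example.com/2', 'description': 'Regen im Norden'},
]
print([a['title'] for a in filter_articles(sample)])
```

Going by the quoted advice, filter_articles could be applied to each feed's article list inside parse_index, or contains_keyword could be called on the article text inside postprocess_html, returning None when it fails. I have not tested either hook against calibre, so that part remains an assumption.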