#1 |
Member
Posts: 14
Karma: 10
Join Date: Jan 2012
Device: Sony PRS-T1
Die Presse recipe
I'm having issues with the recipe for "Die Presse" (an Austrian newspaper). Special characters are not displayed correctly: "ü" shows up as "Ã¼", "ß" as "ÃŸ", and so forth.
To be honest, I do not fully understand the existing recipe code (see here), but simply removing everything that seemed to be related to the encoding fixed the problem. I also added a "remove_tags_after" entry to get rid of the page footer. Code:

    #!/usr/bin/env python2
    # vim:fileencoding=utf-8
    from __future__ import unicode_literals, division, absolute_import, print_function

    __license__ = 'GPL v3'
    __copyright__ = '2009, Gerhard Aigner <gerhard.aigner at gmail.com>'

    ''' http://www.diepresse.at - Austrian Newspaper '''

    import re
    from calibre.web.feeds.news import BasicNewsRecipe


    class DiePresseRecipe(BasicNewsRecipe):
        title = 'Die Presse'
        __author__ = 'Gerhard Aigner'
        description = 'DiePresse.com - Die Online-Ausgabe der Österreichischen Tageszeitung Die Presse.'
        publisher = 'Die Presse Verlags-Gesellschaft m.b.H. Co KG'
        language = 'de_AT'
        category = 'news, politics, nachrichten, Austria'
        use_embedded_content = False
        remove_empty_feeds = True
        no_stylesheets = True
        recursions = 0
        oldest_article = 1
        max_articles_per_feed = 100

        html2lrf_options = [
            '--comment', description,
            '--category', category,
            '--publisher', publisher
        ]

        html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"'

        preprocess_regexps = [
            (re.compile(r'Textversion', re.DOTALL), lambda match: ''),
        ]

        remove_tags = [
            dict(name='hr'),
            dict(name='br'),
            dict(name='small'),
            dict(name='img'),
            dict(name='div', attrs={'class':'textnavi'}),
            dict(name='h1', attrs={'class':'titel'}),
            dict(name='a', attrs={'class':'print'}),
            dict(name='div', attrs={'class':'hline'})
        ]

        remove_tags_after = [
            dict(name='div', attrs={'class':'articletext'})
        ]

        feeds = [
            ('Politik', 'http://diepresse.com/rss/Politik'),
            ('Wirtschaft', 'http://diepresse.com/rss/Wirtschaft'),
            ('Europa', 'http://diepresse.com/rss/EU'),
            ('Panorama', 'http://diepresse.com/rss/Panorama'),
            ('Sport', 'http://diepresse.com/rss/Sport'),
            ('Kultur', 'http://diepresse.com/rss/Kultur'),
            ('Leben', 'http://diepresse.com/rss/Leben'),
            ('Tech', 'http://diepresse.com/rss/Tech'),
            ('Wissenschaft', 'http://diepresse.com/rss/Science'),
            ('Bildung', 'http://diepresse.com/rss/Bildung'),
            ('Gesundheit', 'http://diepresse.com/rss/Gesundheit'),
            ('Recht', 'http://diepresse.com/rss/Recht'),
            ('Spectrum', 'http://diepresse.com/rss/Spectrum'),
            ('Meinung', 'http://diepresse.com/rss/Meinung')
        ]

        def print_version(self, url):
            return url.replace('home','text/home')
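
For anyone wondering why dropping the encoding-related lines might help: the garbled characters look like UTF-8 text decoded with a single-byte codec. The snippet below is only a sketch that reproduces the symptom outside calibre. It assumes the site now serves UTF-8 and that the old recipe forced a single-byte encoding; I have not verified either, and `cp1252` as well as the helper names `misdecode` and `repair` are purely illustrative choices.

```python
# Sketch only: reproduces the mojibake symptom with plain Python 3.
# Assumptions (not verified against the site or the old recipe):
#   - the page bytes are UTF-8
#   - something decoded them with a single-byte codec (cp1252 used here)


def misdecode(text, wrong_encoding='cp1252'):
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode('utf-8').decode(wrong_encoding)


def repair(garbled, wrong_encoding='cp1252'):
    """Reverse the mistake: back to the raw bytes, then decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode('utf-8')


if __name__ == '__main__':
    for ch in ('ü', 'ß', 'ä'):
        bad = misdecode(ch)
        # Show the original, the garbled form, and the round-tripped form.
        print(repr(ch), '->', repr(bad), '->', repr(repair(bad)))
```

If that assumption holds, leaving the encoding unset and letting calibre detect it would avoid the wrong decode in the first place, which matches what I saw after removing those lines.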