Old 09-18-2011, 07:57 AM   #1
a.peter
Recipe for German newspaper "Berliner Zeitung"

Hello everyone,

I've just finished my first recipe for the German newspaper "Berliner Zeitung". Here it is. If you run into any trouble, feel free to contact me and I'll try to fix it as soon as possible.

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
import re

class BerlinerZeitungRecipe(BasicNewsRecipe):
    __author__    = 'ape'
    __copyright__ = 'ape'
    __license__   = 'GPL v3'
    language      = 'de'
    description   = 'Berliner Zeitung'
    version       = 3
    title         = u'Berliner Zeitung'
    timefmt       = ' [%d.%m.%Y]' 

    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False
    publication_type = 'newspaper'
    
    keep_only_tags = [dict(name='div', attrs={'class':'teaser t_split t_artikel'})]

    INDEX = 'http://www.berlinonline.de/berliner-zeitung/'

    def parse_index(self):
        base = 'http://www.berlinonline.de'
        ressorts = []
        articles = {}
        more = 1
        
        soup = self.index_to_soup(self.INDEX)
        
        # Get the list of links to the sections ("Ressorts") from the index page
        ressort_list = soup.findAll('ul', attrs={'class': re.compile('ressortlist')})
        for ressort in ressort_list[0].findAll('a'):
            feed_title = ressort.string
            self.log('Analyzing', feed_title)
            if feed_title not in articles:
                articles[feed_title] = []
                ressorts.append(feed_title)
            # Load the section page.
            feed = self.index_to_soup(base + ressort['href'])
            # find mainbar div which contains the list of all articles
            for article_container in feed.findAll('div', attrs={'class': re.compile('mainbar')}):
                # iterate over all articles
                for article_teaser in article_container.findAll('div', attrs={'class': 'teaser'}):
                    # extract the short description of the article
                    description = article_teaser.find('div', attrs={'class':'inner'}).p.contents[0]
                    
                    # extract title of article
                    if article_teaser.h3 is not None:
                        article = {'title' : article_teaser.h3.a.string, 'date' : u'', 'url'  : base + article_teaser.h3.a['href'], 'description' : description}
                        articles[feed_title].append(article)
                    else:
                        # Skip teasers for missing photos
                        if 'Foto:' in article_teaser.div.p.contents[0]:
                            continue
                        article = {'title': 'Weitere Artikel ' + str(more), 'date': u'', 'url': base + article_teaser.div.p.a['href'], 'description': u''}
                        articles[feed_title].append(article)
                        more += 1
        answer = [(ressort, articles[ressort]) for ressort in ressorts if ressort in articles]
        # answer structure
        # [('genre1', [{'date': ..., 'url': ..., 'description': ..., 'title': ...},
        #              {'date': ..., 'url': ..., 'description': ..., 'title': ...}]),
        #  ('genre2', [{'date': ..., 'url': ..., 'description': ..., 'title': ...}])]
        # List[Tuple(genre, list[{article}, ...]), Tuple(genre, list[{article}, ...])]
        return answer
        
    def get_masthead_url(self):
        return 'http://www.berlinonline.de/.img/berliner-zeitung/blz_logo.gif'
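
For anyone adapting this, the structure that parse_index() hands back can be tried out on its own, without calibre. The following is only a rough sketch: build_index and make_article are hypothetical helper names that do not exist in the recipe, and the section names, headlines and example.com URLs are made-up placeholders.

```python
# Standalone sketch of the value parse_index() returns in the recipe above.
# build_index() and make_article() are hypothetical helpers for illustration;
# all titles and URLs below are placeholders, not real articles.

def make_article(title, url, description=u''):
    # Same four keys the recipe fills in for every article.
    return {'title': title, 'date': u'', 'url': url, 'description': description}


def build_index(sections, articles):
    """Return [(section_title, [article_dict, ...]), ...] in section order."""
    return [(name, articles[name]) for name in sections if name in articles]


sections = ['Politik', 'Sport']
articles = {
    'Politik': [make_article('Example headline',
                             'http://example.com/politik/1.html',
                             'Teaser text')],
    'Sport': [make_article('Another headline',
                           'http://example.com/sport/1.html')],
}

index = build_index(sections, articles)
for section_title, section_articles in index:
    print(section_title, len(section_articles))
```

Keeping a separate list of section names next to the dictionary is what preserves the order in which the sections appear on the index page.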

Last edited by a.peter; 09-22-2011 at 02:28 PM. Reason: Version 3 of recipe
Old 12-13-2011, 03:02 PM   #2
a.peter
Version 4 of the recipe

Since the Berliner Zeitung has changed its web pages, a new RSS-based recipe was needed. Here it is.

Code:
'''Calibre recipe to convert the RSS feeds of the Berliner Zeitung to an ebook.'''

from calibre.web.feeds.recipes import BasicNewsRecipe

class BerlinerZeitungRecipe(BasicNewsRecipe):
    __author__    = 'a.peter'
    __copyright__ = 'a.peter'
    __license__   = 'GPL v3'
    language      = 'de'
    description   = 'Berliner Zeitung RSS'
    version       = 4
    title         = u'Berliner Zeitung RSS'
    timefmt       = ' [%d.%m.%Y]' 

    #oldest_article = 7.0
    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False
    publication_type = 'newspaper'
    
    remove_tags_before = dict(name='div', attrs={'class':'newstype'})
    remove_tags_after = [dict(id='article_text')]
    
    feeds = [(u'Startseite', u'http://www.berliner-zeitung.de/home/10808950,10808950,view,asFeed.xml'), 
             (u'Politik', u'http://www.berliner-zeitung.de/home/10808018,10808018,view,asFeed.xml'), 
             (u'Wirtschaft', u'http://www.berliner-zeitung.de/home/10808230,10808230,view,asFeed.xml'), 
             (u'Berlin', u'http://www.berliner-zeitung.de/home/10809148,10809148,view,asFeed.xml'), 
             (u'Brandenburg', u'http://www.berliner-zeitung.de/home/10809312,10809312,view,asFeed.xml'), 
             (u'Wissenschaft', u'http://www.berliner-zeitung.de/home/10808894,10808894,view,asFeed.xml'), 
             (u'Digital', u'http://www.berliner-zeitung.de/home/10808718,10808718,view,asFeed.xml'), 
             (u'Kultur', u'http://www.berliner-zeitung.de/home/10809150,10809150,view,asFeed.xml'), 
             (u'Panorama', u'http://www.berliner-zeitung.de/home/10808334,10808334,view,asFeed.xml'), 
             (u'Sport', u'http://www.berliner-zeitung.de/home/10808794,10808794,view,asFeed.xml'), 
             (u'Hertha', u'http://www.berliner-zeitung.de/home/10808800,10808800,view,asFeed.xml'), 
             (u'Union', u'http://www.berliner-zeitung.de/home/10808802,10808802,view,asFeed.xml'), 
             (u'Verkehr', u'http://www.berliner-zeitung.de/home/10809298,10809298,view,asFeed.xml'), 
             (u'Polizei', u'http://www.berliner-zeitung.de/home/10809296,10809296,view,asFeed.xml'), 
             (u'Meinung', u'http://www.berliner-zeitung.de/home/10808020,10808020,view,asFeed.xml')]
    
    def get_masthead_url(self):
        return 'http://www.berliner-zeitung.de/image/view/10810244,7040611,data,logo.png'
        
    def print_version(self, url):
        return url.replace('.html', ',view,printVersion.html')
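
The print_version rewrite can be checked outside calibre with a couple of lines. This is only a sketch: the article URL is a hypothetical placeholder, and print_version_suffix_only is a variant I am assuming might be useful, not something the recipe itself contains.

```python
def print_version(url):
    """Same rewrite as in the recipe: switch an article URL to its print view."""
    return url.replace('.html', ',view,printVersion.html')


def print_version_suffix_only(url):
    """Hypothetical variant: only rewrite a trailing '.html', leave other URLs alone."""
    suffix = '.html'
    if url.endswith(suffix):
        return url[:-len(suffix)] + ',view,printVersion.html'
    return url


# Placeholder article URL, for illustration only.
example = 'http://www.berliner-zeitung.de/politik/example-article.html'
print(print_version(example))
print(print_version_suffix_only(example))
```

str.replace rewrites every occurrence of '.html', so the suffix-only variant is the more careful choice if a URL could ever contain '.html' somewhere in the middle.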