Old 04-25-2011, 07:21 AM   #3
aerodynamik
Enthusiast
aerodynamik doesn't litter
 
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
Quote:
Originally Posted by miwie View Post
Really nice work for "Süddeutsche Magazin"!

Though I cannot give any hints to the question itself let me suggest the following improvements:
  • Use of UTF-8 text for metadata (e.g. the title) by prefixing the string with 'u' (and using Umlauts in the text itself, of course)
  • Set correct metadata for language by using something like conversion_options = {'language' : language}
  • Set publisher in metadata, e.g. like publisher = u'Magazin Verlagsgesellschaft / Süddeutsche Zeitung mbH / 81677 München'

+Karma!
Thanks for the feedback and the karma!

I added the conversion options, the publisher, and UTF-8 strings with Umlauts for the title etc.

I also took another look at the HTML comments in preprocess_html. They were actually still intact when I logged them there, so apparently they get modified (incorrectly?) at some point after preprocess_html runs?

After removing the banner ad, the only comment left was the google_ads one. Removing comments the way the BeautifulSoup documentation describes did not work; they simply weren't found. I located and removed them with this code:
Code:
comments = next_article.findAll(text=re.compile('google_ad'))
for comment in comments:
    comment.extract()
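For comparison, here is the approach from the BeautifulSoup documentation, which matches actual Comment nodes instead of text; this is a sketch (I'm assuming calibre's bundled BeautifulSoup exports Comment), and it is the variant that found nothing for me:
Code:
from calibre.ebooks.BeautifulSoup import Comment

# Match actual Comment nodes, as the BeautifulSoup docs describe.
# In my case this returned an empty list, hence the regex above.
comments = next_article.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()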
This is my current version.

Spoiler:
Code:
#!/usr/bin/env python

__license__   = 'GPL v3'
__copyright__ = '2011, Nikolas Mangold <nmangold at gmail.com>'
'''
sz-magazin.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime
import re

class SueddeutscheZeitungMagazin(BasicNewsRecipe):
    title                  = u'Süddeutsche Zeitung Magazin'
    __author__             = 'Nikolas Mangold'
    description            = u'Süddeutsche Zeitung Magazin'
    publisher              = u'Magazin Verlagsgesellschaft / Süddeutsche Zeitung mbH / 81677 München'
    category               = 'Germany'
    no_stylesheets         = True
    encoding               = 'cp1252'
    remove_empty_feeds     = True
    delay                  = 1
    PREFIX                 = 'http://sz-magazin.sueddeutsche.de'
    INDEX                  = PREFIX + '/hefte'
    use_embedded_content   = False
    masthead_url = 'http://sz-magazin.sueddeutsche.de/img/general/logo.gif'
    language               = 'de'
    publication_type       = 'magazine'
    extra_css              = ' body{font-family: Arial,Helvetica,sans-serif} '
    timefmt = '%W %Y'

    conversion_options = {
                          'comment'          : description
                        , 'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : language
                        , 'linearize_tables' : True
                        }

    remove_tags_before =  dict(attrs={'class':'vorspann'})
    remove_tags_after  =  dict(attrs={'id':'commentsContainer'})
    remove_tags = [
        dict(name='ul', attrs={'class':'textoptions'}),
        dict(name='div', attrs={'class':'BannerBug'}),
        dict(name='div', attrs={'id':'commentsContainer'}),
        dict(name='div', attrs={'class':'plugin-linkbox'}),
    ]
        
    def parse_index(self):
        feeds = []

        # determine current issue
        index = self.index_to_soup(self.INDEX)
        year_index = index.find('ul', attrs={'class':'hefte-jahre'})
        week_index = index.find('ul', attrs={'class':'heftindex'})
        year = self.tag_to_string(year_index.find('li')).strip()
        tmp = week_index.find('li').a
        week = self.tag_to_string(tmp)
        aktuelles_heft = self.PREFIX + tmp['href']

        # set cover
        self.cover_url = '{0}/img/hefte/thumbs_l/{1}{2}.jpg'.format(self.PREFIX,year,week)

        # find articles and add to main feed
        soup = self.index_to_soup(aktuelles_heft)
        content = soup.find('div',{'id':'maincontent'})
        mainfeed = 'SZ Magazin {0}/{1}'.format(week, year)
        articles = []
        for article in content.findAll('li'):
            txt = article.find('div',{'class':'text-holder'})
            if txt is None:
                continue
            link = txt.find('a')
            desc = txt.find('p')
            title = self.tag_to_string(link).strip()
            self.log('Found article ', title)
            url = self.PREFIX + link['href']
            description = self.tag_to_string(desc).strip() if desc else ''
            articles.append({'title' : title, 'date' : strftime(self.timefmt), 'url' : url, 'description' : description})
        feeds.append((mainfeed,articles))

        return feeds

    def preprocess_html(self, soup):
        # determine if multipage, if not bail out
        multipage = soup.find('ul',attrs={'class':'blaettern'})
        if multipage is None:
            return soup
        
        # get all subsequent pages and delete multipage links
        next_pages = []
        for page in multipage.findAll('li'):
            if page.a is None:
                continue
            nexturl = page.a['href']
            nexttitle = self.tag_to_string(page).strip()
            next_pages.append((self.PREFIX + nexturl, nexttitle))
        multipage.extract()

        # extract article from subsequent pages and insert at end of first page article
        firstpage_article = soup.find('div',attrs={'id':'artikel'})
        position = len(firstpage_article.contents)
        offset = 0
        for url, title in next_pages:
            next_soup = self.index_to_soup(url)
            next_article = next_soup.find('div',attrs={'id':'artikel'})

            # remove banner ad
            banner = next_article.find('div',attrs={'class':'BannerBug'})
            if banner:
                banner.extract()

            # remove remaining HTML comments
            comments = next_article.findAll(text=re.compile('google_ad'))
            for comment in comments:
                comment.extract()

            firstpage_article.insert(position + offset, next_article)
            offset += len(next_article.contents)

        return firstpage_article


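If anyone wants to try the recipe, it can be run from the command line without adding it to calibre's GUI; the file name here is just an example:
Code:
# --test fetches only a couple of articles per feed for a quick check
ebook-convert sz_magazin.recipe sz_magazin.epub --test -vv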
The following could still be done:
  • Image galleries still need fixing, but the site has at least two different ways of implementing them
  • Add the blogs and the 'Kolumnen' (columns); again, the blogs are formatted differently from the 'Kolumnen'
  • Remove some extra line breaks
  • Fix articles that don't display their headline
I'll take a look at a later time. As it is, this is already very useful to me.