Old 04-25-2011, 07:21 AM   #3
aerodynamik
Enthusiast
aerodynamik doesn't litter
 
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
Quote:
Originally Posted by miwie View Post
Really nice work for "Süddeutsche Magazin"!

Though I cannot give any hints to the question itself let me suggest the following improvements:
  • Use of UTF-8 text for metadata (e.g. the title) by prefixing the string with 'u' (and using Umlauts in the text itself, of course)
  • Set correct metadata for language by using something like conversion_options = {'language' : language}
  • Set publisher in metadata, e.g. like publisher = u'Magazin Verlagsgesellschaft / Süddeutsche Zeitung mbH / 81677 München'

+Karma!
Thanks for the feedback and the karma!

I added the conversion options, the publisher, and UTF-8 strings with Umlauts for the title etc.

I also took another look at the HTML comments in preprocess_html. They were actually still intact when I logged them there, so apparently they get modified (incorrectly?) at some point after preprocess_html runs?

After removing the banner ad, the only comment left was the google_ads one. Removing comments the way the BeautifulSoup documentation describes did not work; they simply weren't found. I located and removed them with this code:
Code:
comments = next_article.findAll(text=re.compile('google_ad'))
for comment in comments:
    comment.extract()
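For comparison, here is the approach from the BeautifulSoup documentation, which matches actual Comment nodes instead of text; this is a sketch (I'm assuming calibre's bundled BeautifulSoup exports Comment), and it is the variant that found nothing for me:
Code:
from calibre.ebooks.BeautifulSoup import Comment

# Match actual Comment nodes, as the BeautifulSoup docs describe.
# In my case this returned an empty list, hence the regex above.
comments = next_article.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()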
This is my current version.

Spoiler:
Code:
#!/usr/bin/env python

__license__   = 'GPL v3'
__copyright__ = '2011, Nikolas Mangold <nmangold at gmail.com>'
'''
sz-magazin.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime
import re

class SueddeutscheZeitungMagazin(BasicNewsRecipe):
    title                  = u'Süddeutsche Zeitung Magazin'
    __author__             = 'Nikolas Mangold'
    description            = u'Süddeutsche Zeitung Magazin'
    publisher              = u'Magazin Verlagsgesellschaft / Süddeutsche Zeitung mbH / 81677 München'
    category               = 'Germany'
    no_stylesheets         = True
    encoding               = 'cp1252'
    remove_empty_feeds     = True
    delay                  = 1
    PREFIX                 = 'http://sz-magazin.sueddeutsche.de'
    INDEX                  = PREFIX + '/hefte'
    use_embedded_content   = False
    masthead_url = 'http://sz-magazin.sueddeutsche.de/img/general/logo.gif'
    language               = 'de'
    publication_type       = 'magazine'
    extra_css              = ' body{font-family: Arial,Helvetica,sans-serif} '
    timefmt = '%W %Y'

    conversion_options = {
                          'comment'          : description
                        , 'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : language
                        , 'linearize_tables' : True
                        }

    remove_tags_before =  dict(attrs={'class':'vorspann'})
    remove_tags_after  =  dict(attrs={'id':'commentsContainer'})
    remove_tags = [
        dict(name='ul', attrs={'class':'textoptions'}),
        dict(name='div', attrs={'class':'BannerBug'}),
        dict(name='div', attrs={'id':'commentsContainer'}),
        dict(name='div', attrs={'class':'plugin-linkbox'}),
    ]
        
    def parse_index(self):
        feeds = []

        # determine current issue
        index = self.index_to_soup(self.INDEX)
        year_index = index.find('ul', attrs={'class':'hefte-jahre'})
        week_index = index.find('ul', attrs={'class':'heftindex'})
        year = self.tag_to_string(year_index.find('li')).strip()
        tmp = week_index.find('li').a
        week = self.tag_to_string(tmp)
        aktuelles_heft = self.PREFIX + tmp['href']

        # set cover
        self.cover_url = '{0}/img/hefte/thumbs_l/{1}{2}.jpg'.format(self.PREFIX,year,week)

        # find articles and add to main feed
        soup = self.index_to_soup(aktuelles_heft)
        content = soup.find('div',{'id':'maincontent'})
        mainfeed = 'SZ Magazin {0}/{1}'.format(week, year)
        articles = []
        for article in content.findAll('li'):
            txt = article.find('div',{'class':'text-holder'})
            if txt is None:
                continue
            link = txt.find('a')
            desc = txt.find('p')
            title = self.tag_to_string(link).strip()
            self.log('Found article ', title)
            url = self.PREFIX + link['href']
            description = self.tag_to_string(desc).strip() if desc else ''
            articles.append({'title' : title, 'date' : strftime(self.timefmt), 'url' : url, 'description' : description})
        feeds.append((mainfeed,articles))

        return feeds

    def preprocess_html(self, soup):
        # determine if multipage, if not bail out
        multipage = soup.find('ul',attrs={'class':'blaettern'})
        if multipage is None:
            return soup
        
        # get all subsequent pages and delete multipage links
        next_pages = []
        for page in multipage.findAll('li'):
            if page.a is None:
                continue
            nexturl = page.a['href']
            nexttitle = self.tag_to_string(page).strip()
            next_pages.append((self.PREFIX + nexturl, nexttitle))
        multipage.extract()

        # extract article from subsequent pages and insert at end of first page article
        firstpage_article = soup.find('div',attrs={'id':'artikel'})
        position = len(firstpage_article.contents)
        offset = 0
        for url, title in next_pages:
            next_soup = self.index_to_soup(url)
            next_article = next_soup.find('div',attrs={'id':'artikel'})

            # remove banner ad
            banner = next_article.find('div',attrs={'class':'BannerBug'})
            if banner:
                banner.extract()

            # remove remaining HTML comments
            comments = next_article.findAll(text=re.compile('google_ad'))
            for comment in comments:
                comment.extract()

            firstpage_article.insert(position + offset, next_article)
            offset += len(next_article.contents)

        return firstpage_article


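If anyone wants to try the recipe, it can be run from the command line without adding it to calibre's GUI; the file name here is just an example:
Code:
# --test fetches only a couple of articles per feed for a quick check
ebook-convert sz_magazin.recipe sz_magazin.epub --test -vv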
The following could still be done:
  • Image galleries still need fixing, but the site has at least two different ways of implementing them
  • Add the blogs and the 'Kolumnen' (columns); again, the blogs are formatted differently from the 'Kolumnen'
  • Remove some extra line breaks
  • Fix articles that don't display their headline
I'll take a look at a later time. As it is, this is already very useful to me.