Quote:
Originally Posted by miwie
Really nice work for "Süddeutsche Magazin"!
Though I cannot give any hints on the question itself, let me suggest the following improvements:
- Use UTF-8 text for metadata (e.g. the title) by prepending the text with 'u' (and use umlauts in the text itself, of course)
- Set correct metadata for language by using something like conversion_options = {'language' : language}
- Set publisher in metadata, e.g. like publisher = u'Magazin Verlagsgesellschaft / Süddeutsche Zeitung mbH / 81677 München'
+Karma!
|
Thanks for the feedback and the karma
I added the conversion options, the publisher and the UTF-8 text for title etc. with Umlauts.
I also took another look at the HTML comments in preprocess_html. The comments were actually still intact when I logged them there; apparently they really do get modified (incorrectly?) somewhere after preprocess_html.
After removing the banner ad, the only comment left was the google_ads one. Removing comments the way the BeautifulSoup documentation suggests did not work; the comments were simply not found. I located and removed them with this code instead:
Code:
comments = next_article.findAll(text=re.compile('google_ad'))
[comment.extract() for comment in comments]
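For what it's worth, the likely reason the regex search works: BeautifulSoup represents HTML comments as Comment nodes, a subclass of NavigableString, so a text search with a regex matches them just like ordinary text. A minimal sketch with the standalone bs4 package (calibre bundles its own older BeautifulSoup, where `findAll` and `text=` are the spellings used above; the HTML below is made up for illustration):

```python
import re
from bs4 import BeautifulSoup, Comment

html = '<div id="artikel"><p>Text</p><!-- google_ad_client = "pub-123"; --></div>'
soup = BeautifulSoup(html, 'html.parser')

# Comments are Comment nodes (a NavigableString subclass), so a
# string search with a regex finds them just like ordinary text:
ads = soup.find_all(string=re.compile('google_ad'))
assert all(isinstance(c, Comment) for c in ads)

# The type-based removal the BeautifulSoup docs suggest:
for c in soup.find_all(string=lambda s: isinstance(s, Comment)):
    c.extract()

print(soup)  # <div id="artikel"><p>Text</p></div>
```

Both searches find the same nodes here; if the type-based variant misses comments in the recipe, it may be due to how calibre's bundled parser handled the page.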
This is my current version.
Spoiler:
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2011, Nikolas Mangold <nmangold at gmail.com>'
'''
sz-magazin.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime
import re

class SueddeutscheZeitungMagazin(BasicNewsRecipe):
    title = u'Süddeutsche Zeitung Magazin'
    __author__ = 'Nikolas Mangold'
    description = u'Süddeutsche Zeitung Magazin'
    publisher = u'Magazin Verlagsgesellschaft / Süddeutsche Zeitung mbH / 81677 München'
    category = 'Germany'
    no_stylesheets = True
    encoding = 'cp1252'
    remove_empty_feeds = True
    delay = 1
    PREFIX = 'http://sz-magazin.sueddeutsche.de'
    INDEX = PREFIX + '/hefte'
    use_embedded_content = False
    masthead_url = 'http://sz-magazin.sueddeutsche.de/img/general/logo.gif'
    language = 'de'
    publication_type = 'magazine'
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif} '
    timefmt = '%W %Y'
    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
        'linearize_tables': True,
    }
    remove_tags_before = dict(attrs={'class': 'vorspann'})
    remove_tags_after = dict(attrs={'id': 'commentsContainer'})
    remove_tags = [
        dict(name='ul', attrs={'class': 'textoptions'}),
        dict(name='div', attrs={'class': 'BannerBug'}),
        dict(name='div', attrs={'id': 'commentsContainer'}),
        dict(name='div', attrs={'class': 'plugin-linkbox'}),
    ]

    def parse_index(self):
        feeds = []
        # determine current issue
        index = self.index_to_soup(self.INDEX)
        year_index = index.find('ul', attrs={'class': 'hefte-jahre'})
        week_index = index.find('ul', attrs={'class': 'heftindex'})
        year = self.tag_to_string(year_index.find('li')).strip()
        tmp = week_index.find('li').a
        week = self.tag_to_string(tmp)
        aktuelles_heft = self.PREFIX + tmp['href']
        # set cover
        self.cover_url = '{0}/img/hefte/thumbs_l/{1}{2}.jpg'.format(self.PREFIX, year, week)
        # find articles and add to main feed
        soup = self.index_to_soup(aktuelles_heft)
        content = soup.find('div', {'id': 'maincontent'})
        mainfeed = 'SZ Magazin {0}/{1}'.format(week, year)
        articles = []
        for article in content.findAll('li'):
            txt = article.find('div', {'class': 'text-holder'})
            if txt is None:
                continue
            link = txt.find('a')
            p = txt.find('p')
            desc = self.tag_to_string(p) if p is not None else ''
            title = self.tag_to_string(link).strip()
            self.log('Found article ', title)
            url = self.PREFIX + link['href']
            articles.append({'title': title, 'date': strftime(self.timefmt), 'url': url, 'description': desc})
        feeds.append((mainfeed, articles))
        return feeds

    def preprocess_html(self, soup):
        # determine if multipage; if not, bail out
        multipage = soup.find('ul', attrs={'class': 'blaettern'})
        if multipage is None:
            return soup
        # collect all subsequent pages, then delete the multipage links
        next_pages = []
        for next in multipage.findAll('li'):
            if next.a is None:
                continue
            nexturl = next.a['href']
            nexttitle = self.tag_to_string(next).strip()
            next_pages.append((self.PREFIX + nexturl, nexttitle))
        multipage.extract()
        # extract the article from each subsequent page and append it
        # to the end of the first page's article
        firstpage_article = soup.find('div', attrs={'id': 'artikel'})
        position = len(firstpage_article.contents)
        offset = 0
        for url, title in next_pages:
            next_soup = self.index_to_soup(url)
            next_article = next_soup.find('div', attrs={'id': 'artikel'})
            # remove banner ad
            banner = next_article.find('div', attrs={'class': 'BannerBug'})
            if banner:
                banner.extract()
            # remove remaining HTML comments (the google_ad scripts)
            comments = next_article.findAll(text=re.compile('google_ad'))
            for comment in comments:
                comment.extract()
            firstpage_article.insert(position + offset, next_article)
            offset += len(next_article.contents)
        return firstpage_article
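For anyone adapting the recipe: parse_index has to return a list of (feed_title, articles) tuples, where each article is a dict with the keys 'title', 'url', 'date', and 'description' — calibre reads the article summary from 'description' specifically. A minimal sketch of that structure (the title, URL, and dates below are made-up placeholders):

```python
# Hypothetical feed structure for illustration; values are placeholders.
articles = [{
    'title': 'Example article',
    'url': 'http://sz-magazin.sueddeutsche.de/texte/anzeigen/12345',
    'date': '07 2011',
    'description': 'Short teaser text',  # calibre reads the summary from this key
}]
feeds = [('SZ Magazin 07/2011', articles)]
print(feeds[0][0])  # SZ Magazin 07/2011
```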
The following could still be done:
- Image galleries still need fixing, but the site uses at least two different markups for them.
- Add blogs and 'Kolumnen'; again, blogs are formatted differently from 'Kolumnen'.
- Remove some extra line breaks.
- Some articles don't display the headline.
I'll take a look at a later time. The recipe is already very usable for me as it is.