Enhanced brand eins recipe


siebert
11-21-2010, 07:37 AM
Hi,

I took the liberty of enhancing the existing brand eins recipe.

Here is my changelog:
NEW: The issue to download can be selected via the username field.
NEW: Add cover image.
NEW: Prevent the conversion date from being appended to the title.
NEW: Remove "This article was downloaded by calibre from..." section from bottom of each page.
FIXED: "brand eins" is written in lowercase.
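
A rough sketch of how the username-based issue selection from the changelog could work in isolation. The helper name `parse_issue_number` is made up for illustration and is not part of calibre or the recipe; the recipe itself simply tries `int(self.username)` and keeps the default on failure. Rejecting values below 1 is an extra assumption added here.

```python
def parse_issue_number(username, default_issue=2):
    """Return the issue offset typed into the username field, or the default.

    Hypothetical helper. Offsets follow the recipe's convention:
    1 is the newest (incomplete) issue, 2 the last complete one,
    3 the one before that, and so on.
    """
    if not username:
        return default_issue
    try:
        issue = int(username)
    except ValueError:
        # Non-numeric input (e.g. a real user name) falls back to the default.
        return default_issue
    # Assumption: offsets below 1 are meaningless, so they fall back as well.
    return issue if issue >= 1 else default_issue


print(parse_issue_number("3"))        # 3
print(parse_issue_number(""))         # 2
print(parse_issue_number("steffen"))  # 2
```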

And here is the recipe:
#!/usr/bin/env python
# -*- coding: utf-8 mode: python -*-

__license__ = 'GPL v3'
__copyright__ = '2010, Constantin Hofstetter <consti at consti.de>, Steffen Siebert <calibre at steffensiebert.de>'
__version__ = '0.96'

''' http://brandeins.de - Wirtschaftsmagazin '''
import re
import string
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.web.feeds.templates import Template, CLASS
from lxml.html.builder import HTML, HEAD, TITLE, STYLE, DIV, BODY, BR, A, HR, UL

class MyNavBarTemplate(Template):
    """
    Same as calibre.web.feeds.templates.NavBarTemplate but without the
    'This article was downloaded by calibre from...'
    text at the bottom.
    """

    def _generate(self, bottom, feed, art, number_of_articles_in_feed,
                  two_levels, url, __appname__, prefix='', center=True,
                  extra_css=None, style=None):
        head = HEAD(TITLE('navbar'))
        if style:
            head.append(STYLE(style, type='text/css'))
        if extra_css:
            head.append(STYLE(extra_css, type='text/css'))

        if prefix and not prefix.endswith('/'):
            prefix += '/'
        align = 'center' if center else 'left'

        navbar = DIV(CLASS('calibre_navbar', 'calibre_rescale_70',
                           style='text-align:'+align))
        if bottom:
            if not url.startswith('file://'):
                navbar.append(HR())
        else:
            next = 'feed_%d'%(feed+1) if art == number_of_articles_in_feed - 1 \
                else 'article_%d'%(art+1)
            up = '../..' if art == number_of_articles_in_feed - 1 else '..'
            href = '%s%s/%s/index.html'%(prefix, up, next)
            navbar.text = '| '
            navbar.append(A('Next', href=href))
        href = '%s../index.html#article_%d'%(prefix, art)
        navbar.iterchildren(reversed=True).next().tail = ' | '
        navbar.append(A('Section Menu', href=href))
        href = '%s../../index.html#feed_%d'%(prefix, feed)
        navbar.iterchildren(reversed=True).next().tail = ' | '
        navbar.append(A('Main Menu', href=href))
        if art > 0 and not bottom:
            href = '%s../article_%d/index.html'%(prefix, art-1)
            navbar.iterchildren(reversed=True).next().tail = ' | '
            navbar.append(A('Previous', href=href))
        navbar.iterchildren(reversed=True).next().tail = ' | '
        if not bottom:
            navbar.append(HR())

        self.root = HTML(head, BODY(navbar))

class BrandEins(BasicNewsRecipe):

    title = u'brand eins'
    __author__ = 'Constantin Hofstetter'
    description = u'Wirtschaftsmagazin'
    publisher = 'brandeins.de'
    category = 'politics, business, wirtschaft, Germany'
    use_embedded_content = False
    lang = 'de-DE'
    no_stylesheets = True
    encoding = 'utf-8'
    language = 'de'
    publication_type = 'magazine'
    needs_subscription = True
    # Prevent the conversion date from being appended to the title
    timefmt = ''

    # 2 is the last full magazine (default)
    # 1 is the newest (but not full)
    # 3 is one before 2 etc.
    # This value can be set via the username field.
    default_issue = 2

    keep_only_tags = [
        dict(name='div', attrs={'id':'theContent'}),
        dict(name='div', attrs={'id':'sidebar'}),
        dict(name='div', attrs={'class':'intro'}),
        dict(name='p', attrs={'class':'bodytext'}),
        dict(name='div', attrs={'class':'single_image'}),
    ]

    '''
    brandeins.de
    '''

    def __init__(self, options, log, progress_reporter):
        """ Constructor. """
        BasicNewsRecipe.__init__(self, options, log, progress_reporter)
        self.navbar = MyNavBarTemplate()

    def postprocess_html(self, soup, first):

        # Move the image of the sidebar right below the h3
        first_h3 = soup.find(name='div', attrs={'id':'theContent'}).find('h3')
        for imgdiv in soup.findAll(name='div', attrs={'class':'single_image'}):
            if len(first_h3.findNextSiblings('div', {'class':'intro'})) >= 1:
                # first_h3.parent.insert(2, imgdiv)
                first_h3.findNextSiblings('div', {'class':'intro'})[0].parent.insert(4, imgdiv)
            else:
                first_h3.parent.insert(2, imgdiv)

        # Now, remove the sidebar
        soup.find(name='div', attrs={'id':'sidebar'}).extract()

        # Remove the rating-image (stars) from the h3
        for img in first_h3.findAll(name='img'):
            img.extract()

        # Mark the intro texts as italic
        for div in soup.findAll(name='div', attrs={'class':'intro'}):
            for p in div.findAll('p'):
                content = self.tag_to_string(p)
                new_p = "<p><i>"+ content +"</i></p>"
                p.replaceWith(new_p)

        return soup

    def get_cover(self, soup):
        cover_url = None
        cover_item = soup.find('div', attrs = {'class': 'cover_image'})
        if cover_item:
            cover_url = 'http://www.brandeins.de/' + cover_item.img['src']
        return cover_url

    def parse_index(self):
        feeds = []

        archive = "http://www.brandeins.de/archiv.html"

        issue = self.default_issue
        if self.username:
            try:
                issue = int(self.username)
            except ValueError:
                # Not a number: keep the default issue
                pass

        soup = self.index_to_soup(archive)
        latest_jahrgang = soup.findAll('div', attrs={'class': re.compile(r'\bjahrgang-latest\b') })[0].findAll('ul')[0]
        pre_latest_issue = latest_jahrgang.findAll('a')[len(latest_jahrgang.findAll('a'))-issue]
        url = pre_latest_issue.get('href', False)
        # Build the magazine title from the cover's title attribute: take the issue and year
        self.title = "brand eins "+ re.search(r"(?P<date>\d\d\/\d\d\d\d)", pre_latest_issue.find('img').get('title', False)).group('date')
        url = 'http://brandeins.de/'+url

        # url = "http://www.brandeins.de/archiv/magazin/tierisch.html"
        titles_and_articles = self.brand_eins_parse_latest_issue(url)
        if titles_and_articles:
            for title, articles in titles_and_articles:
                feeds.append((title, articles))
        return feeds

    def brand_eins_parse_latest_issue(self, url):
        soup = self.index_to_soup(url)
        self.cover_url = self.get_cover(soup)
        article_lists = [soup.find('div', attrs={'class':'subColumnLeft articleList'}), soup.find('div', attrs={'class':'subColumnRight articleList'})]

        titles_and_articles = []
        current_articles = []
        chapter_title = "Editorial"
        self.log('Found Chapter:', chapter_title)

        # Remove the last list of links (that's just the impressum and the 'gewinnspiel')
        article_lists[1].findAll('ul')[len(article_lists[1].findAll('ul'))-1].extract()

        for article_list in article_lists:
            for chapter in article_list.findAll('ul'):
                if len(chapter.findPreviousSiblings('h3')) >= 1:
                    new_chapter_title = string.capwords(self.tag_to_string(chapter.findPreviousSiblings('h3')[0]))
                    if new_chapter_title != chapter_title:
                        titles_and_articles.append([chapter_title, current_articles])
                        current_articles = []
                        self.log('Found Chapter:', new_chapter_title)
                        chapter_title = new_chapter_title
                for li in chapter.findAll('li'):
                    a = li.find('a', href = True)
                    if a is None:
                        continue
                    title = self.tag_to_string(a)
                    url = a.get('href', False)
                    if not url or not title:
                        continue
                    url = 'http://brandeins.de/'+url
                    if len(a.parent.findNextSiblings('p')) >= 1:
                        description = self.tag_to_string(a.parent.findNextSiblings('p')[0])
                    else:
                        description = ''

                    self.log('\t\tFound article:', title)
                    self.log('\t\t\t', url)
                    self.log('\t\t\t', description)

                    current_articles.append({'title': title, 'url': url, 'description': description, 'date':''})
        titles_and_articles.append([chapter_title, current_articles])
        return titles_and_articles
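
As a standalone illustration of the title handling in `parse_index`: the recipe pulls an `MM/YYYY` date out of the cover image's title attribute with a regular expression and appends it to "brand eins". The sketch below uses the same pattern; the helper name `issue_title` and the sample cover titles are made up, and the fallback to the plain title when no date is found is an added assumption (the recipe itself does not guard against a missing match).

```python
import re

# Same pattern as in the recipe: two digits, a slash, four digits.
DATE_RE = re.compile(r"(?P<date>\d\d/\d\d\d\d)")


def issue_title(cover_title, base="brand eins"):
    """Build the e-book title from a cover title string (hypothetical helper)."""
    match = DATE_RE.search(cover_title or "")
    if match is None:
        # Assumption: fall back to the plain magazine name if no date is found.
        return base
    return base + " " + match.group("date")


print(issue_title("Titelbild Ausgabe 11/2010"))  # brand eins 11/2010
print(issue_title("Titelbild"))                  # brand eins
```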


Ciao,
Steffen

Consti
11-21-2010, 08:22 AM
Hi Steffen!

Thanks for the info. I've pushed your changes to the repository.

@all: The newest version of the script (including Steffen's changes) can be found here:
https://github.com/consti/BrandEins-Recipe/raw/master/brandeins.recipe

Starson17
11-21-2010, 10:24 AM
siebert wrote:
> NEW: Remove "This article was downloaded by calibre from..." section from bottom of each page.

I haven't looked at your site or recipe, but you should be aware that this feature is used by many people who have readers that can access the web. Removing it often decreases the value of a recipe.

siebert
11-21-2010, 11:22 AM
Starson17 wrote:
> I haven't looked at your site or recipe, but you should be aware that this feature is used by many people who have readers that can access the web. Removing it often decreases the value of a recipe.

I don't understand why you would bother creating an offline copy of the content via calibre if you want to read it online in a browser.

Perhaps it makes more sense for other sites, but the brand eins recipe fetches the monthly print magazine from the online archive, and the EPUB contains all the relevant content of the web pages. I therefore see no point in having a link on every single page, and I doubt that brand eins would include such links if they provided an EPUB version of their magazine (which they currently don't).

I would prefer a single notice with a link at the beginning and/or the end of the EPUB file to give credit to calibre and refer to the source. It would be perfect if a recipe could easily switch between "link on every page" and "link at beginning and end of EPUB" behavior.

Ciao,
Steffen

Starson17
11-21-2010, 01:02 PM
siebert wrote:
> I don't get why you bother to create an offline copy of the content via calibre if you want to read it online via a browser?

1) You do realize that the recipe removes advertisements and other less relevant content, don't you?
2) In addition to removing advertisements, many recipes remove related links. I remove them when I write a recipe, but I may want to look at them for some articles.
3) I'm not always connected to the web.

siebert
11-22-2010, 04:20 AM
Starson17 wrote:
> 1) You do realize that the recipe removes advertisements and other less relevant content, don't you?


I don't remember seeing any ads in the brand eins archive, but it's possible that Adblock Plus just hides them from me.

Apart from that, I would consider the removal of ads a feature.


Starson17 wrote:
> 2) In addition to removing advertisements, many recipes remove related links. I remove them when I write a recipe, but I may want to look at them for some articles.


My goal for the brand eins recipe is to create a substitute for the official EPUB version of the brand eins magazine, which doesn't exist yet (they only sell the printed magazine).

Fortunately it's rather easy, as all content of back issues is available as HTML pages in their online archive.

The EPUB should be self-contained: all relevant content of the web pages (which hopefully carry all relevant content of the printed magazine) should be included. What the recipe removes is just the web framework for navigation etc., which is shown on the brand eins website but not in the printed magazine, so it's neither necessary nor wanted in the EPUB.

If everything interesting is included in the EPUB, there is no point having a link to the source webpage, as I wouldn't follow it because there is nothing to gain.

Of course the generated EPUB should include some credit to calibre plus a link to the index web page used to fetch the content, but only once, at the beginning and/or the end of the EPUB, not on every single page.

Ciao,
Steffen

Starson17
11-22-2010, 11:45 AM
siebert wrote:
> If everything interesting is included in the EPUB, there is no point having a link to the source webpage, as I wouldn't follow it because there is nothing to gain.

I won't try to convince you of my viewpoint, if you'll grant me the same. The issue isn't what you or I think is best, it's what is consistent and expected by other recipe users who run a built-in calibre recipe. We can always customize the recipe to any result we like, and we can offer that customization to others by including the needed code and a note in the description/recipe comments on how to use it.

Consti
11-26-2010, 12:56 AM
I've reverted Steffen's changes until further notice.
I have to look into the changes. Sorry for including them so quickly.

I am in Beijing right now, so I'll look into it as soon as I am back home.

@steffen: Sorry for reverting the changes. Let's talk about it as soon as I am back (should be in a week or so :) )

fritzifratz
04-07-2011, 02:39 PM
Hi Steffen and Consti,

thanks for putting this recipe together. Unfortunately, I am facing problems using it. Whenever I try to pull the articles from the website using Calibre, I get the following error log. Can you advise?

File "site-packages\calibre\web\feeds\news.py", line 872, in build_index
File "c:\users\f\appdata\local\temp\calibre_0.7.53_tmp_uy9v4j\calibre_0.7.53_qykifp_recipes\recipe0.py", line 103, in parse_index
issue_list = soup.findAll('div', attrs={'class': 'tx-brandeinsmagazine-pi1'})[0].findAll('a')
IndexError: list index out of range

Thanks and Best Regards!
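
The `IndexError` in the traceback above comes from indexing `[0]` into an empty `findAll` result, which happens whenever the expected element is not in the fetched page (changed layout, blocked or failed download, and so on). A minimal sketch of a guard that turns this into a more readable error; `first_match` is a hypothetical helper, not something calibre or the recipe provides, and it works on any list-like result.

```python
def first_match(matches, description):
    """Return the first element of `matches`, or raise a descriptive error.

    Hypothetical helper; `matches` can be any list, e.g. the result of
    soup.findAll(...).
    """
    if not matches:
        raise ValueError(
            "No %s found; the page layout may have changed or the "
            "download may have been blocked" % description)
    return matches[0]


# Stand-in for an empty findAll() result:
try:
    first_match([], "issue list")
except ValueError as err:
    print(err)
```

In a recipe this might be used as `first_match(soup.findAll('div', attrs={...}), 'issue list')` in place of the bare `[0]`.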

Consti
04-09-2011, 06:38 PM
Hello FritziFratz!

I'll take a look at the BrandEins recipe tomorrow/today (this Sunday :) ).
I haven't checked the recipe for a long time.
The source is available here:
https://github.com/consti/BrandEins-Recipe

https://github.com/consti/BrandEins-Recipe/raw/master/brandeins.recipe

I'll let you know what I find.

--
Consti

Consti
04-09-2011, 06:58 PM
I just managed to find time to test the BrandEins Recipe:
It works for me. Maybe the problem was that no previous issue was available: the current issue is only partially available, so by default we select the previous one, but if that is not available (e.g., in January) the recipe might break.

I've now officially included Steffen's changes in the recipe (again, sorry for keeping you waiting, Steffen!). I can live without the links at the bottom of each page (I never noticed them in the Kindle-formatted ebooks anyway).
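
One way to guard against too few issues being listed is to clamp the requested offset to what is actually available. This is only a sketch under that assumption, not the fix that went into calibre; `pick_issue_link` is a hypothetical name and the sample issue labels are invented.

```python
def pick_issue_link(links, issue):
    """Pick the entry `issue` positions from the end of `links`.

    Hypothetical helper: issue=1 is the last (newest) entry, issue=2 the
    one before it, and so on. If fewer entries exist than requested, the
    oldest available entry is returned instead of raising IndexError.
    """
    if not links:
        raise ValueError("no issues listed")
    offset = min(max(issue, 1), len(links))
    return links[len(links) - offset]


print(pick_issue_link(["09/2010", "10/2010", "11/2010"], 2))  # 10/2010
print(pick_issue_link(["01/2011"], 2))                        # 01/2011
```

Note that in the January case this falls back to the only listed (and possibly incomplete) issue; a fuller solution would also have to look at the previous year's list.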

Thanks for your contributions (@Steffen), they really made the whole recipe a lot better!

@FritziFratz Let me know if the recipe works for you now. I am using the latest version of calibre and the version of the recipe bundled with it.

fritzifratz
04-10-2011, 04:45 AM
Hi Consti,

thanks a lot for checking. In parallel to this thread, I also posted my question in another thread of this forum. Steffen already helped me and the issue is resolved. The problem was a setting of my desktop firewall :-( Sorry that I bugged you with this.

See: http://www.mobileread.com/forums/showthread.php?t=114128

Thanks for your work on putting this recipe together and for your quick reply, Consti. Have a great Sunday!

siebert
04-11-2011, 04:58 AM
Consti wrote:
> Maybe the problem was that no previous issue was available: the current issue is only partially available, so by default we select the previous one, but if that is not available (e.g., in January) the recipe might break.


I already fixed this error in the official calibre brand eins recipe; see commit 7415: http://bazaar.launchpad.net/~kovid/calibre/trunk/revision/7415

Ciao,
Steffen