Enhanced brand eins recipe

siebert · 11-21-2010, 07:37 AM

Hi,

I took the liberty of enhancing the existing brand eins recipe.

Here is my changelog:
NEW: The issue to download can be selected via the username field.
NEW: Add cover image.
NEW: Prevent the conversion date from being appended to the title.
NEW: Remove "This article was downloaded by calibre from..." section from bottom of each page.
FIXED: "brand eins" is written in lowercase.

And here is the recipe:

Code:

#!/usr/bin/env  python
# -*- coding: utf-8 mode: python -*-

__license__   = 'GPL v3'
__copyright__ = '2010, Constantin Hofstetter <consti at consti.de>, Steffen Siebert <calibre at steffensiebert.de>'
__version__   = '0.96'

''' http://brandeins.de - Wirtschaftsmagazin '''
import re
import string
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.web.feeds.templates import Template, CLASS
from lxml.html.builder import HTML, HEAD, TITLE, STYLE, DIV, BODY, BR, A, HR, UL

class MyNavBarTemplate(Template):
  """
  Same as calibre.web.feeds.templates.NavBarTemplate but without the
  'This article was downloaded by calibre from...'
  text at the bottom.
  """

  def _generate(self, bottom, feed, art, number_of_articles_in_feed,
                two_levels, url, __appname__, prefix='', center=True,
                extra_css=None, style=None):
    head = HEAD(TITLE('navbar'))
    if style:
      head.append(STYLE(style, type='text/css'))
    if extra_css:
      head.append(STYLE(extra_css, type='text/css'))

    if prefix and not prefix.endswith('/'):
      prefix += '/'
    align = 'center' if center else 'left'

    navbar = DIV(CLASS('calibre_navbar', 'calibre_rescale_70',
                       style='text-align:'+align))
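    if bottom:
      if not url.startswith('file://'):
        navbar.append(HR())
      else:
        # Keep at least one child in the navbar; otherwise the
        # iterchildren(...).next() calls below raise StopIteration.
        navbar.append(BR())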
    else:
      next = 'feed_%d'%(feed+1) if art == number_of_articles_in_feed - 1 \
          else 'article_%d'%(art+1)
      up = '../..' if art == number_of_articles_in_feed - 1 else '..'
      href = '%s%s/%s/index.html'%(prefix, up, next)
      navbar.text = '| '
      navbar.append(A('Next', href=href))
    href = '%s../index.html#article_%d'%(prefix, art)
    navbar.iterchildren(reversed=True).next().tail = ' | '
    navbar.append(A('Section Menu', href=href))
    href = '%s../../index.html#feed_%d'%(prefix, feed)
    navbar.iterchildren(reversed=True).next().tail = ' | '
    navbar.append(A('Main Menu', href=href))
    if art > 0 and not bottom:
      href = '%s../article_%d/index.html'%(prefix, art-1)
      navbar.iterchildren(reversed=True).next().tail = ' | '
      navbar.append(A('Previous', href=href))
    navbar.iterchildren(reversed=True).next().tail = ' | '
    if not bottom:
      navbar.append(HR())

    self.root = HTML(head, BODY(navbar))

class BrandEins(BasicNewsRecipe):

  title = u'brand eins'
  __author__ = 'Constantin Hofstetter'
  description = u'Wirtschaftsmagazin'
  publisher ='brandeins.de'
  category = 'politics, business, wirtschaft, Germany'
  use_embedded_content = False
  lang = 'de-DE'
  no_stylesheets = True
  encoding = 'utf-8'
  language = 'de'
  publication_type = 'magazine'
  needs_subscription = True
  # Prevent the conversion date from being appended to the title
  timefmt = ''

  # 2 is the last full magazine (default)
  # 1 is the newest (but not full)
  # 3 is one before 2 etc.
  # This value can be set via the username field.
  default_issue = 2

  keep_only_tags = [dict(name='div', attrs={'id':'theContent'}), dict(name='div', attrs={'id':'sidebar'}), dict(name='div', attrs={'class':'intro'}), dict(name='p', attrs={'class':'bodytext'}), dict(name='div', attrs={'class':'single_image'})]

  '''
  brandeins.de
  '''

  def __init__(self, options, log, progress_reporter):
    """ Constructor. """
    BasicNewsRecipe.__init__(self, options, log, progress_reporter)
    self.navbar = MyNavBarTemplate()
  
  def postprocess_html(self, soup, first):

    # Move the image of the sidebar right below the h3
    first_h3 = soup.find(name='div', attrs={'id':'theContent'}).find('h3')
    for imgdiv in soup.findAll(name='div', attrs={'class':'single_image'}):
      if len(first_h3.findNextSiblings('div', {'class':'intro'})) >= 1:
        # first_h3.parent.insert(2, imgdiv)
        first_h3.findNextSiblings('div', {'class':'intro'})[0].parent.insert(4, imgdiv)
      else:
        first_h3.parent.insert(2, imgdiv)

    # Now, remove the sidebar
    soup.find(name='div', attrs={'id':'sidebar'}).extract()

    # Remove the rating-image (stars) from the h3
    for img in first_h3.findAll(name='img'):
      img.extract()

    # Mark the intro texts as italic
    for div in soup.findAll(name='div', attrs={'class':'intro'}):
      for p in div.findAll('p'):
        content = self.tag_to_string(p)
        new_p = "<p><i>"+ content +"</i></p>"
        p.replaceWith(new_p)

    return soup

  def get_cover(self, soup):
    cover_url = None
    cover_item = soup.find('div', attrs = {'class': 'cover_image'})
    if cover_item:
      cover_url = 'http://www.brandeins.de/' + cover_item.img['src']
    return cover_url

  def parse_index(self):
    feeds = []

    archive = "http://www.brandeins.de/archiv.html"

    issue = self.default_issue
    if self.username:
      try:
        issue = int(self.username)
      except ValueError:
        # Non-numeric username: keep the default issue
        pass

    soup = self.index_to_soup(archive)
    latest_jahrgang = soup.findAll('div', attrs={'class': re.compile(r'\bjahrgang-latest\b') })[0].findAll('ul')[0]
    pre_latest_issue = latest_jahrgang.findAll('a')[len(latest_jahrgang.findAll('a'))-issue]
    url = pre_latest_issue.get('href', False)
    # Get the title for the magazine: build it from the cover title, taking the issue and year
    self.title = "brand eins "+ re.search(r"(?P<date>\d\d\/\d\d\d\d)", pre_latest_issue.find('img').get('title', False)).group('date')
    url = 'http://brandeins.de/'+url

    # url = "http://www.brandeins.de/archiv/magazin/tierisch.html"
    titles_and_articles = self.brand_eins_parse_latest_issue(url)
    if titles_and_articles:
      for title, articles in titles_and_articles:
        feeds.append((title, articles))
    return feeds

  def brand_eins_parse_latest_issue(self, url):
    soup = self.index_to_soup(url)
    self.cover_url = self.get_cover(soup)
    article_lists = [soup.find('div', attrs={'class':'subColumnLeft articleList'}), soup.find('div', attrs={'class':'subColumnRight articleList'})]

    titles_and_articles = []
    current_articles = []
    chapter_title = "Editorial"
    self.log('Found Chapter:', chapter_title)

    # Remove the last list of links (that's just the impressum and the 'gewinnspiel')
    article_lists[1].findAll('ul')[-1].extract()

    for article_list in article_lists:
      for chapter in article_list.findAll('ul'):
        if len(chapter.findPreviousSiblings('h3')) >= 1:
          new_chapter_title = string.capwords(self.tag_to_string(chapter.findPreviousSiblings('h3')[0]))
          if new_chapter_title != chapter_title:
            titles_and_articles.append([chapter_title, current_articles])
            current_articles = []
            self.log('Found Chapter:', new_chapter_title)
          chapter_title = new_chapter_title
        for li in chapter.findAll('li'):
          a = li.find('a', href = True)
          if a is None:
            continue
          title = self.tag_to_string(a)
          url = a.get('href', False)
          if not url or not title:
            continue
          url = 'http://brandeins.de/'+url
          if len(a.parent.findNextSiblings('p')) >= 1:
            description = self.tag_to_string(a.parent.findNextSiblings('p')[0])
          else:
            description = ''

          self.log('\t\tFound article:', title)
          self.log('\t\t\t', url)
          self.log('\t\t\t', description)

          current_articles.append({'title': title, 'url': url, 'description': description, 'date':''})
    titles_and_articles.append([chapter_title, current_articles])
    return titles_and_articles
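
The issue selection and title handling in parse_index can be tried outside calibre with a minimal sketch like the one below. The helper names issue_from_username and issue_title are hypothetical (they are not part of the recipe or of calibre), and the sample cover title is an assumed format; only the int() fallback and the date regex are taken from the recipe above.

```python
import re


def issue_from_username(username, default_issue=2):
    """Return the issue offset typed into the username field, or the default.

    Hypothetical helper: 1 is the newest (partial) issue, 2 the last
    complete one, 3 the one before that, and so on.
    """
    try:
        return int(username)
    except (TypeError, ValueError):
        # Empty, missing, or non-numeric input falls back to the default.
        return default_issue


def issue_title(cover_title, base='brand eins'):
    """Append the MM/YYYY date found in the cover title, if there is one."""
    match = re.search(r'(?P<date>\d\d/\d\d\d\d)', cover_title or '')
    if match is None:
        # No date found: keep the plain magazine name instead of failing.
        return base
    return '%s %s' % (base, match.group('date'))


if __name__ == '__main__':
    print(issue_from_username('3'))
    print(issue_title('brand eins 11/2010'))  # assumed cover title format
```

With this approach, entering 3 in the username field selects the issue before the last complete one, and anything that is not a number leaves the default in place.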

Ciao,
Steffen

Consti · 11-21-2010, 08:22 AM

Hi Steffen!

Thanks for the info. I've pushed your changes to the repository.

@all: The newest version of the script can be found here (including Steffen's changes!):
https://github.com/consti/BrandEins-...andeins.recipe

Starson17 · 11-21-2010, 10:24 AM

Quote:

Originally Posted by siebert

NEW: Remove "This article was downloaded by calibre from..." section from bottom of each page.

I haven't looked at your site or recipe, but you should be aware that this feature is used by many people who have readers that can access the web. Removing it often decreases the value of a recipe.

siebert · 11-21-2010, 11:22 AM

Quote:

Originally Posted by Starson17

I haven't looked at your site or recipe, but you should be aware that this feature is used by many people who have readers that can access the web. Removing it often decreases the value of a recipe.

I don't get why you bother to create an offline copy of the content via calibre if you want to read it online via a browser?

Perhaps it makes more sense for other sites, but the brand eins recipe fetches the monthly print magazine from the online archive, and the EPUB contains all the relevant content of the web pages. So I see no point in having a link on every single page, and I doubt that brand eins would include them if they provided an EPUB version of their magazine (which they currently don't).

I would prefer to have a single notice with link at the beginning and/or the end of the EPUB file to give credit to calibre and refer to the source; so it would be perfect if a recipe could easily switch between "link on every page" and "link at beginning and end of EPUB" behavior.

Ciao,
Steffen

Starson17 · 11-21-2010, 01:02 PM

Quote:

Originally Posted by siebert

I don't get why you bother to create an offline copy of the content via calibre if you want to read it online via a browser?

1) You do realize that the recipe removes advertisements and other less relevant content, don't you?
2) In addition to removing advertisements, many recipes remove related links. I remove them when I write a recipe, but I may want to look at them for some articles.
3) I'm not always connected to the web.

siebert · 11-22-2010, 04:20 AM

Quote:

Originally Posted by Starson17

1) You do realize that the recipe removes advertisements and other less relevant content, don't you?

I don't remember seeing any ads in the brand eins archive, but it's possible that Adblock plus just hides them from me.

Apart from that I would consider the removal of ads as a feature.

Quote:

2) In addition to removing advertisements, many recipes remove related links. I remove them when I write a recipe, but I may want to look at them for some articles.

My goal for the brand eins recipe is to create a substitute for the official EPUB version of the brand eins magazine, which doesn't exist yet (they only sell the printed magazine).

Fortunately it's rather easy, as all content of back issues is available as HTML pages in their online archive.

The EPUB should be self-contained, with all relevant content of the web pages (which hopefully have all relevant content of the printed magazine) included. What the recipe removes is just the web framework for navigation etc., which is shown on the brand eins webpage but not in the printed magazine, so it's neither necessary nor wanted in the EPUB.

If everything interesting is included in the EPUB, there is no point having a link to the source webpage, as I wouldn't follow it because there is nothing to gain.

Of course there should be some credit to calibre included in the generated EPUB plus a link to the index web page we used to fetch the content, but this should be included only once at the beginning and/or the end of the EPUB, not on every single page.

Ciao,
Steffen

Starson17 · 11-22-2010, 11:45 AM

Quote:

Originally Posted by siebert

If everything interesting is included in the EPUB, there is no point having a link to the source webpage, as I wouldn't follow it because there is nothing to gain.

I won't try to convince you of my viewpoint, if you'll grant me the same. The issue isn't what you or I think is best, it's what is consistent and expected by other recipe users who run a Calibre builtin recipe. We can always customize the recipe to any result we like, and we can offer that customization to others by including the needed code and a note in the description/recipe comments of how to use it.
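
One way to offer that kind of customization is a recipe-level switch. The sketch below is pure Python and does not use calibre's template classes; bottom_notice and per_page_link are hypothetical names, and only the notice wording and the file:// check are taken from the recipe earlier in the thread.

```python
def bottom_notice(url, appname='calibre', per_page_link=True):
    """Return the per-page source notice, or '' when it is switched off.

    Hypothetical helper for illustration only; it is not calibre API.
    """
    if not per_page_link:
        # The recipe author opted out of the per-page link.
        return ''
    if url.startswith('file://'):
        # Local files have no useful web source to point at.
        return ''
    return 'This article was downloaded by %s from %s' % (appname, url)


if __name__ == '__main__':
    page = 'http://www.brandeins.de/archiv.html'
    print(bottom_notice(page))
    print(repr(bottom_notice(page, per_page_link=False)))
```

Keeping the default at True would preserve the behavior other recipe users expect, while a comment in the recipe could tell readers how to flip the flag.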

Consti · 11-26-2010, 12:56 AM

I've reverted Steffen's changes until further notice.
I have to look into the changes. Sorry for including them so fast.

I am in Beijing right now, so I'll look into it as soon as I am back home.

@steffen: Sorry for reverting the changes. Let's talk about it as soon as I am back (should be in a week or so).

fritzifratz · 04-07-2011, 02:39 PM

Hi Steffen and Consti,

thanks for putting this recipe together. Unfortunately, I am facing problems using it. Whenever I try to pull the articles from the website using Calibre, I get the following error log. Can you advise?

File "site-packages\calibre\web\feeds\news.py", line 872, in build_index
File "c:\users\f\appdata\local\temp\calibre_0.7.53_tmp_ uy9v4j\calibre_0.7.53_qykifp_recipes\recipe0.py", line 103, in parse_index
issue_list = soup.findAll('div', attrs={'class': 'tx-brandeinsmagazine-pi1'})[0].findAll('a')
IndexError: list index out of range

Thanks and Best Regards!

Consti · 04-09-2011, 06:38 PM

Hello FritziFratz!

I'll take a look at the BrandEins recipe tomorrow/today (this Sunday).
I haven't checked the recipe for a long time.
The source is available here:
https://github.com/consti/BrandEins-Recipe

https://github.com/consti/BrandEins-...andeins.recipe

I'll let you know what I find.

--
Consti

Consti · 04-09-2011, 06:58 PM

I just managed to find time to test the BrandEins Recipe:
It works for me. Maybe the problem was that no previous issue was available: the current issue is only partially available, so by default we select the previous issue, but if that is not available (e.g., in January) it might break.
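
A guard against that case could look like the minimal sketch below. It assumes, as the recipe's indexing implies, that the archive links are ordered oldest to newest; pick_issue is a hypothetical helper, and falling back to the newest available issue is one possible policy, not necessarily what the bundled recipe does.

```python
def pick_issue(links, issue=2):
    """Pick the issue-th newest entry from a list ordered oldest to newest.

    Hypothetical helper: issue=1 is the newest entry, issue=2 the one
    before it. Returns None when the list is empty.
    """
    if not links:
        return None
    # Clamp the offset so a short list (e.g. only one issue so far this
    # year) yields the newest available entry instead of a wrong or
    # missing one.
    offset = max(1, min(issue, len(links)))
    return links[-offset]


if __name__ == '__main__':
    print(pick_issue(['01/2011', '02/2011', '03/2011'], 2))
    print(pick_issue(['01/2011'], 2))
```

Without the clamp, a list with a single entry and an offset of 2 would raise an IndexError, and larger lists could silently wrap around to an unintended issue.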

I've now (again, sorry for keeping you waiting, Steffen!) officially included his changes in the Recipe. I can live without the links at the bottom of each page (I've never noticed them on the Kindle-formatted ebooks anyway).

Thanks for your contributions (@Steffen), they really made the whole recipe a lot better!

@FritziFratz Let me know if the recipe works for you now. I am using the latest version of calibre and the version of the recipe bundled with it.

fritzifratz · 04-10-2011, 04:45 AM

Hi Consti,

thanks a lot for checking. In parallel to this thread, I also posted my question in another thread of this forum. Steffen already helped me and the issue is resolved. The problem was a setting of my desktop firewall :-( Sorry that I bugged you with this.

See: https://www.mobileread.com/forums/sho...d.php?t=114128

Thanks for your work on putting this recipe together and your quick reply, Consti. Have a great Sunday!

siebert · 04-11-2011, 04:58 AM

Quote:

Originally Posted by Consti

Maybe the problem was that no previous issue was available: the current issue is only partially available, so by default we select the previous issue, but if that is not available (e.g., in January) it might break.

This error was already fixed by me in the official calibre brand-eins recipe, see commit 7415: http://bazaar.launchpad.net/~kovid/c.../revision/7415

Ciao,
Steffen

