Recipe: DER SPIEGEL?

ganymede · 02-20-2011, 04:18 AM

Is it possible to build a recipe for the SPIEGEL Magazine? At "m.spiegel.de/epaper.do" registered users get access to the SPIEGEL Magazine via iPhone (html, no pdf).

Gany

miwie · 02-21-2011, 01:50 AM

Look at the already existing recipe spiegelde. It works quite well.

ganymede · 02-21-2011, 01:53 AM

Quote:

Originally Posted by miwie

Look at the already existing recipe spiegelde. It works quite well.

it's not the magazine!

aerodynamik · 04-16-2011, 05:14 AM

Ganymede is right. There is Spiegel Online, and then there is the actual magazine. Some of the magazine articles make it online, but I think this is limited. In addition, IMO the writing for the magazine and for the online version sometimes differs a lot in quality.

I was looking into a recipe for the magazine, and since I don't have much experience with recipes I would be happy to get some ideas on how to tackle this one.

The layout of the online edition is very close to the actual printed magazine, i.e. page 1 is on 1.html, page 2 on 2.html, etc.
If there is a page with an ad, the html page exists and shows a page, but has no text-content, only the image of the page.

There is a table of contents which looks like this (I replaced the German class names with English ones):

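Spoiler:

Code:

<ul>
  <li class=majorSection>
    <ul>
      <li class=article>
        <a href="http://wissen.spiegel.de/wissen/epaper/SP/2010/40/27.html" title="Artikel S. 27">
          <span class="contentPage">27</span>
          <span class="minorSection">TERRORISMUS</span>
          <span class="header">Zweifel an ...</span>
        </a>
      </li>
      <li>...</li>
    </ul>
  </li>
  <li class=majorSection>...</li>
</ul>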

In addition, at the bottom of every page there is a link to the next and previous page. This navigation skips pages that have only ads. From the table of contents it also looks like ads are not directly linked.

I could not find any URLs that actually match the content, e.g. spiegel.de/..../deutschland instead of the page number, which would allow me to use the standard feeds approach of BasicNewsRecipe.

Is there a recipe that already parses a page similar to this, i.e. with page-number URLs or a similar table of contents layout, where I could take a peek?

Thanks in advance

Starson17 · 04-17-2011, 08:37 AM

Quote:

Originally Posted by aerodynamik

The layout of the online edition is very close to the actual printed magazine, i.e. page 1 is on 1.html, page 2 on 2.html, etc.

One option would be to calculate the article pages from this simple structure. GoComics does something similar.
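A minimal sketch of that idea, assuming the page-number layout described above (1.html, 2.html, ...). The helper name page_urls, the issue path and the page range are placeholders for illustration, and ad-only pages would still have to be filtered out afterwards.

```python
def page_urls(base_url, first_page, last_page):
    """Build one URL per printed page: <base>/1.html, <base>/2.html, ..."""
    base = base_url.rstrip('/')
    return ['%s/%d.html' % (base, n) for n in range(first_page, last_page + 1)]


# Example issue path; treat it as a placeholder for whatever issue is wanted.
urls = page_urls('http://wissen.spiegel.de/wissen/epaper/SP/2010/40', 1, 3)
for url in urls:
    print(url)
```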

Quote:

There is a table of content which looks like this

If you wish to use this TOC, you'd use parse_index. Look at the API.
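For orientation: parse_index has to return a list of (section title, list of article dicts) tuples, and each article dict needs at least a title and a url. The sketch below only shows that shape and runs without calibre; build_index and the sample entries are made up for illustration.

```python
def build_index(toc):
    """Turn [(section, [(title, url), ...]), ...] into the parse_index shape."""
    feeds = []
    for section_title, entries in toc:
        articles = [
            {'title': title, 'url': url, 'date': '', 'description': ''}
            for title, url in entries
        ]
        if articles:  # leave out sections that have no articles
            feeds.append((section_title, articles))
    return feeds


# Sample data only; a real recipe would read this from the table of contents.
toc = [
    ('Section A', [('Article 1', 'http://example.invalid/27.html')]),
    ('Section B', []),
]
feeds = build_index(toc)
print(feeds)
```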

Quote:

In addition, on the bottom of every page there is a link to the next and previous page.

To deal with this, you need multipage. A search for that word here will give some sample code.
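This is not the sample code from the forum search, just a generic sketch of what "multipage" means: keep following the next-page link and collect every page until there is no further link. The names collect_pages, fetch and find_next are assumptions; in a recipe the same loop would typically sit in a hook that post-processes the downloaded article.

```python
def collect_pages(fetch, start_url, find_next):
    """Follow next-page links from start_url and return all pages in order."""
    pages, seen, url = [], set(), start_url
    while url and url not in seen:  # stop at the last page or on a link loop
        seen.add(url)
        page = fetch(url)
        pages.append(page)
        url = find_next(page)
    return pages


# Stand-in data instead of real downloads: url -> (text, next url or None).
site = {'27.html': ('text A', '28.html'), '28.html': ('text B', None)}
pages = collect_pages(
    fetch=lambda url: site[url],
    start_url='27.html',
    find_next=lambda page: page[1],
)
print([text for text, _ in pages])
```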

aerodynamik · 04-17-2011, 12:51 PM

Deleting my old comment, since it is completely irrelevant.

I missed the most important information in this thread: "m.spiegel.de/epaper.do" (from ganymede's original post). This is a much simpler version: the index is on one page, there are no multipage articles, and no page holds multiple articles.

I can work on this less than an hour a day, but I should have something within a few days.

Thanks again for your help Starson, and sorry for the confusion.

aerodynamik · 04-19-2011, 05:52 PM

Here we go, a first recipe for the printed edition of Der Spiegel. You need a subscription to access it.

I tested it on my Kindle 3, looks very good. Would be great to get some more tests on other devices.

When you copy the script, replace the character ◆ in preprocess_regexps with "& # 9670 ;" (remove the spaces inside the quotes). My Kindle wasn't able to display this character correctly, so the recipe replaces it with a horizontal rule.
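To illustrate what that substitution does, here is a small standalone sketch with plain re. It uses the same (pattern, function) pair shape as preprocess_regexps; matching both the literal character and the entity in one pattern is a variation for this example, and apply_regexps and the sample HTML are made up.

```python
import re

# (pattern, replacement function) pairs, the same shape as preprocess_regexps
regexps = [
    (re.compile(r'<p>\s*(?:◆|&#9670;)\s*</p>', re.IGNORECASE), lambda match: '<hr>'),
]


def apply_regexps(html, pairs):
    """Run every (pattern, function) pair over the raw HTML in turn."""
    for pattern, func in pairs:
        html = pattern.sub(func, html)
    return html


sample = '<p>Absatz eins</p><p>&#9670;</p><p>Absatz zwei</p>'
cleaned = apply_regexps(sample, regexps)
print(cleaned)
```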

Spoiler:

Code:

#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2011, Nikolas Mangold <nmangold at gmail.com>'
'''
spiegel.de
'''
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime

class DerSpiegel(BasicNewsRecipe):
    title                  = 'Der Spiegel'
    __author__             = 'Nikolas Mangold'
    description            = 'Der Spiegel, Printed Edition. Access to paid content.'
    publisher              = 'SPIEGEL-VERLAG RUDOLF AUGSTEIN GMBH & CO. KG'
    category               = 'news, politics, Germany'
    no_stylesheets         = True
    encoding               = 'cp1252'
    needs_subscription     = True
    remove_empty_feeds     = True
    delay                  = 1
    PREFIX                 = 'http://m.spiegel.de'
    INDEX                  = PREFIX + '/spiegel/print/epaper/index-heftaktuell.html'
    use_embedded_content   = False
    masthead_url = 'http://upload.wikimedia.org/wikipedia/en/thumb/1/17/Der_Spiegel_logo.svg/200px-Der_Spiegel_logo.svg.png'
    language               = 'de'
    publication_type       = 'magazine'
    extra_css              = ' body{font-family: Arial,Helvetica,sans-serif} '
    timefmt = '[%W/%Y]'
    empty_articles = ['Titelbild']
    preprocess_regexps = [
        (re.compile(r'<p>◆</p>', re.DOTALL|re.IGNORECASE), lambda match: '<hr>'),
        ]

    def get_browser(self):
        # Predicate for select_form: pick the form that has the login field
        def has_login_name(form):
            try:
                form.find_control(name="f.loginName")
            except Exception:
                return False
            else:
                return True

        # Current calibre versions expect the recipe instance to be passed here
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open(self.PREFIX + '/meinspiegel/login.html')
            br.select_form(predicate=has_login_name)
            br['f.loginName'] = self.username
            br['f.password'] = self.password
            br.submit()
        return br

    remove_tags_before =  dict(attrs={'class':'spArticleContent'})
    remove_tags_after  =  dict(attrs={'class':'spArticleCredit'})
        
    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)

        cover = soup.find('img', width=248)
        if cover is not None:
            self.cover_url = cover['src']

        index = soup.find('dl')

        feeds = []
        for section in index.findAll('dt'):
            section_title = self.tag_to_string(section).strip()
            self.log('Found section ', section_title)

            articles = []
            for article in section.findNextSiblings(['dd','dt']):
                # Siblings continue past this section, so stop at the next dt
                if article.name == 'dt':
                    break
                link = article.find('a')
                if link is None:
                    # Entry without a link, nothing to download
                    continue
                title = self.tag_to_string(link).strip()
                if title in self.empty_articles:
                    continue
                self.log('Found article ', title)
                url = self.PREFIX + link['href']
                articles.append({'title' : title, 'date' : strftime(self.timefmt), 'url' : url})
            feeds.append((section_title,articles))
        return feeds

I am unhappy with this part of the code:

Code:

            
for article in section.findNextSiblings(['dd','dt']):
    if article.name == 'dt':
        break

with which I am trying to get all articles ('dd') underneath a section ('dt'):

Code:

<dt>section 1</dt>
 <dd class="spFirst">article 1</dd>
 <dd>article 2</dd>
<dt>section 2</dt>
 <dd>article 3</dd>

On its own, findNextSiblings would give me the 'dd's of all following sections, hence the break. I am sure there is a smarter and better way to solve this.

Is it okay to use a wikipedia image for the masthead image?

kovidgoyal · 04-19-2011, 08:06 PM

You can find dt first and then call find dd on each dt.

Using wikipedia images should be fine. Though it's better to use an image from the publisher so as not to load wikipedia's servers.

aerodynamik · 04-20-2011, 01:59 AM

Quote:

Originally Posted by kovidgoyal

You can find dt first and then call find dd on each dt.

Unlike e.g. <ul><li><li></ul>, a <dt> does not "include" its <dd>s; they are siblings.
When I iterate over all <dt>s and call findAll('dd') on each of them, I get all <dd>s of the overall index:

Code:

       
for section in index.findAll('dt'):
    section_title = self.tag_to_string(section).strip()
    self.log('Found section ', section_title)

    articles = []
    for article in section.findAll('dd'):
        # lists all dd's, including the ones that belong to the following dt sections
        pass

How would I know that I have stumbled into the dd's of the next dt section?

Regarding the masthead: all I could find on the publisher's website is the corresponding online logo. To avoid confusion between Spiegel Online and Der Spiegel I would stick to the wikipedia logo for now. There is an SVG source that renders the logo, does this help?

Starson17 · 04-20-2011, 09:24 AM

Quote:

Originally Posted by aerodynamik

How would I know I stumbled to the dd's in the next dt-section?

What about something like this:

Code:

 
for tag in index.findAll(name=['dt', 'dd']):
    if tag.name == 'dt':
        pass  # a new section starts here: remember its title
    elif tag.name == 'dd':
        pass  # an article: add it to the section seen last

kovidgoyal · 04-20-2011, 11:13 AM

If they're siblings then you essentially have to do something along the lines of what you did, i.e. keep track of the last seen dt and add dds to it.

Code:

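sections = []
current_section, current_articles = None, []
for x in index.findAll(['dt', 'dd']):
    if x.name == 'dt':
        # A new section starts: store the previous one if it has articles
        if current_section and current_articles:
            sections.append((current_section, current_articles))
        current_section, current_articles = self.tag_to_string(x), []
    else:
        current_articles.append(...)
# Store the last section, which no following dt would flush
if current_section and current_articles:
    sections.append((current_section, current_articles))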
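The "remember the last dt" idea can be tried outside calibre as well. The following sketch is an assumption-laden stand-in: it uses the standard library's html.parser instead of the recipe's soup, the class name DefinitionListGrouper is made up, and unlike the loop above it keeps sections that have no articles. The sample markup is the dt/dd snippet from earlier in the thread.

```python
from html.parser import HTMLParser


class DefinitionListGrouper(HTMLParser):
    """Collect the text of each <dd> under the most recent <dt>."""

    def __init__(self):
        super().__init__()
        self.sections = []   # [(section title, [article titles])]
        self._tag = None     # 'dt' or 'dd' while inside such a tag

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ('dt', 'dd'):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == 'dt':
            self.sections.append((text, []))    # a new section starts
        elif self._tag == 'dd' and self.sections:
            self.sections[-1][1].append(text)   # belongs to the last dt seen


html = '''
<dt>section 1</dt>
 <dd class="spFirst">article 1</dd>
 <dd>article 2</dd>
<dt>section 2</dt>
 <dd>article 3</dd>
'''
grouper = DefinitionListGrouper()
grouper.feed(html)
grouper.close()
print(grouper.sections)
```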

JanMB · 06-03-2012, 03:01 AM

Hi, I am a SPIEGEL subscriber. I am using calibre to download the print version and convert it for an e-book reader (mine is a Kindle 3). It has been working great for more than a year. Today the download didn't work and I got a failure report (see attachment). I suppose a change in the recipe might help, but I have no idea how to do it. Can anybody please help? Thank you very much. Jan

aerodynamik · 06-03-2012, 03:56 AM

The website is not working. If you go to m.spiegel.de/epaper.do, you already get an error message. Through the link "Der Spiegel" on the bottom right you can select the current issue (http://m.spiegel.de/spiegel/print/ep...x-2012-22.html), but this one does not work either.

Jan,
I don't have an account with Der Spiegel anymore. Can you find a working link on the website?

JanMB · 06-03-2012, 04:20 AM

Hi, it looks like their service is down. I was not able to open the older issues either. Should I write them an e-mail? Once they fix the service, I should be able to retrieve the magazine again.

JanMB · 06-04-2012, 11:49 AM

Yes, it was SPIEGEL's fault. Now it is working again.


