Old 02-20-2011, 04:18 AM   #1
ganymede
Connoisseur
ganymede began at the beginning.
 
Posts: 57
Karma: 10
Join Date: Nov 2009
Device: Kindle 3
Recipe: DER SPIEGEL?

Is it possible to build a recipe for the SPIEGEL Magazine? At "m.spiegel.de/epaper.do" registered users get access to the SPIEGEL Magazine via iPhone (html, no pdf).

Gany
Old 02-21-2011, 01:50 AM   #2
miwie
Connoisseur
miwie began at the beginning.
 
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
Look at the already existing recipe spiegelde. It works quite well.
Old 02-21-2011, 01:53 AM   #3
ganymede
Connoisseur
ganymede began at the beginning.
 
Posts: 57
Karma: 10
Join Date: Nov 2009
Device: Kindle 3
Quote:
Originally Posted by miwie View Post
Look at the already existing recipe spiegelde. It works quite well.
It's not the magazine!
Old 04-16-2011, 05:14 AM   #4
aerodynamik
Enthusiast
aerodynamik doesn't litter
 
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
Ganymede is right. There is Spiegel Online, and then there is the actual magazine. Some of the magazine articles make it online, but I think this is limited. In addition, IMO the quality of the writing sometimes differs a lot between the magazine and the online version.

I was looking into a recipe for the magazine, and since I don't have much experience with recipes I would be happy to get some ideas on how to tackle this one.

The layout of the online edition is very close to the actual printed magazine, i.e. page 1 is on 1.html, page 2 on 2.html, etc.
If a page holds an ad, the HTML page still exists but has no text content, only the image of the page.

There is a table of contents which looks like this (I replaced the German class names with English ones):
Spoiler:
Code:
<ul>
	<li class=majorSection>
		<ul>
			<li class=article>
				<a href="http://wissen.spiegel.de/wissen/epaper/SP/2010/40/27.html" title="Artikel S. 27">
					<span class="contentPage">27</span>
					<span class="minorSection">TERRORISMUS</span>
					<span class="header">Zweifel an ...</span>
				</a>
			</li>
			<li>...</li>
		</ul>
	</li>
	<li class=majorSection>...</li>
</ul>
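
As a rough illustration of how a table of contents shaped like the one above could be read, here is a minimal sketch using only Python's standard `html.parser`. The class names (`contentPage`, `minorSection`, `header`) are the English stand-ins from the snippet, not the real class names on the site, and `TocParser` and `parse_toc` are hypothetical names made up for this example. A real recipe would more likely use `index_to_soup` inside `parse_index`.

```python
from html.parser import HTMLParser


class TocParser(HTMLParser):
    """Collect one dict per <a> element found in the table of contents."""

    def __init__(self):
        super().__init__()
        self.articles = []      # finished articles, in document order
        self._current = None    # article being built while inside an <a>
        self._field = None      # class of the <span> we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self._current = {'url': attrs['href']}
        elif tag == 'span' and self._current is not None:
            self._field = attrs.get('class')

    def handle_data(self, data):
        # Only keep text that sits inside a classed <span> within an <a>
        if self._current is not None and self._field:
            text = data.strip()
            if text:
                self._current[self._field] = text

    def handle_endtag(self, tag):
        if tag == 'span':
            self._field = None
        elif tag == 'a' and self._current is not None:
            self.articles.append(self._current)
            self._current = None


def parse_toc(html):
    """Return a list of article dicts (url plus one key per span class)."""
    parser = TocParser()
    parser.feed(html)
    parser.close()
    return parser.articles


SAMPLE = '''
<ul>
  <li class=majorSection>
    <ul>
      <li class=article>
        <a href="http://wissen.spiegel.de/wissen/epaper/SP/2010/40/27.html" title="Artikel S. 27">
          <span class="contentPage">27</span>
          <span class="minorSection">TERRORISMUS</span>
          <span class="header">Zweifel an ...</span>
        </a>
      </li>
    </ul>
  </li>
</ul>
'''

for article in parse_toc(SAMPLE):
    print(article['contentPage'], article['minorSection'], article['url'])
```

Each resulting dict carries the page number, the minor section and the headline next to the URL, which is roughly the information a `parse_index` implementation would need per article.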


In addition, at the bottom of every page there is a link to the next and previous page. This navigation skips pages that contain only ads. Judging from the table of contents, ads are not linked directly either.

I could not find any URLs that name the content rather than the page number, e.g. spiegel.de/..../deutschland, which would allow me to use the standard feeds approach of BasicNewsRecipe.

Is there a recipe that already parses a page similar to this, i.e. with page-number URLs or a similar table of contents layout, where I could take a peek?

Thanks in advance

Last edited by aerodynamik; 04-16-2011 at 05:17 AM. Reason: Fixed code section and added spoiler section for readability
Old 04-17-2011, 08:37 AM   #5
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by aerodynamik View Post
The layout of the online edition is very close to the actual printed magazine, i.e. page 1 is on 1.html, page 2 on 2.html, etc.
One option would be to calculate the article pages from this simple structure. GoComics does something similar.
Quote:
There is a table of contents which looks like this
If you wish to use this TOC, you'd use parse_index. Look at the API.
Quote:
In addition, at the bottom of every page there is a link to the next and previous page.
To deal with this, you need multipage. A search for that word here will give some sample code.
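
To make the first option concrete, here is a minimal sketch of calculating the page URLs from the numbering scheme. The URL pattern is an assumption taken from the single link in the table of contents earlier in the thread, reading `2010/40` as year and issue number, and `page_urls` is a hypothetical helper name. Ad pages would still come back as pages without text, so a recipe would have to skip those.

```python
# Assumed URL pattern, derived from the one example link in the thread
BASE = 'http://wissen.spiegel.de/wissen/epaper/SP/{year}/{issue}/{page}.html'


def page_urls(year, issue, first_page, last_page):
    """Return the URL of every page from first_page to last_page inclusive."""
    return [BASE.format(year=year, issue=issue, page=page)
            for page in range(first_page, last_page + 1)]


for url in page_urls(2010, 40, 27, 29):
    print(url)
```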
Old 04-17-2011, 12:51 PM   #6
aerodynamik
Enthusiast
aerodynamik doesn't litter
 
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
Deleting my old comment, since it is completely irrelevant.

I missed the most important information in this thread: "m.spiegel.de/epaper.do" (from ganymede's original post). This is a much simpler version: the index is on one page, there are no multipage articles, and no page holds multiple articles.

I can work on this less than an hour a day, but I should have something within a few days.

Thanks again for your help Starson, and sorry for the confusion.

Last edited by aerodynamik; 04-19-2011 at 02:01 AM.
Old 04-19-2011, 05:52 PM   #7
aerodynamik
Enthusiast
aerodynamik doesn't litter
 
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
Der Spiegel Recipe (printed edition)

Here we go, a first recipe for the printed edition of Der Spiegel. You need a subscription to access it.

I tested it on my Kindle 3 and it looks very good. It would be great to get some more tests on other devices.

When you copy the script, replace the character ◆ with "& # 9670 ;" (remove the spaces inside the quotes). My Kindle wasn't able to display this character correctly, so the recipe replaces it with a horizontal rule.
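
The substitution can be tried outside calibre with plain `re`. This is only a sketch: the pattern below is slightly broader than the one in the recipe, since it accepts both the literal character (U+25C6) and the numeric entity and tolerates whitespace inside the paragraph, and `replace_diamonds` is a hypothetical helper name.

```python
import re

# Matches a paragraph that holds only the diamond, written either as the
# literal character U+25C6 or as the numeric entity &#9670;
DIAMOND_PARAGRAPH = re.compile(r'<p>\s*(?:\u25c6|&#9670;)\s*</p>', re.IGNORECASE)


def replace_diamonds(html):
    """Swap every diamond-only paragraph for a horizontal rule."""
    return DIAMOND_PARAGRAPH.sub('<hr>', html)


print(replace_diamonds('<p>first part</p><p>&#9670;</p><p>second part</p>'))
```

In a recipe the same compiled pattern would go into `preprocess_regexps`, paired with a function that returns `'<hr>'`.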

Spoiler:
Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2011, Nikolas Mangold <nmangold at gmail.com>'
'''
spiegel.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime
import re

class DerSpiegel(BasicNewsRecipe):
    title                  = 'Der Spiegel'
    __author__             = 'Nikolas Mangold'
    description            = 'Der Spiegel, Printed Edition. Access to paid content.'
    publisher              = 'SPIEGEL-VERLAG RUDOLF AUGSTEIN GMBH & CO. KG'
    category               = 'news, politics, Germany'
    no_stylesheets         = True
    encoding               = 'cp1252'
    needs_subscription     = True
    remove_empty_feeds     = True
    delay                  = 1
    PREFIX                 = 'http://m.spiegel.de'
    INDEX                  = PREFIX + '/spiegel/print/epaper/index-heftaktuell.html'
    use_embedded_content   = False
    masthead_url = 'http://upload.wikimedia.org/wikipedia/en/thumb/1/17/Der_Spiegel_logo.svg/200px-Der_Spiegel_logo.svg.png'
    language               = 'de'
    publication_type       = 'magazine'
    extra_css              = ' body{font-family: Arial,Helvetica,sans-serif} '
    timefmt = '[%W/%Y]'
    empty_articles = ['Titelbild']
    preprocess_regexps = [
        (re.compile(r'<p>◆</p>', re.DOTALL|re.IGNORECASE), lambda match: '<hr>'),
        ]

    def get_browser(self):
        def has_login_name(form):
            # Pick the form that contains the login-name control
            try:
                form.find_control(name="f.loginName")
            except Exception:
                return False
            else:
                return True

        # Recent calibre versions define get_browser as an instance method
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open(self.PREFIX + '/meinspiegel/login.html')
            br.select_form(predicate=has_login_name)
            br['f.loginName'] = self.username
            br['f.password'] = self.password
            br.submit()
        return br

    remove_tags_before =  dict(attrs={'class':'spArticleContent'})
    remove_tags_after  =  dict(attrs={'class':'spArticleCredit'})
        
    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)

        cover = soup.find('img', width=248)
        if cover is not None:
            self.cover_url = cover['src']

        index = soup.find('dl')

        feeds = []
        for section in index.findAll('dt'):
            section_title = self.tag_to_string(section).strip()
            self.log('Found section ', section_title)

            articles = []
            for article in section.findNextSiblings(['dd','dt']):
                # Stop at the next section heading
                if article.name == 'dt':
                    break
                link = article.find('a')
                if link is None:
                    continue
                title = self.tag_to_string(link).strip()
                if title in self.empty_articles:
                    continue
                self.log('Found article ', title)
                url = self.PREFIX + link['href']
                articles.append({'title' : title, 'date' : strftime(self.timefmt), 'url' : url})
            feeds.append((section_title,articles))
        return feeds


I am unhappy with this code
Code:
            
for article in section.findNextSiblings(['dd','dt']):
    if article.name == 'dt':
        break
with which I am trying to get all articles ('dd') underneath a section ('dt')
Code:
<dt>section 1</dt>
 <dd class="spFirst">article 1</dd>
 <dd>article 2</dd>
<dt>section 2</dt>
 <dd>article 3</dd>
On its own, findNextSiblings would give me the 'dd's of all following sections, which is why I break at the next 'dt', but I am sure there is a smarter and better way to solve this.

Is it okay to use a Wikipedia image for the masthead image?

Last edited by aerodynamik; 04-19-2011 at 05:54 PM.
Old 04-19-2011, 08:06 PM   #8
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can find dt first and then call find dd on each dt.

Using Wikipedia images should be fine, though it's better to use an image from the publisher so as not to load Wikipedia's servers.
Old 04-20-2011, 01:59 AM   #9
aerodynamik
Enthusiast
aerodynamik doesn't litter
 
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
Quote:
Originally Posted by kovidgoyal View Post
You can find dt first and then call find dd on each dt.
Unlike e.g. <ul><li><li></ul>, where the list contains its items, <dt> does not "include" <dd>.
When I iterate over all <dt>s and call findAll('dd') on each of them, I get every dd in the overall index:
Code:
       
for section in index.findAll('dt'):
    section_title = self.tag_to_string(section).strip()
    self.log('Found section ', section_title)

    articles = []
    for article in section.findAll('dd'):
        # lists all dd's, including those that belong to the following dt's
        pass
How would I know that I have run into the dd's of the next dt section?

Regarding the masthead: all I could find on the publisher's website is the corresponding online logo. To avoid confusion between Spiegel Online and Der Spiegel I would stick to the Wikipedia logo for now. There is an SVG source that renders the logo. Does this help?
Old 04-20-2011, 09:24 AM   #10
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by aerodynamik View Post
How would I know that I have run into the dd's of the next dt section?
What about something like this:
Code:
 
for section in index.findAll(name=['dt', 'dd']):
    if section.name == 'dt':
        pass  # do something, e.g. start a new section
    elif section.name == 'dd':
        pass  # do something else, e.g. add the article to the current section
Old 04-20-2011, 11:13 AM   #11
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If they're siblings then you essentially have to do something along the lines of what you did, i.e. keep track of the last seen dt and add dds to it.

Code:
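current_section, current_articles = None, []
sections = []
for x in index.findAll(['dt', 'dd']):
    if x.name == 'dt':
        # A new section starts: store the previous one first
        if current_section and current_articles:
            sections.append((current_section, current_articles))
        current_section, current_articles = self.tag_to_string(x), []
    else:
        current_articles.append(...)
# Store the last section, which no further dt follows
if current_section and current_articles:
    sections.append((current_section, current_articles))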
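
The same bookkeeping can be tried without calibre or BeautifulSoup. In this sketch plain `(tag_name, text)` tuples stand in for the dt and dd tags, and `group_by_section` is a hypothetical helper name; the sample data mirrors the dt/dd example posted earlier in the thread.

```python
def group_by_section(items):
    """Group dd entries under the most recently seen dt.

    items is an iterable of (tag_name, text) pairs in document order.
    """
    feeds = []
    current_section, current_articles = None, []
    for name, text in items:
        if name == 'dt':
            # A new section starts: store the previous one if it has articles
            if current_section is not None and current_articles:
                feeds.append((current_section, current_articles))
            current_section, current_articles = text, []
        elif name == 'dd' and current_section is not None:
            current_articles.append(text)
    # No further dt follows the last section, so store it here
    if current_section is not None and current_articles:
        feeds.append((current_section, current_articles))
    return feeds


items = [
    ('dt', 'section 1'),
    ('dd', 'article 1'),
    ('dd', 'article 2'),
    ('dt', 'section 2'),
    ('dd', 'article 3'),
]
print(group_by_section(items))
```

Sections without any articles are dropped, which is similar in effect to what `remove_empty_feeds = True` asks calibre to do.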
Old 06-03-2012, 03:01 AM   #12
JanMB
Junior Member
JanMB began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Oct 2011
Device: Kindle
SPIEGEL download failure

Hi, I am a SPIEGEL subscriber. I am using calibre to download the print version and convert it for an e-book reader (mine is a Kindle 3). It has been working great for more than a year. Today the download didn't work and I got a failure report (see attachment). I suppose a change in the recipe might help, but I have no idea how to do it. Can anybody please help? Thank you very much. Jan
Attached Files
File Type: txt SPIEGEL download failure notice.txt (3.8 KB, 242 views)
Old 06-03-2012, 03:56 AM   #13
aerodynamik
Enthusiast
aerodynamik doesn't litter
 
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
The website is not working. If you go to m.spiegel.de/epaper.do, you already get an error message. Through the link "Der Spiegel" on the bottom right you can select the current issue (http://m.spiegel.de/spiegel/print/ep...x-2012-22.html), but this one does not work either.

Jan,
I don't have an account with Der Spiegel anymore. Can you find a working link on the website?
Old 06-03-2012, 04:20 AM   #14
JanMB
Junior Member
JanMB began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Oct 2011
Device: Kindle
SPIEGEL error message

Hi, it looks like their service is down. I was not able to open the older issues either. Should I write them an e-mail? Once they fix the service, I should be able to retrieve the magazine again.
Old 06-04-2012, 11:49 AM   #15
JanMB
Junior Member
JanMB began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Oct 2011
Device: Kindle
SPIEGEL fixed it

Yes, it was SPIEGEL's fault. Now it is working again.