MobileRead Forums - View Single Post

achims · 11-05-2011, 07:01 PM

Hi,

this week's Zeit edition includes an extra giveaway: the Beilage, which only appears every once in a while. I have updated the recipe to include the Beilage whenever the Zeit Premium web page offers it.

I have also tweaked the metadata of the Magazin a bit (the same applies to the Beilage). Because of a bug in 'calibredb add', which ignores its title and author options, the recipe falls back on the file name as a workaround. Unfortunately, this adds a random string to the title. I have now moved that string to the end of the title, where it should be less distracting. Note that if you have calibre's ebook import option set to

Code:

(?P<title>.+) - (?P<author>[^_]+)

, then the Author tag will be set accordingly.
I also updated the recipe so that, once this bug is fixed, the random string issue will disappear without any further change to the recipe.
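
For illustration, here is a minimal sketch of how that pattern splits a file name of the shape the recipe produces. The sample name (edition `45_11`, random part `abc123`) is made up, and the step that turns underscores into spaces before matching is my assumption about the import, not something taken from calibre's code.

```python
import re

# The import pattern quoted above.
FILENAME_PATTERN = re.compile(r'(?P<title>.+) - (?P<author>[^_]+)')


def split_title_author(stem):
    """Return (title, author) for a file name without extension, or None."""
    # Assumption: underscores are treated as spaces before the pattern is applied.
    normalized = stem.replace('_', ' ')
    match = FILENAME_PATTERN.search(normalized)
    if match is None:
        return None
    return match.group('title').strip(), match.group('author').strip()


# Hypothetical file name: prefix + random part + suffix, as built by the recipe.
sample = 'Zeit_Magazin_45_11___abc123_-_Zeitverlag Gerd Bucerius GmbH und Co. KG'
result = split_title_author(sample)
print(result)
```

The random part stays attached to the end of the title, while the author comes out clean.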

In response to Tobias, I have changed the 'calibredb add' system call to include the 'duplicates' option. This way, calling the recipe several times (e.g. with different format switches) creates a new book entry for each call. I hope this resolves his problem.
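
In case someone wants to adapt the call, here is a rough sketch of how it could be assembled. The helper name `build_add_command` and the placeholder directory are mine; `-d` (duplicates), `-1` (one book per directory), `-t` and `-a` are the `calibredb add` switches the recipe itself uses. The sketch only builds and prints the command, it does not run it.

```python
def build_add_command(directory, title=None, author=None, allow_duplicates=True):
    """Assemble the argument list for a 'calibredb add' call (nothing is executed)."""
    cmd = ['calibredb', 'add']
    if allow_duplicates:
        cmd.append('-d')   # add the book even if calibre already has such an entry
    cmd.append('-1')       # treat all files in the directory as formats of one book
    if title:
        cmd += ['-t', title]
    if author:
        cmd += ['-a', author]
    cmd.append(directory)
    return cmd


# '/tmp/DZdir' is a placeholder for the temporary download directory.
print(build_add_command('/tmp/DZdir'))
```

Keeping the arguments in a list rather than one long string means the author name needs no manual quoting if the list is handed to something like `subprocess.call`.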

Spoiler:

Code:

import re, zipfile, os
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryDirectory
from calibre.ptempfile import PersistentTemporaryFile
from urlparse import urlparse

# Optional downloads; the epub itself is always fetched.
GET_MOBI=False
GET_PDF=False
GET_AUDIO=False
GET_MAGAZIN=False
GET_BEILAGE=False

class ZeitPremiumAllFormats(BasicNewsRecipe):
    title          = u'Zeit Premium All Formats'
    # English summary of the description below:
    # Downloads all e-book formats offered for the current week from the
    # Zeit Premium area (paid subscription): Die Zeit as epub, mobi, pdf and
    # all audio files as a zip. They are entered in the calibre database as
    # one single book. The Zeit Magazin and, if available, the Beilage are
    # each added as a separate pdf book. For technical reasons a duplicate
    # Die Zeit entry is created that holds a modified version of the epub;
    # this entry can be deleted. All formats except epub can be switched on
    # or off. Note: during the changeover to a new issue (Wednesday evening)
    # not all formats are renewed at the same time, so the formats in one
    # calibre entry may belong to different issues. If an entry for the
    # current week's Zeit already exists, it is not updated, so delete it
    # first. Tested under Unix; the recipe may not work on other systems.
    description    = u'Lädt alle angebotenen E-Book Formate der aktuellen Woche aus dem Zeit Premium Bereich (kostenpflichtiges Abo): Die Zeit als epub, mobi, pdf und alle Audiofiles als zip. Sie werden in der Calibre Datenbank als ein einziges Buch eingetragen. Das Zeit Magazin und ggfls. die Beilage als pdf als je eigenständiges Buch. Aus technischen Gründen wird ein doppelter Bucheintrag der Zeit erstellt, der ein epub in einer abgewandelten Version erhält. Dieser Eintrag kann gelöscht werden. Alle Formate ausser epub können ein- oder ausgeschaltet werden. Anmerkung: Während der Umstellung auf eine neue Ausgabe (Mittwoch abends) werden nicht alle Formate gleichzeitig erneuert. Im Calibre Eintrag können dann die verschiedenen Formate zu verschiedenen Ausgaben gehören! Sollte schon ein Eintrag der Zeit der aktuellen Woche existieren, wird nicht aktualisiert, also vorher löschen! ___Getestet unter Unix___ - unter anderen Betriebssystemen funktioniert dieses recipe möglicherweise nicht.'
    __author__ = 'Achim Schumacher'
    language = 'de'
    needs_subscription = True
    conversion_options = {
        'no_default_epub_cover' : True,
    }

    #
    # Login process required:
    # Override BasicNewsRecipe.get_browser()
    #
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        # new login process
        domain = "https://premium.zeit.de"
        response = br.open(domain)
        # Get rid of nested form
        response.set_data(re.sub('<div><form action=.*', '', response.get_data() ))
        br.set_response(response)
        br.select_form(nr=2)
        br.form['name']=self.username
        br.form['pass']=self.password
        br.submit()
        return br


    # Do not fetch news and convert them to E-Books.
    # Instead, download the epub directly from the site.
    # For this, override BasicNewsRecipe.build_index()
    #
    def build_index(self):
        browser = self.get_browser()

        # find the links
        epublink = browser.find_link(text_regex=re.compile('.*Ausgabe als Datei im ePub-Format.*'))
        mobilink = browser.find_link(text_regex=re.compile('.*Ausgabe als Datei im Mobi-Format.*'))
        pdflink = browser.find_link(text_regex=re.compile('.*Download der gesamten Ausgabe als PDF Datei.*'))
        audiolink = browser.find_link(text_regex=re.compile('.*Alle Audios der aktuellen ZEIT.*'))
        edition = (urlparse(pdflink.url)[2]).replace('/system/files/epaper/DZ/pdf/DZ_ePaper_','').replace('.pdf','')
        zm_url = urlparse(pdflink.base_url)[0]+'://'+urlparse(pdflink.base_url)[1]+''+(urlparse(pdflink.url)[2]).replace('DZ/pdf/DZ_ePaper','ZM/pdf/ZM_ePaper')
        bl_url = urlparse(pdflink.base_url)[0]+'://'+urlparse(pdflink.base_url)[1]+''+(urlparse(pdflink.url)[2]).replace('DZ/pdf/DZ_ePaper','BL/pdf/BL_ePaper')
        print "Found epub-link: %s" % epublink.url
        print "Found Mobi-link: %s" % mobilink.url
        print "Found pdf-link: %s" % pdflink.url
        print "Found audio-link: %s" % audiolink.url
        print "Will try ZM-link: %s" % zm_url
        print "Will try BL-link: %s" % bl_url
        print "This edition is: %s" % edition

        # The following part is from a recipe by Starson17
        #
        # It modifies build_index, which is the method that gets the 
        # masthead image and cover, parses the feed for articles, retrieves
        # the articles, removes tags from articles, etc. All of those steps 
        # ultimately produce a local directory structure that looks like an 
        # unzipped EPUB. 
        #
        # This part grabs the link to one EPUB, saves the EPUB locally,
        # extracts it, and passes the result back into the recipe system
        # as though all the other steps had been completed normally.
        #
        # This has to be done even if one does not want to use this
        # calibre-modified epub; otherwise the recipe runs into an error.
        # This is why a second Die Zeit entry shows up in the calibre db.
        self.report_progress(0,_('downloading epub'))
        response = browser.follow_link(epublink)
        # We need separate directories for Die Zeit, Zeit Magazin and Beilage
        DZdir = PersistentTemporaryDirectory()
        ZMdir = PersistentTemporaryDirectory()
        BLdir = PersistentTemporaryDirectory()
        epub_file = PersistentTemporaryFile(suffix='.epub',dir=DZdir)
        epub_file.write(response.read())
        epub_file.close()
        zfile = zipfile.ZipFile(epub_file.name, 'r')
        self.report_progress(0.1,_('extracting epub'))
        zfile.extractall(self.output_dir)
        zfile.close()
        index = os.path.join(self.output_dir, 'content.opf')
        self.report_progress(0.2,_('epub downloaded and extracted'))

        #
        # Now, download the remaining files
        #
        print "output_dir is: %s" % self.output_dir
        print "DZdir is: %s" % DZdir
        print "ZMdir is: %s" % ZMdir
        print "BLdir is: %s" % BLdir

        if (GET_MOBI):
           self.report_progress(0.3,_('downloading mobi'))
           mobi_file = PersistentTemporaryFile(suffix='.mobi',dir=DZdir)
           browser.back()
           response = browser.follow_link(mobilink)
           mobi_file.write(response.read())
           mobi_file.close()

        if (GET_PDF):
           self.report_progress(0.4,_('downloading pdf'))
           pdf_file = PersistentTemporaryFile(suffix='.pdf',dir=DZdir)
           browser.back()
           response = browser.follow_link(pdflink)
           pdf_file.write(response.read())
           pdf_file.close()

        if (GET_AUDIO):
           self.report_progress(0.5,_('downloading audio'))
           audio_file = PersistentTemporaryFile(suffix='.mp3.zip',dir=DZdir)
           browser.back()
           response = browser.follow_link(audiolink)
           audio_file.write(response.read())
           audio_file.close()

        # Get all Die Zeit formats into Calibre's database
        self.report_progress(0.6,_('Adding Die Zeit to Calibre db'))
        cmd = "calibredb add -d -1 " + DZdir
        os.system(cmd)

        # Zeit Magazin has to be handled differently.
        # First, it has to be downloaded into its own directory, since it
        # is a different book than Die Zeit.
        # Second, we know its url rather than its link.
        # Third, there is no Metadata present, so we need to give it
        # a proper name so that calibre will set Author and Title at import.
        # Unfortunately, the present solution includes a random part in the
        # name which after db import has to be manually resolved by the user.
        if (GET_MAGAZIN):
           self.report_progress(0.7,_('downloading ZM'))
           author = "Zeitverlag Gerd Bucerius GmbH und Co. KG"
           title="Zeit Magazin "+edition
           ZM_file = PersistentTemporaryFile(prefix='Zeit_Magazin_'+edition+'___',suffix='_-_'+author+'.pdf',dir=ZMdir)
           try:
              response = browser.open(zm_url)
              ZM_file.write(response.read())
              ZM_file.close()
              # Get Zeit Magazin into Calibre's database
              self.report_progress(0.8,_('Adding Zeit Magazin to Calibre db'))
              cmd = "calibredb add -a \""+author+"\" -t \""+title+"\" " + ZMdir
              print cmd
              os.system(cmd)
           except Exception:
              self.report_progress(0.8,_('No Zeit Magazin found...'))

        # Zeit Beilage is technically the same as Zeit Magazin, but it is
        # not included in every edition. So, the use of try: is 
        # obligatory here.
        if (GET_BEILAGE):
           self.report_progress(0.9,_('downloading BL'))
           author = "Zeitverlag Gerd Bucerius GmbH und Co. KG"
           title="Zeit Beilage "+edition
           BL_file = PersistentTemporaryFile(prefix='Zeit_Beilage_'+edition+'___',suffix='_-_'+author+'.pdf',dir=BLdir)
           try:
              response = browser.open(bl_url)
              BL_file.write(response.read())
              BL_file.close()
              # Get Zeit Beilage into Calibre's database
              self.report_progress(0.9,_('Adding Zeit Beilage to Calibre db'))
              cmd = "calibredb add -a \""+author+"\" -t \""+title+"\" " + BLdir
              print cmd
              os.system(cmd)
           except Exception:
              self.report_progress(0.9,_('No Zeit Beilage found...'))

        return index
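
To make the URL handling in build_index easier to follow, here is a small standalone sketch of how the edition string and the Magazin/Beilage URLs are derived from the PDF link. The function name and the sample URLs (edition `45_11`) are made up for illustration; the path fragments are the ones used in the recipe. The import is written so that it runs under both Python 2 (as the recipe does) and Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as in the recipe


def derive_edition_and_urls(base_url, pdf_url):
    """Return (edition, zm_url, bl_url) derived from the Die Zeit PDF link."""
    base = urlparse(base_url)
    path = urlparse(pdf_url).path
    # The edition is whatever remains of the file name once the fixed parts are cut off.
    edition = (path.replace('/system/files/epaper/DZ/pdf/DZ_ePaper_', '')
                   .replace('.pdf', ''))
    root = base.scheme + '://' + base.netloc
    # Magazin and Beilage are assumed to live next to the Die Zeit PDF.
    zm_url = root + path.replace('DZ/pdf/DZ_ePaper', 'ZM/pdf/ZM_ePaper')
    bl_url = root + path.replace('DZ/pdf/DZ_ePaper', 'BL/pdf/BL_ePaper')
    return edition, zm_url, bl_url


# Hypothetical values; the real ones come from the link found on the page.
edition, zm_url, bl_url = derive_edition_and_urls(
    'https://premium.zeit.de/abo/zeit',
    '/system/files/epaper/DZ/pdf/DZ_ePaper_45_11.pdf')
print(edition)
print(zm_url)
print(bl_url)
```

Since the Magazin and Beilage URLs are only guessed from the Die Zeit one, the download can fail, which is why the recipe wraps those requests in try.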

Have fun
Achim