11-19-2010, 03:22 PM | #1 |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Recipe for Zeit Abo EPUB download
Hi,
"Die Zeit" provides its current issue as an EPUB download for subscribers. With the following recipe, calibre can download the EPUB file from the protected webpage.
Ciao, Steffen
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

__license__ = 'GPL v3'
__copyright__ = '2010, Steffen Siebert <calibre at steffensiebert.de>'
__docformat__ = 'restructuredtext de'

"""
Die Zeit EPUB
"""

import os, urllib2, zipfile, cookielib, re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class ZeitEPUBAbo(BasicNewsRecipe):

    title = u'Zeit EPUB Abo'
    description = u'Das EPUB Abo der Zeit'
    language = 'de'
    lang = 'de-DE'

    __author__ = 'Steffen Siebert'
    needs_subscription = True

    conversion_options = {
        'no_default_epub_cover' : True
    }

    def build_index(self):
        cookie_jar = cookielib.LWPCookieJar()
        cookie_handler = urllib2.HTTPCookieProcessor(cookie_jar)
        auth_handler = urllib2.HTTPBasicAuthHandler()
        auth_handler.add_password(realm='ZEIT_online Angebote',
                                  uri="http://premium.zeit.de",
                                  user=self.username,
                                  passwd=self.password)
        opener = urllib2.build_opener(cookie_handler, auth_handler)
        urllib2.install_opener(opener)

        domain = "http://premium.zeit.de"
        url = domain + "/abovorteile/cgi-bin/_er_member/p4z.fpl?ER_Do=getUserData&ER_NextTemplate=login_ok"

        try:
            f = urllib2.urlopen(url)
        except urllib2.HTTPError:
            self.report_progress(0,_("Can't login to download issue"))
            raise ValueError('Failed to login, check your username and password')

        soup = self.index_to_soup(f.read())
        link = soup.find('a', href=re.compile('.*Abo_RedirectTo=epaper.zeit.de/index_abovorteile.php&user=.*'))
        if not link:
            self.report_progress(0,_("Can't find first link."))
            raise ValueError('Failed to find first link. Look for updated recipe.')

        url = domain + link["href"]

        try:
            f = urllib2.urlopen(url)
        except urllib2.HTTPError:
            self.report_progress(0,_("Can't login to download issue"))
            raise ValueError('Failed to login, check your username and password')

        soup = self.index_to_soup(f.read())
        link = soup.find('a', href=re.compile('^http://contentserver.hgv-online.de/nodrm/fulfillment\\?distributor=zeit-online&orderid=zeit_online.*'))
        if not link:
            self.report_progress(0,_("Can't find second link."))
            raise ValueError('Failed to find second link. Look for updated recipe.')

        url = link["href"]

        try:
            f = urllib2.urlopen(url)
        except urllib2.HTTPError:
            self.report_progress(0,_("Can't login to download issue"))
            raise ValueError('Failed to login, check your username and password')

        tmp = PersistentTemporaryFile(suffix='.epub')
        self.report_progress(0,_('downloading epub'))
        tmp.write(f.read())
        tmp.close()

        zfile = zipfile.ZipFile(tmp.name, 'r')
        self.report_progress(0,_('extracting epub'))

        zfile.extractall(self.output_dir)

        tmp.close()
        index = os.path.join(self.output_dir, 'content.opf')

        self.report_progress(1,_('epub downloaded and extracted'))

        return index
|
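A side note on the link patterns used above: the dots in the host names are not escaped, so each `.` matches any character. That is harmless here, but a stricter pattern is easy to build. The following is only a sketch using the standard `re` module; the helper name `link_pattern` is made up for illustration and is not part of the recipe or of calibre.

```python
import re

def link_pattern(prefix):
    """Compile a regex matching any URL that starts with the literal prefix.

    re.escape() quotes '.', '?' and the other regex metacharacters,
    so the prefix is matched character for character.
    """
    return re.compile('^' + re.escape(prefix) + '.*')

# Literal prefix taken from the recipe's second link pattern.
PREFIX = ('http://contentserver.hgv-online.de/nodrm/fulfillment'
          '?distributor=zeit-online&orderid=zeit_online')

pattern = link_pattern(PREFIX)
```

The compiled pattern could be passed to `soup.find('a', href=...)` in the same way as the hand-written one.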
11-19-2010, 03:31 PM | #2 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Can I suggest you use get_browser instead of urllib2? That way you won't need to do all the cookie handling, and the user's proxy settings will be used automatically.
|
11-20-2010, 06:21 AM | #3 |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Hi,
here is the new version using the mechanize module (which is very useful, but needs better documentation). Ciao, Steffen Code:
#!/usr/bin/env python
# -*- coding: utf-8 mode: python -*-

__license__ = 'GPL v3'
__copyright__ = '2010, Steffen Siebert <calibre at steffensiebert.de>'
__docformat__ = 'restructuredtext de'
__version__ = '1.1'

"""
Die Zeit EPUB
"""

import os, urllib2, zipfile, re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class ZeitEPUBAbo(BasicNewsRecipe):

    title = u'Zeit EPUB Abo'
    description = u'Das EPUB Abo der Zeit'
    language = 'de'
    lang = 'de-DE'

    __author__ = 'Steffen Siebert'
    needs_subscription = True

    conversion_options = {
        'no_default_epub_cover' : True
    }

    def build_index(self):
        domain = "http://premium.zeit.de"
        url = domain + "/abovorteile/cgi-bin/_er_member/p4z.fpl?ER_Do=getUserData&ER_NextTemplate=login_ok"

        browser = self.get_browser()
        browser.add_password("http://premium.zeit.de", self.username, self.password)

        try:
            browser.open(url)
        except urllib2.HTTPError:
            self.report_progress(0,_("Can't login to download issue"))
            raise ValueError('1: Failed to login, check your username and password')

        response = browser.follow_link(text="DIE ZEIT als E-Paper")
        response = browser.follow_link(url_regex=re.compile('^http://contentserver.hgv-online.de/nodrm/fulfillment\\?distributor=zeit-online&orderid=zeit_online.*'))

        tmp = PersistentTemporaryFile(suffix='.epub')
        self.report_progress(0,_('downloading epub'))
        tmp.write(response.read())
        tmp.close()

        zfile = zipfile.ZipFile(tmp.name, 'r')
        self.report_progress(0,_('extracting epub'))

        zfile.extractall(self.output_dir)

        tmp.close()
        index = os.path.join(self.output_dir, 'content.opf')

        self.report_progress(1,_('epub downloaded and extracted'))

        return index
|
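The recipe assumes that the package file is called `content.opf` and sits at the top level of the archive. That holds for the Zeit EPUB as described here, but an EPUB in general names its package file in `META-INF/container.xml`. Here is a rough sketch of looking it up instead of hard-coding it, using only the standard library. The function name `extract_epub` is made up for this example; it is not a calibre API and not what the recipe does.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# XML namespace of the OCF container file found in every EPUB.
CONTAINER_NS = '{urn:oasis:names:tc:opendocument:xmlns:container}'

def extract_epub(epub_path, output_dir):
    """Unpack an EPUB into output_dir and return the path of its OPF file."""
    with zipfile.ZipFile(epub_path, 'r') as zf:
        zf.extractall(output_dir)
    container = os.path.join(output_dir, 'META-INF', 'container.xml')
    root = ET.parse(container).getroot()
    rootfile = root.find('.//%srootfile' % CONTAINER_NS)
    if rootfile is None:
        raise ValueError('No rootfile entry in META-INF/container.xml')
    # full-path always uses '/' separators; convert to the local convention.
    return os.path.join(output_dir, *rootfile.get('full-path').split('/'))
```

In `build_index` the returned path would take the place of the hard-coded `os.path.join(self.output_dir, 'content.opf')`.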
11-28-2010, 12:01 PM | #4 |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
As some users have issues with link navigation in the EPUB file created by calibre (see https://www.mobileread.com/forums/showthread.php?t=90005) while the original EPUB works fine, I patched calibre to handle downloaded EPUB files without modifying them.
As I'm new to bazaar, I couldn't find a way to export this patch alone, so it also contains my previous patch posted in https://www.mobileread.com/forums/sho...d.php?t=108656 Here is my updated recipe (works only with a patched calibre!): Code:
#!/usr/bin/env python
# -*- coding: utf-8 mode: python -*-

__license__ = 'GPL v3'
__copyright__ = '2010, Steffen Siebert <calibre at steffensiebert.de>'
__docformat__ = 'restructuredtext de'
__version__ = '1.2'

"""
Die Zeit EPUB
"""

import os, urllib2, zipfile, re
from calibre.web.feeds.news import BasicNewsRecipe

class ZeitEPUBAbo(BasicNewsRecipe):

    title = u'Zeit EPUB Abo'
    description = u'Das EPUB Abo der Zeit'
    language = 'de'
    lang = 'de-DE'

    __author__ = 'Steffen Siebert'
    needs_subscription = True

    conversion_options = {
        'no_default_epub_cover' : True
    }

    def build_index(self):
        domain = "http://premium.zeit.de"
        url = domain + "/abovorteile/cgi-bin/_er_member/p4z.fpl?ER_Do=getUserData&ER_NextTemplate=login_ok"
        epubName = os.path.join(self.output_dir, 'result.epub')

        browser = self.get_browser()
        browser.add_password(domain, self.username, self.password)

        try:
            browser.open(url)
        except urllib2.HTTPError:
            self.report_progress(0,_("Can't login to download issue"))
            raise ValueError('1: Failed to login, check your username and password')

        response = browser.follow_link(text="DIE ZEIT als E-Paper")
        response = browser.follow_link(url_regex=re.compile('^http://contentserver.hgv-online.de/nodrm/fulfillment\\?distributor=zeit-online&orderid=zeit_online.*'))

        self.report_progress(0,_('downloading epub'))
        f = open(epubName, "wb")
        f.write(response.read())
        f.close()

        return epubName
Code:
# Bazaar merge directive format 2 (Bazaar 0.90) # revision_id: siebert@steffensiebert.de-20101128162138-\ # lq8k3tkgv4im2f7o # target_branch: http://bazaar.launchpad.net/~kovid/calibre/trunk/ # testament_sha1: 070aa8a68cee7f89dd88061b913e61ac6490dc42 # timestamp: 2010-11-28 17:23:24 +0100 # base_revision_id: kovid@kovidgoyal.net-20101128023305-\ # 0ew07r4bzia4bb0t # # Begin patch === modified file 'src/calibre/ebooks/conversion/plumber.py' --- src/calibre/ebooks/conversion/plumber.py 2010-11-20 04:26:57 +0000 +++ src/calibre/ebooks/conversion/plumber.py 2010-11-28 16:21:38 +0000 @@ -838,6 +838,15 @@ self.dump_input(self.oeb, tdir) if self.abort_after_input_dump: return + oebExt = os.path.splitext(self.oeb)[1] + outExt = os.path.splitext(self.output)[1] + if outExt.lower() == oebExt.lower(): + self.log("Result is already in the correct format, no further processing necessary.") + shutil.copyfile(self.oeb, self.output) + self.log(self.output_fmt.upper(), 'output written to', self.output) + self.flush() + return + if self.input_fmt in ('recipe', 'downloaded_recipe'): self.opts_to_mi(self.user_metadata) if not hasattr(self.oeb, 'manifest'): === modified file 'src/calibre/web/feeds/__init__.py' --- src/calibre/web/feeds/__init__.py 2010-09-13 16:15:35 +0000 +++ src/calibre/web/feeds/__init__.py 2010-11-28 13:24:14 +0000 @@ -14,6 +14,11 @@ from calibre import entity_to_unicode, strftime from calibre.utils.date import dt_factory, utcnow, local_tz +FEED_NAME = 'feed%d.html' +''' Template for the feed index file. ''' +ARTICLE_NAME = 'feed%d_article%d.html' +''' Template for the article file. 
''' + class Article(object): def __init__(self, id, title, url, author, summary, published, content): === modified file 'src/calibre/web/feeds/input.py' --- src/calibre/web/feeds/input.py 2010-09-17 18:02:43 +0000 +++ src/calibre/web/feeds/input.py 2010-11-28 16:21:38 +0000 @@ -102,8 +102,11 @@ disabled = getattr(ro, 'recipe_disabled', None) if disabled is not None: raise RecipeDisabled(disabled) - ro.download() + index = ro.download() self.recipe_object = ro + if index.endswith('.epub'): + # The result is already in EPUB format, no need to search for .opf file. + return os.path.abspath(index) for key, val in self.recipe_object.conversion_options.items(): setattr(opts, key, val) === modified file 'src/calibre/web/feeds/news.py' --- src/calibre/web/feeds/news.py 2010-11-04 22:26:10 +0000 +++ src/calibre/web/feeds/news.py 2010-11-28 13:24:14 +0000 @@ -21,7 +21,7 @@ from calibre.web import Recipe from calibre.ebooks.metadata.toc import TOC from calibre.ebooks.metadata import MetaInformation -from calibre.web.feeds import feed_from_xml, templates, feeds_from_index, Feed +from calibre.web.feeds import feed_from_xml, templates, feeds_from_index, Feed, FEED_NAME, ARTICLE_NAME from calibre.web.fetch.simple import option_parser as web2disk_option_parser from calibre.web.fetch.simple import RecursiveFetcher from calibre.utils.threadpool import WorkRequest, ThreadPool, NoResultsPending @@ -912,16 +912,10 @@ self.feed_objects = feeds for f, feed in enumerate(feeds): - feed_dir = os.path.join(self.output_dir, 'feed_%d'%f) - if not os.path.isdir(feed_dir): - os.makedirs(feed_dir) for a, article in enumerate(feed): if a >= self.max_articles_per_feed: break - art_dir = os.path.join(feed_dir, 'article_%d'%a) - if not os.path.isdir(art_dir): - os.makedirs(art_dir) try: url = self.print_version(article.url) except NotImplementedError: @@ -934,12 +928,12 @@ func, arg = (self.fetch_embedded_article, article) if self.use_embedded_content else \ ((self.fetch_obfuscated_article if 
self.articles_are_obfuscated \ else self.fetch_article), url) - req = WorkRequest(func, (arg, art_dir, f, a, len(feed)), + req = WorkRequest(func, (arg, self.output_dir, f, a, len(feed)), {}, (f, a), self.article_downloaded, self.error_in_article_download) req.feed = feed req.article = article - req.feed_dir = feed_dir + req.feed_dir = self.output_dir self.jobs.append(req) @@ -961,8 +955,7 @@ for f, feed in enumerate(feeds): html = self.feed2index(f,feeds) - feed_dir = os.path.join(self.output_dir, 'feed_%d'%f) - with open(os.path.join(feed_dir, 'index.html'), 'wb') as fi: + with open(os.path.join(self.output_dir, FEED_NAME%f), 'wb') as fi: fi.write(html) self.create_opf(feeds) self.report_progress(1, _('Feeds downloaded to %s')%index) @@ -1148,9 +1141,7 @@ ref.title = 'Masthead Image' opf.guide.append(ref) - manifest = [os.path.join(dir, 'feed_%d'%i) for i in range(len(feeds))] - manifest.append(os.path.join(dir, 'index.html')) - manifest.append(os.path.join(dir, 'index.ncx')) + manifest = [dir, os.path.join(dir, 'index.html'), os.path.join(dir, 'index.ncx')] # Get cover cpath = getattr(self, 'cover_path', None) @@ -1183,7 +1174,6 @@ f = feeds[num] for j, a in enumerate(f): if getattr(a, 'downloaded', False): - adir = 'feed_%d/article_%d/'%(num, j) auth = a.author if not auth: auth = None @@ -1192,14 +1182,15 @@ desc = None else: desc = self.description_limiter(desc) - entries.append('%sindex.html'%adir) + indexname = ARTICLE_NAME%(num, j) + entries.append(indexname) po = self.play_order_map.get(entries[-1], None) if po is None: self.play_order_counter += 1 po = self.play_order_counter - parent.add_item('%sindex.html'%adir, None, a.title if a.title else _('Untitled Article'), + parent.add_item(indexname, None, a.title if a.title else _('Untitled Article'), play_order=po, author=auth, description=desc) - last = os.path.join(self.output_dir, ('%sindex.html'%adir).replace('/', os.sep)) + last = os.path.join(self.output_dir, (indexname).replace('/', os.sep)) for sp in 
a.sub_pages: prefix = os.path.commonprefix([opf_path, sp]) relp = sp[len(prefix):] @@ -1226,7 +1217,7 @@ if len(feeds) > 1: for i, f in enumerate(feeds): - entries.append('feed_%d/index.html'%i) + entries.append(FEED_NAME%i) po = self.play_order_map.get(entries[-1], None) if po is None: self.play_order_counter += 1 @@ -1237,11 +1228,11 @@ desc = getattr(f, 'description', None) if not desc: desc = None - feed_index(i, toc.add_item('feed_%d/index.html'%i, None, + feed_index(i, toc.add_item(FEED_NAME%i, None, f.title, play_order=po, description=desc, author=auth)) else: - entries.append('feed_%d/index.html'%0) + entries.append(FEED_NAME%0) feed_index(0, toc) for i, p in enumerate(entries): @@ -1253,7 +1244,7 @@ opf.render(opf_file, ncx_file) def article_downloaded(self, request, result): - index = os.path.join(os.path.dirname(result[0]), 'index.html') + index = os.path.join(os.path.dirname(result[0]), ARTICLE_NAME%request.requestID) if index != result[0]: if os.path.exists(index): os.remove(index) @@ -1263,7 +1254,7 @@ article = request.article self.log.debug('Downloaded article:', article.title, 'from', article.url) article.orig_url = article.url - article.url = 'article_%d/index.html'%a + article.url = ARTICLE_NAME%request.requestID article.downloaded = True article.sub_pages = result[1][1:] self.jobs_done += 1 === modified file 'src/calibre/web/feeds/templates.py' --- src/calibre/web/feeds/templates.py 2010-08-29 18:39:20 +0000 +++ src/calibre/web/feeds/templates.py 2010-11-28 13:24:14 +0000 @@ -12,6 +12,7 @@ TABLE, TD, TR from calibre import preferred_encoding, strftime, isbytestring +from calibre.web.feeds import FEED_NAME, ARTICLE_NAME def CLASS(*args, **kwargs): # class is a reserved word in Python kwargs['class'] = ' '.join(args) @@ -92,7 +93,7 @@ for i, feed in enumerate(feeds): if feed: li = LI(A(feed.title, CLASS('feed', 'calibre_rescale_120', - href='feed_%d/index.html'%i)), id='feed_%d'%i) + href=FEED_NAME%i)), id='feed_%d'%i) ul.append(li) div = DIV( 
PT(IMG(src=masthead,alt="masthead"),style='text-align:center'), @@ -115,14 +116,14 @@ hr.tail = '| ' if f+1 < len(feeds): - link = A('Next section', href='../feed_%d/index.html'%(f+1)) + link = A('Next section', href=FEED_NAME%(f+1)) link.tail = ' | ' navbar.append(link) - link = A('Main menu', href="../index.html") + link = A('Main menu', href="index.html") link.tail = ' | ' navbar.append(link) if f > 0: - link = A('Previous section', href='../feed_%d/index.html'%(f-1)) + link = A('Previous section', href=FEED_NAME%(f-1)) link.tail = ' |' navbar.append(link) if top: @@ -203,20 +204,19 @@ navbar.append(BR()) navbar.append(BR()) else: - next = 'feed_%d'%(feed+1) if art == number_of_articles_in_feed - 1 \ - else 'article_%d'%(art+1) - up = '../..' if art == number_of_articles_in_feed - 1 else '..' - href = '%s%s/%s/index.html'%(prefix, up, next) + next = FEED_NAME%(feed+1) if art == number_of_articles_in_feed - 1 \ + else ARTICLE_NAME%(feed, art+1) + href = next navbar.text = '| ' navbar.append(A('Next', href=href)) - href = '%s../index.html#article_%d'%(prefix, art) + href = FEED_NAME%feed + '#article_%d'%art navbar.iterchildren(reversed=True).next().tail = ' | ' navbar.append(A('Section Menu', href=href)) - href = '%s../../index.html#feed_%d'%(prefix, feed) + href = 'index.html#feed_%d'%feed navbar.iterchildren(reversed=True).next().tail = ' | ' navbar.append(A('Main Menu', href=href)) if art > 0 and not bottom: - href = '%s../article_%d/index.html'%(prefix, art-1) + href = ARTICLE_NAME%(feed, art-1) navbar.iterchildren(reversed=True).next().tail = ' | ' navbar.append(A('Previous', href=href)) navbar.iterchildren(reversed=True).next().tail = ' | ' === modified file 'src/calibre/web/fetch/simple.py' --- src/calibre/web/fetch/simple.py 2010-11-04 19:35:23 +0000 +++ src/calibre/web/fetch/simple.py 2010-11-28 13:24:14 +0000 @@ -7,7 +7,7 @@ Fetch a webpage and its links recursively. The webpages are saved to disk in UTF-8 encoding with any charset declarations removed. 
''' -import sys, socket, os, urlparse, re, time, copy, urllib2, threading, traceback +import sys, socket, os, urlparse, re, time, copy, urllib2, threading, traceback, hashlib from urllib import url2pathname, quote from httplib import responses from PIL import Image @@ -334,7 +334,7 @@ self.log.exception('Could not fetch image ', iurl) continue c += 1 - fname = ascii_filename('img'+str(c)) + fname = ascii_filename(hashlib.sha1(data).hexdigest()) if isinstance(fname, unicode): fname = fname.encode('ascii', 'replace') imgpath = os.path.join(diskpath, fname+'.jpg') # Begin bundle IyBCYXphYXIgcmV2aXNpb24gYnVuZGxlIHY0CiMKQlpoOTFBWSZTWVUyKlEACCX/gARUQABa7//3 f+dWjr////BgDY7oqd3rnd765strJWyvc1dQG+c+g925u73mp7e72wWR1m7bwkomk9GppqeTJTye gT0aTKDR6RtTJ6j1ABpo0CSQCaaBMpPFE9HqJ6j2qHlP1E9TEAMgAA0AIpknqnlMjI0BoDT0CMmQ 0ZAaZAJESZE0mCaTwSJsNI9SGIB6mjEaAZHqCKRU9NNE9A0mIZqnpNo1PUGhp6Q0AAAAkkE1MARi aaKeVPaj1TzRTJspppkADIBpGkEv1w2WNxWaTtqODDyMTGLp20v7zjejchsb/1Fd+LvEoZnKHHXr qIw488qO6ZysLu1yz2OnmZwGA8MMt2VEmctJAmMgC2oQiJCG1AtAQ5u50yO0s6Mhqj0WwKX4Y34Z HERfeWr0oSSkEHHScltFjwN3YL2izCabSTbYm0m2xeX4CStojy1KHKF0OrNcGuZZOuNjhq/xQXss x2ijKMuewtdlwgyWi2QOlHd0cxeaWDDsjm8WDzaVmxqthJycQUOqty4tCqPqP0/dP9f40z/cUn/p T2R83BcC0KOEvc6UdYA8njGpHHdtGJ9YRAuEIP2UYYS+JhPSjLF2yMi5B1IH16fdGZ92cxA8mGeQ eNxIq1YeYFh1RooGISjCgNXfcLJ0M35RAnXHOhBh40QUbtbOrjMpplOBJh8+Ajkz6dGnWyrnY5gE pOyRnQWKtjOmq6fBLbtsBEsOyJE6+enoXEUo4UZGMgxiNP3dNhVwQRAReIepyaDTqJn3m7P5nO2u sUCtiNVU3FYaXap7Wji6FtB3AzKcLXtVjmhNGaDHWv6oP0au8vHkjJWYT18LbcIfYaabPdtXIrdm SRn8K7C5r2o9kGNxPKFSNjWmjYG+5vYlbv9rFThGjZ+OvZzllHultghd67xxlrTlcEEbmVkkROXU jF9GEfD1dSt8qLmsDoKynK1s6L96nVeyu7PNRXTTR49/2cDZ03FR4IoZCcQJODAxEefniHDUSfxT iliNVJwISVsQpFcbspgc5x7PRy2oduHE4lvdGLe0UaES1DSgkQuUXNCpHjELMRUi5DFVvlAAIISv G5+FtoSjHEJKhRMPJ+tCYtEweSOS86kUxMJzKNi43FVm3+q+xUq9LAghWpEEf7aheclAtKixCnIk Sa9P+XdJQHLhS+naCWRj0Rka0Ml0b1S5trf4p0otpq1oWIAbbjEYdgcdxyLrtJeQ0hVgxYFi85JK dJfGDIaUg8mZnC9sRCpKyaa2E+OmW/ZrNZXq90p0lyL4RAm/czTqkpC4pSZRrCvpPEY7UOsxMdr9 
tiJE44AQq6gz2mNUU2Z4sUHyxl3EYyRe8TXBsQmaRciHBC1ikbxgpvfdw28mGPakm1aOj0+ziDh8 tuuGVamjAihA22dOZM2YDFAx2GV1kJFlWj8SNr6U1GZkNMXxqSUnWjCIpDHcaFUox12lz9o1DyRx dAQqTcGSZSV+gD0jFdgJJ5cYlxLB+d5LkBr60lzUjMqyvfbEdUhmoL43xoee4ckmY8aipEDgaygl iXGqxqvkPmfyHpLCcheQWMgIFpKkhEsYnkHF51sFDIjAiZkDXtSUGAfSmuLLA0dXDguyyktLRLpB 5OLokYIVqpKk5ueA7dKvtYM4YpYB5gYEbIyIV8ukxRqDACwuxhWk3JU6i+BtLWccFoB5l2c99HPh Ubm2bq3sTQlPFa0LVSIWC3DC/E2ScBnNUv1p7gJKF/QVQuZaN+8eJJchUqkkai2w0JwrAtHUIVpm UOIzDyA4yfCRGdtfSsPjEHnoJAw0BpogMQlzpxoOCcM66biDU1QaFhriSExdqMVLsic7NoznyfKM GuWY5CqQzpMvHXDWLpwUULMl4NhV2E8jymg4RmjHzJK4gQHQL874U1kc8QILpQbdMthBtgD4FShE mEsNopLNgLh1xYtQvveNc84bnaEZspbkJjKUi2bA41wKrjAsMOms+kCzFhqLMGZmxgYwIIeOhNaD hxmzTVb3DZ0DqTFmPojDo1HAqhZMNBU1layBxPWQLR00CRuXooVG+bSb0ny+u8nzHCqtHWjiJHsN o6lgJshWLgKkCGSOoEo2hnQr0i/AZ3gbkl3jFiek4bEmMctMh7plj9YNjgdSi4noFnA0mdWAxSFh gviCP6gzsXgF0jaGk2N9/m/wRCF8gRT4yC+ILAMfkB9GYeXaV9zHDg1DYhK/5eAGBAHwBHu94I1z OBoCJAnq6fACV4BWLPHkQPrO7d5C7OrzztppiYzBbzNbspDXb6I+2kBlTbZcUoe3KoLkj8rvZQFN ss0L7BxT5ppERha98PUdA7joarRC9SFchN7CTynLlFuKHknDJxhyMCFfMSKacBYQSOB9PZkSWeIC tGPKaRLQOgPfQvfTiSDez5TBq8O9tbjWZp0cXFzBSkQPh4mP64/nnzMWQyF+czghAOQ+S+SF6FzO zV0MDqvA6n0bH+EBi4tKhmT0K4dEvmDIJCj0mjdPGI5jASTZaVvqRSy1HT4MhZRyI3tJHwGJtpmY klZvIGcBojnXrYN+1FpQVM1VRvnpN+ezYazHwWRn6NVA8nYXfUYljPaQTsGpkbmOqfBPR2IVMThI u7ELI2OOEaBhomZmbCxBgVCJyUCiRCSSJytjo0xehW0HfcaIWw4yMkkPGMDgOIGsAuA5gfGHUrGX wsn2b3YIVJVvpPnN/ezHU3jDS6+smWd9ZPFeQzFt0w6UkXJQMUHAZDD5yzLQ0eDzTm65YRbd1XQS nOOJtZFNsN3WLGLHK9iLRhxklkkdN+4NXeJcRje483ISF2cvdPONYrLpVQhpO5IQyGPD3SaVwHmg E4TUa67SZfKALShRqKeK4arZFWCQ5Uf7nvnsITP34IcurKo0WIXC8fCZwi2dsGGhgW2zsyLLOaos t+OgcJCkUKifJ1pqYJOs19HnRQv1Pu0AWV1ADpVFoBa2EHvtAJHALS9j1eni2CF4ZHU1+fv7jQx7 R7HRwDjy1EjcHPArYse2RQCJuOmp9xsHtd2w+Rx3iH1SFxzcG/s4AMKjT/hxyfOMDih3K1z8eyhC +RSRQTLJ0Gg2y0OHjRuStV5ukDq3AbbpTrmwx4okQh5JKx8EBVdMl463Zd0h9TI9+bNQvLfOjVoC DoGC82m1Xa7M3W5uUQItKhumsXJ1gEzEKNKmImBq4gamz3y8rhCvAeipUmLGjc0X8kjhfGBmzunV 
cYqmrHCdZqf+G1dp2FBRsTIW7OusRpeRBOXMcxciCfvL4o5dvdv2IvasCozTNjlWdJe0BxrkQ00b 6MYFimcDtApyj01AbQMIOYkBu380DbHmzSKIICiiBNindATehZNBjghetC5kceJWP+Y19TI4A/MC YtN8j2Nad6SqTRCYi6ZJYAOOrOyUgvSkwotuHj2OJvRyqRmTAsKgvTmyklb4l1lQQ2yEhswrS6Vg KXlvNVFgkJzkww6AGR2x2yDEnqpgJkLzkh6pQgq8fi9BBkjrM8UxfI0xZhyaFTz0rct5NHjxdtMb KPPOe9FffOHNhjWgPsOo9Weuati2JiGtI5tp/MEo0HQJhUGjAzK+6Yyh67APEAyMToBx2Hq50dp3 uE4RO7lUg9U4GvEGNjVwQSO3xZ6lYu04V2nEQqdA7k0aRLOwfzVKnOBqhGeKvNSU1Iml2p5KJFB+ qFN8MXDfBeWviQAmkGHsPAcVr5ojvdqdB4F2mnNy4b7HBMl5XGrVlPsFc8lSrDxCveMPTIEhYQfB qSmCEwlliks6i0ECH9VYW2EEZxjiTYxC9n032y9KyydJ6cI1rct5A2xvhv54g9TTa2iyl/UkLsOx DCoMhMhHrJ05zeCyD6eHKBxQR22bMhgRzpZAHYozUuTZXZ78NMsQyTxP1bilKbIqUIVLGMMCqhIV xcqnK+8qkKJEh1J6KCoCnIuTjuIDnQSFMtwOHVVigSJArEXkGYRBLC2BEhE5SAjDcEdDYrwAqtaX UxqU3BvIcFkwyI1yEqcLLEr6A39Nej8i39xBTzpn21oWEzPM1HPUxUkwyNpk9BodQPaNiBAAlshS TnLp0+ihlxQtCpOHhq4gNe14uuBL3RIfkQwQLz5Q0HKck1ZpsrAxaaPdbkm7Wq1KZWUH/F3JFOFC QVTIqUQ= Steffen |
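The core idea of the plumber.py part of the patch is: if the recipe result already has the extension of the requested output, copy it instead of converting it. Stripped of the calibre internals, that check can be sketched as below. The function name `copy_if_same_format` is invented for this illustration; it mirrors the extension comparison in the patch but is not calibre code.

```python
import os
import shutil

def copy_if_same_format(input_path, output_path):
    """Copy input to output unchanged when both have the same extension.

    Returns True if the file was copied, False if a real conversion
    is still needed. The comparison ignores case, as in the patch.
    """
    in_ext = os.path.splitext(input_path)[1].lower()
    out_ext = os.path.splitext(output_path)[1].lower()
    if in_ext and in_ext == out_ext:
        shutil.copyfile(input_path, output_path)
        return True
    return False
```

A caller would run the normal conversion pipeline only when this returns False.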
02-25-2011, 03:15 PM | #5 |
Member
Posts: 18
Karma: 36
Join Date: Feb 2011
Device: Kindle
|
Improved version of the recipe
Hi,
May I suggest the following improved version of the "Die Zeit" recipe. General improvements include the correct handling of dashes and the download of the correct cover based on the front page of the newspaper; while improvements specifically for Kindle users include the removal of the empty left margin as well as the conversion of subscript numbers to non-subscripted but smaller numbers (the Kindle does not render the subscript unicode characters). The latter is important, for example, when articles talk about CO2. Here's the new recipe: Code:
#!/usr/bin/env python
# -*- coding: utf-8 mode: python -*-

__license__ = 'GPL v3'
__copyright__ = '2010-2011, Steffen Siebert <calibre at steffensiebert.de>'
__docformat__ = 'restructuredtext de'
__version__ = '1.2'

"""
Die Zeit EPUB
"""

import os, urllib2, zipfile, re, string
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile
from calibre import walk

class ZeitEPUBAbo(BasicNewsRecipe):

    title = u'Die Zeit'
    description = u'Das EPUB Abo der Zeit (needs subscription)'
    language = 'de'
    lang = 'de-DE'

    __author__ = 'Steffen Siebert, revised by Tobias Isenberg (with some code by Kovid Goyal)'
    needs_subscription = True

    conversion_options = {
        'no_default_epub_cover' : True,
        # fixing the wrong left margin
        'mobi_ignore_margins' : True,
    }

    preprocess_regexps = [
        # filtering for correct dashes
        (re.compile(r' - '), lambda match: ' – '), # regular "Gedankenstrich"
        (re.compile(r' -,'), lambda match: ' –,'), # "Gedankenstrich" before a comma
        (re.compile(r'(?<=\d)-(?=\d)'), lambda match: '–'), # number-number
        # filtering for unicode characters that are missing on the Kindle,
        # try to replace them with meaningful work-arounds
        (re.compile(u'\u2080'), lambda match: '<span style="font-size: 50%;">0</span>'), # subscript-0
        (re.compile(u'\u2081'), lambda match: '<span style="font-size: 50%;">1</span>'), # subscript-1
        (re.compile(u'\u2082'), lambda match: '<span style="font-size: 50%;">2</span>'), # subscript-2
        (re.compile(u'\u2083'), lambda match: '<span style="font-size: 50%;">3</span>'), # subscript-3
        (re.compile(u'\u2084'), lambda match: '<span style="font-size: 50%;">4</span>'), # subscript-4
        (re.compile(u'\u2085'), lambda match: '<span style="font-size: 50%;">5</span>'), # subscript-5
        (re.compile(u'\u2086'), lambda match: '<span style="font-size: 50%;">6</span>'), # subscript-6
        (re.compile(u'\u2087'), lambda match: '<span style="font-size: 50%;">7</span>'), # subscript-7
        (re.compile(u'\u2088'), lambda match: '<span style="font-size: 50%;">8</span>'), # subscript-8
        (re.compile(u'\u2089'), lambda match: '<span style="font-size: 50%;">9</span>'), # subscript-9
    ]

    def build_index(self):
        domain = "http://premium.zeit.de"
        url = domain + "/abovorteile/cgi-bin/_er_member/p4z.fpl?ER_Do=getUserData&ER_NextTemplate=login_ok"

        browser = self.get_browser()
        browser.add_password("http://premium.zeit.de", self.username, self.password)

        try:
            browser.open(url)
        except urllib2.HTTPError:
            self.report_progress(0,_("Can't login to download issue"))
            raise ValueError('Failed to login, check your username and password')

        response = browser.follow_link(text="DIE ZEIT als E-Paper")
        response = browser.follow_link(url_regex=re.compile('^http://contentserver.hgv-online.de/nodrm/fulfillment\\?distributor=zeit-online&orderid=zeit_online.*'))

        tmp = PersistentTemporaryFile(suffix='.epub')
        self.report_progress(0,_('downloading epub'))
        tmp.write(response.read())
        tmp.close()

        zfile = zipfile.ZipFile(tmp.name, 'r')
        self.report_progress(0,_('extracting epub'))

        zfile.extractall(self.output_dir)

        tmp.close()
        index = os.path.join(self.output_dir, 'content.opf')

        self.report_progress(1,_('epub downloaded and extracted'))

        # doing regular expression filtering
        for path in walk('.'):
            (shortname, extension) = os.path.splitext(path)
            if extension.lower() in ('.html', '.htm', '.xhtml'):
                with open(path, 'r+b') as f:
                    raw = f.read()
                    raw = raw.decode('utf-8')
                    for pat, func in self.preprocess_regexps:
                        raw = pat.sub(func, raw)
                    f.seek(0)
                    f.truncate()
                    f.write(raw.encode('utf-8'))

        # adding real cover
        self.report_progress(0,_('trying to download cover image (titlepage)'))
        self.download_cover()
        self.conversion_options["cover"] = self.cover_path

        return index

    # getting url of the cover
    def get_cover_url(self):
        try:
            inhalt = self.index_to_soup('http://www.zeit.de/inhalt')
            cover_url = inhalt.find('div', attrs={'class':'singlearchive clearfix'}).img['src'].replace('icon_','')
        except:
            cover_url = 'http://images.zeit.de/bilder/titelseiten_zeit/1946/001_001.jpg'
        return cover_url
Tobias |
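To see what the dash and subscript filters do without running calibre, here is a small standalone sketch. It is not the recipe itself: the names `DASH_RULES`, `small_digit` and `clean` are made up for this example, and the ten subscript patterns are folded into one character class. The replacement text is the same as in the recipe.

```python
# -*- coding: utf-8 -*-
import re

# Same three dash patterns as in the recipe; \u2013 is the en dash.
DASH_RULES = [
    (re.compile(u' - '), u' \u2013 '),            # regular "Gedankenstrich"
    (re.compile(u' -,'), u' \u2013,'),            # "Gedankenstrich" before a comma
    (re.compile(u'(?<=\\d)-(?=\\d)'), u'\u2013'), # number-number
]

# Subscript digits 0-9 occupy U+2080 to U+2089.
SUBSCRIPT = re.compile(u'[\u2080-\u2089]')

def small_digit(match):
    """Replace one subscript digit with a plain digit in a smaller font."""
    digit = ord(match.group(0)) - 0x2080
    return u'<span style="font-size: 50%%;">%d</span>' % digit

def clean(text):
    """Apply the dash rules, then the subscript work-around."""
    for pattern, replacement in DASH_RULES:
        text = pattern.sub(replacement, text)
    return SUBSCRIPT.sub(small_digit, text)
```

For example, `clean(u'CO\u2082 - 1990-2010')` turns the subscript two into a small "2" and both hyphens into en dashes.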
02-25-2011, 03:25 PM | #6 |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Hi Tobias,
as your changes are fixing the appearance of mobipocket (kindle) output, I think they don't belong on the input side (the recipe) but in the epub to mobipocket conversion (though I have no idea if and how these changes can be included there). My preference is to retrieve the existing epub without any modifications. Ciao, Steffen |
02-25-2011, 03:33 PM | #7 |
Member
Posts: 18
Karma: 36
Join Date: Feb 2011
Device: Kindle
|
Well, some modifications are of a general nature (dashes, cover). The change for the left margin is a conversion setting that only matters for the mobi output anyway. So I think these three are fine.
I agree, however, the subscript changes are only significant for the mobipocket output, but I also do not know how to do this only for the mobipocket conversion. I am not familiar enough with Python and the recipe code to be able to do this. If someone has a suggestion, please go ahead. However, for now, this solves a lot of the problems I had with the recipe and thus I wanted to post it. About your preference, I understand that you want the unmodified ePub but not all e-book readers understand this format, so the conversion approach also has its merit. Cheers, Tobias |
02-25-2011, 03:45 PM | #8 | |||
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Why should we modify the dashes for epub readers which can display the original ones?
Why should we modify the custom epub cover which "Die Zeit" intentionally includes because it can be read on a mobile device while the text in the original cover is too small to read? Quote:
Quote:
Quote:
Ciao, Steffen |
|||
02-25-2011, 03:53 PM | #9 | ||
Member
Posts: 18
Karma: 36
Join Date: Feb 2011
Device: Kindle
|
Quote:
Quote:
Cheers, Tobias |
||
02-25-2011, 04:04 PM | #10 |
Member
Posts: 18
Karma: 36
Join Date: Feb 2011
Device: Kindle
|
Hi Steffen,
About the dashes, I would encourage you to try the new recipe version and compare it to your direct ePub version. I would be interested to know if you agree that the typography is improved. Cheers, Tobias |
02-26-2011, 06:57 PM | #11 |
Junior Member
Posts: 5
Karma: 10
Join Date: Feb 2011
Device: Kindle
|
Hi,
I registered at www.zeit.de and tried to use this recipe with the username and password I created at the site, however I get an error message saying that both the user and password are incorrect. How do I use this recipe? Herzlichen Dank, Alex |
02-27-2011, 05:12 AM | #12 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Do you have a paid subscription for "Die Zeit"? Can you manually download the epub from the archive webpage? If you can't, you don't have the right type of account. Ciao, Steffen |
|
02-27-2011, 10:24 AM | #13 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
To be honest, it doesn't matter to me whether the change improves the typography or not. My point is that Die Zeit doesn't provide an archive of its EPUB issues, so after one week the current issue is gone from their servers for good. So I have to rely on calibre to collect an archive of all issues for me. As I can never be sure that the modifications or improvements made by calibre won't break the content in the future, the only safe solution is to download and store the EPUB exactly as is, and to make any desired modification by converting the original to another format or to a copy (while keeping the original file). I created a patch for calibre together with a matching recipe which allows me to do just that: download the unmodified EPUB with calibre. Unfortunately Kovid refused to incorporate the patch, even though that feature was on his to-do list. So in the end it doesn't matter what the Zeit recipe delivered with calibre does or doesn't do, as I won't use it. But I'm still convinced that optimizations necessary for a different target format (mobipocket) should be done during the conversion to that format and not in the download recipe, which creates a completely different format (EPUB). Ciao, Steffen |
|
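For anyone who wants the same kind of archive outside of calibre, the "store it exactly as is" idea comes down to writing the downloaded bytes to disk untouched and never overwriting an existing issue. A minimal sketch follows, with the caveat that the function name `archive_epub`, the file naming scheme and the checksum are my own assumptions for illustration and not part of the patch or the recipe.

```python
import hashlib
import os

def archive_epub(data, archive_dir, issue_label):
    """Store the downloaded bytes unchanged.

    Returns (path, sha1 hex digest). The digest can be kept to verify
    later that the archived file is still byte-identical.
    """
    if not os.path.isdir(archive_dir):
        os.makedirs(archive_dir)
    path = os.path.join(archive_dir, 'die_zeit_%s.epub' % issue_label)
    if os.path.exists(path):
        # An archived original is never replaced.
        raise IOError('Refusing to overwrite archived issue: %s' % path)
    with open(path, 'wb') as f:
        f.write(data)
    return path, hashlib.sha1(data).hexdigest()
```

Any conversion for a particular reader would then work on a copy of the archived file.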
03-02-2011, 10:16 AM | #14 |
Member
Posts: 18
Karma: 36
Join Date: Feb 2011
Device: Kindle
|
Hi Steffen,
It indeed seems that we simply have entirely different goals with the recipe. I completely agree that an archive would be great, and I can understand your point about having the patch that would allow Calibre to do this. In contrast, I use Calibre just to read the current issue on my Kindle and was very grateful for your recipe (the version currently available in Calibre), because it allows me to do just that. The only problems were a bunch of issues, some of which the adjusted recipe solves. That's why I provided it here for others to use if they want. But I also agree with you that some parts are not implemented well, and that the output processing for the shortcomings of the mobi/Kindle format (subscripts) should indeed be handled in the respective module, maybe with the possibility to enable or disable it in the conversion options (like the left margin). Maybe Kovid can add this functionality eventually. Cheers, Tobias |
03-02-2011, 10:24 AM | #15 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
|