04-16-2015, 02:20 PM | #16 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Wirtschaftswoche Title Image broken for a while
Hey ho,
the code snippet to get the title image for Wirtschaftswoche is currently: Code:
cover_source = 'http://www.wiwo-shop.de/wirtschaftswoche/wirtschaftswoche-emagazin-p1952.html'
[...]
def get_cover_url(self):
    cover_source_soup = self.index_to_soup(self.cover_source)
    preview_image_div = cover_source_soup.find(attrs={'class': 'container vorschau'})
    return 'http://www.wiwo-shop.de' + preview_image_div.a.img['src']
(from: https://kaufhaus.handelsblatt.com/do...zin-p1952.html) Code:
<div class="carousel-inner">
  <figure class="active item">
    <img src="https://kdww.cekom.de/images/lrn/spacer.gif"
         style="background: transparent url(https://kdww.cekom.de/images//WW_titel_16-w454-h298-ar.jpg) center center no-repeat;"
         title="WirtschaftsWoche eMagazin" alt="WirtschaftsWoche eMagazin">
  </figure>
</div>
or alternatively I take this: (from: http://www.wiwo.de) Code:
<div data-vr-zone="Das Aktuelle Heft" class="hcf-mcol-box">
  <div class="hcf-content hcf-mcol-box-content hcf-decorated-box">
    <div class="hcf-morewiwo-content" data-vr-contentbox="">
      <div class="hcf-wiwo-image">
        <a title="Wirtschaftswoche" target="_blank" href="http://abo.wiwo.de/">
          <img border="0" alt="Wirtschaftswoche" src="http://www.wiwo.de/images/wirtschaftswoche-cover-16-2015/10019036/46-formatOriginal.gif"/>
        </a>
      </div>
      <div class="hcf-recent-wiwo">
        <h4 class="hcf-teaser-text">WirtschaftsWoche 16 vom 13.4.2015</h4>
I would prefer the first source (slightly better resolution). However, in both cases I fail to make the necessary adaptations to the soup section. Does anyone have a hot tip for me? Thanks Hegi. |
04-16-2015, 11:55 PM | #17 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Something like
Code:
style = soup.find('img', alt='WirtschaftsWoche eMagazin', style=True)['style']
self.cover_url = style.partition('(')[-1].rpartition(')')[0]
|
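For anyone puzzling over the partition/rpartition chain: it simply slices out whatever sits between the first '(' and the last ')' of the style attribute. A standalone sketch, using the style string from the shop-page markup quoted above:

```python
# Extract the image URL from a CSS "background: ... url(...)" style attribute.
style = ('background: transparent url(https://kdww.cekom.de/images//WW_titel_16-w454-h298-ar.jpg) '
         'center center no-repeat;')

# partition('(') splits at the FIRST '(' -> [-1] is everything after it;
# rpartition(')') splits at the LAST ')' -> [0] is everything before it.
cover_url = style.partition('(')[-1].rpartition(')')[0]
print(cover_url)
# -> https://kdww.cekom.de/images//WW_titel_16-w454-h298-ar.jpg
```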
04-17-2015, 01:52 PM | #18 | |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Dear Kovid,
thanks for your quick reply. However, as it appears, I still don't get it right. I took your snippet and made this from it: Code:
cover_source = 'https://kaufhaus.handelsblatt.com/downloads/wirtschaftswoche-emagazin-p1952.html'

def get_cover_url(self):
    cover_source_soup = self.index_to_soup(self.cover_source)
    style = soup.find('img', alt='WirtschaftsWoche eMagazin', style=True)['style']
    self.cover_url = style.partition('(')[-1].rpartition(')')[0]
    return self.cover_url
Quote:
As far as my limited understanding of these issues goes, I have to embed this code somehow into a "def ... return ..." section. But maybe your suggestion was meant in a different way? Thanks again. Hegi. |
|
04-17-2015, 09:49 PM | #19 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Change
cover_source_soup = self.index_to_soup(self.cover_source)
to
soup = self.index_to_soup(self.cover_source) |
04-18-2015, 05:03 PM | #20 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
revised WirtschaftsWoche Online Recipe
Thanks again Kovid!
Although I am a great enthusiast for Indian food, this soup stuff always seems to get me. Here's the revised recipe to be used for further updates: Code:
__license__ = 'GPL v3'
__copyright__ = '2013, Armin Geller'

'''
Fetch WirtschaftsWoche Online
'''
import re
# import time
from calibre.web.feeds.news import BasicNewsRecipe


class WirtschaftsWocheOnline(BasicNewsRecipe):
    title = u'WirtschaftsWoche Online'
    __author__ = 'Armin Geller'  # Update AGE 2013-01-05; Modified by Hegi 2013-04-28
    description = u'Wirtschaftswoche Online - basierend auf den RRS-Feeds von Wiwo.de'
    tags = 'Nachrichten, Blog, Wirtschaft'
    publisher = 'Verlagsgruppe Handelsblatt GmbH / Redaktion WirtschaftsWoche Online'
    category = 'business, economy, news, Germany'
    publication_type = 'weekly magazine'
    language = 'de_DE'
    oldest_article = 7
    max_articles_per_feed = 100
    simultaneous_downloads = 20
    auto_cleanup = False
    no_stylesheets = True
    remove_javascript = True
    remove_empty_feeds = True

    # don't duplicate articles from "Schlagzeilen" / "Exklusiv" to other rubrics
    ignore_duplicate_articles = {'title', 'url'}

    # if you want to reduce size for a b/w or E-ink device, uncomment this:
    compress_news_images = True
    # compress_news_images_auto_size = 16
    scale_news_images = (400, 300)
    compress_news_images_max_size = 35

    timefmt = ' [%a, %d %b %Y]'

    conversion_options = {'smarten_punctuation': True,
                          'authors': publisher,
                          'publisher': publisher}
    encoding = 'UTF-8'

    cover_source = 'https://kaufhaus.handelsblatt.com/downloads/wirtschaftswoche-emagazin-p1952.html'
    masthead_url = 'http://www.wiwo.de/images/wiwo_logo/5748610/1-formatOriginal.png'

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source)
        style = soup.find('img', alt='WirtschaftsWoche eMagazin', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url

    # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
    # If you use .epub format you could also do this as extra_css
    # '.hcf-location-mark:after {content: ". "}'
    preprocess_regexps = [
        (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                    re.DOTALL | re.IGNORECASE),
         lambda match: match.group(1) + '. ' + match.group(2)),
    ]

    extra_css = 'h1 {font-size: 1.6em; text-align: left} \
                 h2 {font-size: 1em; font-style: italic; font-weight: normal} \
                 h3 {font-size: 1.3em; text-align: left} \
                 h4, h5, h6, a {font-size: 1em; text-align: left} \
                 .hcf-caption {font-size: 1em; text-align: left; font-style: italic} \
                 .hcf-location-mark {font-style: italic}'

    keep_only_tags = [
        dict(name='div', attrs={'class': ['hcf-column hcf-column1 hcf-teasercontainer hcf-maincol']}),
        dict(name='div', attrs={'id': ['contentMain']}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class': ['hcf-link-block hcf-faq-open', 'hcf-article-related']}),
    ]

    feeds = [
        (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'),
        (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
        # (u'Themen', u'http://www.wiwo.de/contentexport/feed/rss/themen'),  # AGE no print version
        (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'),
        (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'),
        (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'),
        (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'),
        (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'),
        # (u'Green-WiWo', u'http://green.wiwo.de/feed/rss/')  # AGE no print version
    ]

    def print_version(self, url):
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id
CU Hegi. |
03-04-2018, 05:50 AM | #21 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Wiwo.de Website Relaunch
Hey there,
there was a complete website relaunch at wiwo.de a fortnight ago and now nothing really works anymore ... Grrrrr! At least the feeds appear to be the same: Code:
feeds = [
    (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'),
    (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
    # (u'Themen', u'http://www.wiwo.de/contentexport/feed/rss/themen'),  # AGE no print version
    (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'),
    (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'),
    (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'),
    (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'),
    (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'),
    # (u'Green-WiWo', u'http://green.wiwo.de/feed/rss/')  # AGE no print version
]
However, the major change seems to be that there are no longer print versions of the pages to be used for extracting. The code snippet that handled that so far was: Code:
def print_version(self, url):
    main, sep, id = url.rpartition('/')
    return main + '/v_detail_tab_print/' + id
Any suggestions or hints are most welcome. Thanks Hegi. |
03-04-2018, 06:02 AM | #22 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Remove the entire print_version function.
|
03-04-2018, 06:31 AM | #23 |
Wizard
Posts: 1,161
Karma: 1404241
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
Hi hegi,
this is my current version. Take what you need for your recipe. DD |
03-04-2018, 11:48 AM | #24 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Thanks Kovid,
wasn't sure I could remove the whole lot. I'm a fair bit further down the road, but unfortunately not quite there yet ... Here are some of the issues I'm battling with: From you I got the code snippet that adds a "." after the "hcf-location-mark" class: Code:
preprocess_regexps = [
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
]
preprocess_regexps = [
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
    (re.compile(r'(<span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + ': ' + match.group(2)),
]
Just checked a few articles. The good news is, I can call all articles as an "-all" version, e.g. the article being https://www.wiwo.de/erfolg/managemen.../21022938.html Then I'd need something analogous to the print-version thing to add this to the URL, making it ".../21022938-all.html". My best guess would be something like: Code:
def one_page_version(self, url):
    main, sep, tail = url.rpartition('.')
    return main + '-all.' + tail
And then there is an annoying bit. I want to get rid of the embedded social media stuff. The code looks like this: Code:
<div class="o-article__element">
  <div class="c-socialshare u-margin-xxl ">
    <h3 class="c-socialshare__headline u-margin-xl u-font-bold u-font-m-sm">
      Diesen Artikel teilen:
    </h3>
    <a class="ajaxify c-socials__item c-socials__item--facebook " title="Auf Facebook teilen"
       data-command='{"trackSocial": "Facebook", "socialshare": {"provider": "facebook", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "https://www.facebook.com/sharer/sharer.php?display=popup&u=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dfb"}}'>
      <span class="c-socials__icon c-socials__icon--facebook">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--facebook"></i></span>
      </span>
      <span class="c-socials__text">Facebook</span>
    </a>
    <a class="ajaxify c-socials__item c-socials__item--twitter " title="Auf Twitter teilen"
       data-command='{"trackSocial": "Twitter", "socialshare": {"provider": "twitter", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "https://twitter.com/intent/tweet?url=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dtwitter&text=Versicherungsbranche: Axa%20will%20offenbar%20US-Konkurrenten%20XL%20Group%20kaufen&hashtags="}}'>
      <span class="c-socials__icon c-socials__icon--twitter">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--twitter"></i></span>
      </span>
      <span class="c-socials__text">Twitter</span>
    </a>
    <a class="ajaxify c-socials__item c-socials__item--xing " title="Auf Xing teilen"
       data-command='{"trackSocial": "Xing", "socialshare": {"provider": "xing", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "https://www.xing-share.com/app/user?op=share;sc_p=xing-share;url=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dxing"}}'>
      <span class="c-socials__icon c-socials__icon--xing">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--xing"></i></span>
      </span>
      <span class="c-socials__text">Xing</span>
    </a>
    <a class="ajaxify c-socials__item c-socials__item--whatsapp u-desktop-hidden hidden-md-up" title="Per Whatsapp teilen"
       data-command='{"trackSocial": "Whatsapp", "socialshare": {"provider": "whatsapp", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "whatsapp://send?text=Versicherungsbranche: Axa%20will%20offenbar%20US-Konkurrenten%20XL%20Group%20kaufen https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dwhatsapp"}}'>
      <span class="c-socials__icon c-socials__icon--whatsapp">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--whatsapp"></i></span>
      </span>
      <span class="c-socials__text">Whatsapp</span>
    </a>
    <a class="ajaxify c-socials__item c-socials__item--mail " title="Per Mail teilen"
       data-command='{"trackSocial": "Mail", "socialshare": {"provider": "mail", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "mailto: ?subject=Versicherungsbranche: Axa%20will%20offenbar%20US-Konkurrenten%20XL%20Group%20kaufen - WirtschaftsWoche&body=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dmail"}}'>
      <span class="c-socials__icon c-socials__icon--mail">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--mail"></i></span>
      </span>
      <span class="c-socials__text">Mail</span>
    </a>
  </div>
</div>
Code:
remove_tags = [
    dict(name='div', attrs={'class': ['c-socialshare__headline', 'c-socials__item', 'c-pagination u-flex ajaxify', 'u-font-bold']}),
]
Thanks folks! Hegi. |
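As a side note, the rpartition idea in the one_page_version sketch above does exactly the wanted rewrite. Here it is as a standalone function; the full article path is an assumption for illustration, since the posted URL is truncated:

```python
def one_page_url(url):
    # Split at the LAST '.' so only the '.html' extension is separated off,
    # then re-join with '-all.' spliced in before it.
    main, sep, tail = url.rpartition('.')
    return main + '-all.' + tail

print(one_page_url('https://www.wiwo.de/erfolg/management/21022938.html'))
# -> https://www.wiwo.de/erfolg/management/21022938-all.html
```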
03-04-2018, 01:34 PM | #25 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You should name the method print_version(), not one_page_version().
Your remove_tags should have Code:
dict(attrs={'class': lambda x: x and 'c-socialshare' in x.split()}), |
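For what it's worth, the lambda in that suggestion can be checked on its own: it splits the class attribute string on whitespace, so it matches the exact class name 'c-socialshare' regardless of trailing spaces or neighbouring classes. A standalone sketch, with sample attribute values taken from the markup above:

```python
# The class matcher from the remove_tags suggestion above
matcher = lambda x: x and 'c-socialshare' in x.split()

print(matcher('c-socialshare u-margin-xxl '))           # True: name is in the list
print(matcher('c-socialshare__headline u-margin-xl'))   # False: different class name
print(matcher(None))                                    # falsy: tag has no class attribute
```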
03-13-2018, 04:20 PM | #26 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Thanks Divingduck & Kovid,
... I'm getting better at this one. And slowly things are turning out as neat as I wanted. However, this one thing is bugging me: In the source I have e.g. (look out for the bold tags): Code:
<h2 class="c-headline c-headline--article u-margin-m"><span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Nikolai Gluschkow</span> Russischer Geschäftsmann tot in London entdeckt </h2>
<div class="c-metadata u-margin-xl ">
  <div> </div>
  <time datetime="2018-03-13T19:38:19+01:00">13. März 2018</time>
  <span>, aktualisiert <time datetime="2018-03-13T19:40:57+01:00">13. März 2018, 19:40 Uhr</time> </span>
  <span class="c-metadata__source"> | Quelle: <a href="http://www.handelsblatt.com" target="_blank">Handelsblatt Online</a></span>
</div>
[...]
<div class="o-article__content-element o-article__content-element--richtext">
  <div class="u-richtext ajaxify" data-command='{"richtext": {}}'>
    <p><span class="hcf-location-mark">London</span>Ein mit dem 2013 verstorbenen Oligarchen Boris Beresowski befreundeter russischer Geschäftsmann ist in London tot aufgefunden worden. Nikolai Gluschkow sei nicht mehr am Leben, sagte Anwalt Andrej Borowkow am Dienstag russischen Medien. Er wisse aber nichts über die Umstände und den Zeitpunkt des Todes des 68-Jährigen.</p>
  </div>
</div>
Code:
preprocess_regexps = [
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
    (re.compile(r'(<span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + ': ' + match.group(2)),
]
Any hints as to what I'm doing wrong here? Thanks a lot Hegi |
03-17-2018, 10:56 AM | #27 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Hey folks,
... sorry, but I'm somehow stuck on this. Picking this tag just does not work with the regex: Code:
<span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Nikolai Gluschkow</span>
I somehow got the impression that this long list of class names with spaces is the problem ... but I have no clue how to go about it. Thanks a lot in advance. Hegi. |
03-18-2018, 06:53 AM | #28 |
Wizard
Posts: 1,161
Karma: 1404241
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
Sorry for my late answer, I was on a trip.
Can't really help. I use class names with spaces too (see the remove_tags statement in my last file). That works for my recipe. You can check with debug what exactly the recipe is doing. Set some print statements in your recipe and pipe all output into a log file. Maybe this will help you find out what is going wrong. I have noticed for WiWo that there are also some classes with spaces at the end of the string and/or combinations with more than one space in the class names. |
03-18-2018, 02:00 PM | #29 | |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Hi Divingduck,
... now this is really interesting. I retrieved the recipe with debugging info via the CLI as follows: Code:
ebook-convert ~/.config/calibre/custom_recipes/WirtschaftsWoche\ Online_1014.recipe .mobi \
    --mobi-file-type=new --output-profile=kindle_pw --debug-pipeline calibre-debug
Code:
<h2 class="c-headline c-headline--article u-margin-m"><span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Wandel kostet Milliarden</span> SUV und China sollen Audi wieder nach vorne bringen </h2> Code:
<h2 class="c-headline"><span class="c-overline">Wandel kostet Milliarden</span> SUV und China sollen Audi wieder nach vorne bringen </h2>
This leads me to changing the preprocess_regexps as follows: Code:
preprocess_regexps = [
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
    (re.compile(r'(<span class="c-overline">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + ': ' + match.group(2)),
]
Code:
<h2 class="c-headline"><span class="c-overline">Wandel kostet Milliarden</span> SUV und China sollen Audi wieder nach vorne bringen </h2>
However, I'm not sure what you mean by Quote:
Thanks again, anyway. Hegi. |
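To sanity-check the shortened regex outside calibre, one can run it through re.sub against the normalized markup from the debug output. A minimal sketch, assuming (as shown above) that calibre has already shortened the class list to c-overline:

```python
import re

# Normalized markup as it appears in calibre's debug output (see above)
html = ('<h2 class="c-headline"><span class="c-overline">Wandel kostet Milliarden</span> '
        'SUV und China sollen Audi wieder nach vorne bringen </h2>')

pattern = re.compile(r'(<span class="c-overline">[^<]*)(</span>)',
                     re.DOTALL | re.IGNORECASE)
result = pattern.sub(lambda m: m.group(1) + ': ' + m.group(2), html)

# The overline text now ends with ': ' just before the closing tag
print(result)
```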
|
03-18-2018, 05:18 PM | #30 |
Wizard
Posts: 1,161
Karma: 1404241
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
This is a batch script I use for analyzing my recipes. I put the recipe and the script in one directory (it first cleans up the old run). After the recipe is finished you will find the console output in a log file and the different conversion stages in the folder debug.
Code:
REM * Remove old debug directory
rmdir debug. /s /q
REM * Delete old recipe and log file
del WirtschaftsWoche.epub
del WirtschaftsWoche.log
REM * Run new recipe in debug mode
ebook-convert WirtschaftsWoche.recipe .epub -vv --debug-pipeline debug > WirtschaftsWoche.log
For checking a variable you can put a print statement like this in your recipe:
print '*** my_variable_main --->:', my_variable_main
This is helpful for checking whether the content you want to modify includes what you expect, or whether a selection is found at all. |
|