04-16-2015, 02:20 PM | #16 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Wirtschaftswoche Title Image broken for a while
Hey ho,
the code snippet to get the title image for Wirtschaftswoche is currently: Code:
cover_source = 'http://www.wiwo-shop.de/wirtschaftswoche/wirtschaftswoche-emagazin-p1952.html'
[...]
def get_cover_url(self):
    cover_source_soup = self.index_to_soup(self.cover_source)
    preview_image_div = cover_source_soup.find(attrs={'class': 'container vorschau'})
    return 'http://www.wiwo-shop.de' + preview_image_div.a.img['src']
(from: https://kaufhaus.handelsblatt.com/do...zin-p1952.html) Code:
<div class="carousel-inner">
  <figure class="active item">
    <img src="https://kdww.cekom.de/images/lrn/spacer.gif"
         style="background: transparent url(https://kdww.cekom.de/images//WW_titel_16-w454-h298-ar.jpg) center center no-repeat;"
         title="WirtschaftsWoche eMagazin" alt="WirtschaftsWoche eMagazin">
  </figure>
</div>
or alternatively I take this: (from: http://www.wiwo.de) Code:
<div data-vr-zone="Das Aktuelle Heft" class="hcf-mcol-box">
  <div class="hcf-content hcf-mcol-box-content hcf-decorated-box">
    <div class="hcf-morewiwo-content" data-vr-contentbox="">
      <div class="hcf-wiwo-image">
        <a title="Wirtschaftswoche" target="_blank" href="http://abo.wiwo.de/">
          <img border="0" alt="Wirtschaftswoche" src="http://www.wiwo.de/images/wirtschaftswoche-cover-16-2015/10019036/46-formatOriginal.gif"/>
        </a>
      </div>
      <div class="hcf-recent-wiwo">
        <h4 class="hcf-teaser-text">WirtschaftsWoche 16 vom 13.4.2015</h4>
I would prefer the first source (slightly better resolution). However, in both cases I fail to make the necessary adaptations to the soup section. Does anyone have a hot tip for me? Thanks Hegi. |
04-16-2015, 11:55 PM | #17 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Something like
Code:
style = soup.find('img', alt='WirtschaftsWoche eMagazin', style=True)['style']
self.cover_url = style.partition('(')[-1].rpartition(')')[0]
|
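For anyone puzzling over the partition/rpartition chain: it simply slices out whatever sits between the first '(' and the last ')' of the style attribute. A standalone sketch, using the style string from the shop-page markup quoted above:

```python
# Extract the image URL from a CSS "background: ... url(...)" style attribute.
style = ('background: transparent url(https://kdww.cekom.de/images//WW_titel_16-w454-h298-ar.jpg) '
         'center center no-repeat;')

# partition('(') splits at the FIRST '(' -> [-1] is everything after it;
# rpartition(')') splits at the LAST ')' -> [0] is everything before it.
cover_url = style.partition('(')[-1].rpartition(')')[0]
print(cover_url)
# -> https://kdww.cekom.de/images//WW_titel_16-w454-h298-ar.jpg
```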
04-17-2015, 01:52 PM | #18 | |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Dear Kovid,
thanks for your quick reply. However, as it appears, I still don't get it right. I took your snippet and made this from it: Code:
cover_source = 'https://kaufhaus.handelsblatt.com/downloads/wirtschaftswoche-emagazin-p1952.html'

def get_cover_url(self):
    cover_source_soup = self.index_to_soup(self.cover_source)
    style = soup.find('img', alt='WirtschaftsWoche eMagazin', style=True)['style']
    self.cover_url = style.partition('(')[-1].rpartition(')')[0]
    return self.cover_url
Quote:
As far as my limited understanding of these issues goes, I have to embed this code somehow into a "def ... return ..." section. But maybe your suggestion was meant in a different way? Thanks again. Hegi. |
|
04-17-2015, 09:49 PM | #19 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Change
cover_source_soup = self.index_to_soup(self.cover_source)
to
soup = self.index_to_soup(self.cover_source) |
04-18-2015, 05:03 PM | #20 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
revised WirtschaftsWoche Online Recipe
Thanks again Kovid!
Although I am a great enthusiast for Indian food, this soup stuff always seems to get me. Here's the revised recipe to be used for further updates: Code:
__license__ = 'GPL v3'
__copyright__ = '2013, Armin Geller'

'''
Fetch WirtschaftsWoche Online
'''
import re
# import time
from calibre.web.feeds.news import BasicNewsRecipe


class WirtschaftsWocheOnline(BasicNewsRecipe):
    title = u'WirtschaftsWoche Online'
    __author__ = 'Armin Geller'  # Update AGE 2013-01-05; Modified by Hegi 2013-04-28
    description = u'Wirtschaftswoche Online - basierend auf den RRS-Feeds von Wiwo.de'
    tags = 'Nachrichten, Blog, Wirtschaft'
    publisher = 'Verlagsgruppe Handelsblatt GmbH / Redaktion WirtschaftsWoche Online'
    category = 'business, economy, news, Germany'
    publication_type = 'weekly magazine'
    language = 'de_DE'
    oldest_article = 7
    max_articles_per_feed = 100
    simultaneous_downloads = 20
    auto_cleanup = False
    no_stylesheets = True
    remove_javascript = True
    remove_empty_feeds = True

    # don't duplicate articles from "Schlagzeilen" / "Exklusiv" to other rubrics
    ignore_duplicate_articles = {'title', 'url'}

    # if you want to reduce size for a b/w or E-ink device, uncomment this:
    compress_news_images = True
    # compress_news_images_auto_size = 16
    scale_news_images = (400, 300)
    compress_news_images_max_size = 35

    timefmt = ' [%a, %d %b %Y]'

    conversion_options = {'smarten_punctuation': True,
                          'authors': publisher,
                          'publisher': publisher}
    encoding = 'UTF-8'

    cover_source = 'https://kaufhaus.handelsblatt.com/downloads/wirtschaftswoche-emagazin-p1952.html'
    masthead_url = 'http://www.wiwo.de/images/wiwo_logo/5748610/1-formatOriginal.png'

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source)
        style = soup.find('img', alt='WirtschaftsWoche eMagazin', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url

    # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
    # If you use .epub format you could also do this as extra_css
    # '.hcf-location-mark:after {content: ". "}'
    preprocess_regexps = [
        (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                    re.DOTALL | re.IGNORECASE),
         lambda match: match.group(1) + '. ' + match.group(2)),
    ]

    extra_css = 'h1 {font-size: 1.6em; text-align: left} \
                 h2 {font-size: 1em; font-style: italic; font-weight: normal} \
                 h3 {font-size: 1.3em; text-align: left} \
                 h4, h5, h6, a {font-size: 1em; text-align: left} \
                 .hcf-caption {font-size: 1em; text-align: left; font-style: italic} \
                 .hcf-location-mark {font-style: italic}'

    keep_only_tags = [
        dict(name='div', attrs={'class': ['hcf-column hcf-column1 hcf-teasercontainer hcf-maincol']}),
        dict(name='div', attrs={'id': ['contentMain']}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class': ['hcf-link-block hcf-faq-open', 'hcf-article-related']}),
    ]

    feeds = [
        (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'),
        (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
        # (u'Themen', u'http://www.wiwo.de/contentexport/feed/rss/themen'),  # AGE no print version
        (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'),
        (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'),
        (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'),
        (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'),
        (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'),
        # (u'Green-WiWo', u'http://green.wiwo.de/feed/rss/')  # AGE no print version
    ]

    def print_version(self, url):
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id
CU Hegi. |
03-04-2018, 05:50 AM | #21 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Wiwo.de Website Relaunch
Hey there,
there was a complete website relaunch at wiwo.de a fortnight ago and now nothing really works anymore ... Grrrrr! At least the feeds appear to be the same: Code:
feeds = [
    (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'),
    (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
    # (u'Themen', u'http://www.wiwo.de/contentexport/feed/rss/themen'),  # AGE no print version
    (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'),
    (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'),
    (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'),
    (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'),
    (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'),
    # (u'Green-WiWo', u'http://green.wiwo.de/feed/rss/')  # AGE no print version
]
However, the major change seems to be that there are no longer print versions of the pages to be used for extracting. The code snippet that handled that so far was: Code:
def print_version(self, url):
    main, sep, id = url.rpartition('/')
    return main + '/v_detail_tab_print/' + id
Any suggestions or hints are most welcome. Thanks Hegi. |
03-04-2018, 06:02 AM | #22 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Remove the entire print_version function.
|
03-04-2018, 06:31 AM | #23 |
Wizard
Posts: 1,161
Karma: 1404241
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
Hi hegi,
this is my current version. Take what you need for your recipe. DD |
03-04-2018, 11:48 AM | #24 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Thanks Kovid,
wasn't sure I could remove the whole lot. I'm a fair bit further down the road, but unfortunately not quite there yet ... Here are some of the issues I'm battling with: From you I got the code snippet that adds a "." after the "hcf-location-mark" class: Code:
preprocess_regexps = [
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
]
preprocess_regexps = [
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
    (re.compile(r'(<span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + ': ' + match.group(2)),
]
Just checked a few articles. The good news is, I can call all articles as an "-all" version, e.g. the article being https://www.wiwo.de/erfolg/managemen.../21022938.html Then I'd need something analogous to the print-version thing to add this to the URL, making it ".../21022938-all.html". My best guess would be something like: Code:
def one_page_version(self, url):
    main, sep, tail = url.rpartition('.')
    return main + '-all.' + tail
And then there is an annoying bit. I want to get rid of the embedded social media stuff. The code looks like this: Code:
<div class="o-article__element">
  <div class="c-socialshare u-margin-xxl ">
    <h3 class="c-socialshare__headline u-margin-xl u-font-bold u-font-m-sm">
      Diesen Artikel teilen:
    </h3>
    <a class="ajaxify c-socials__item c-socials__item--facebook " title="Auf Facebook teilen"
       data-command='{"trackSocial": "Facebook", "socialshare": {"provider": "facebook", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "https://www.facebook.com/sharer/sharer.php?display=popup&u=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dfb"}}'>
      <span class="c-socials__icon c-socials__icon--facebook">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--facebook"></i></span>
      </span>
      <span class="c-socials__text">Facebook</span>
    </a>
    <a class="ajaxify c-socials__item c-socials__item--twitter " title="Auf Twitter teilen"
       data-command='{"trackSocial": "Twitter", "socialshare": {"provider": "twitter", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "https://twitter.com/intent/tweet?url=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dtwitter&text=Versicherungsbranche: Axa%20will%20offenbar%20US-Konkurrenten%20XL%20Group%20kaufen&hashtags="}}'>
      <span class="c-socials__icon c-socials__icon--twitter">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--twitter"></i></span>
      </span>
      <span class="c-socials__text">Twitter</span>
    </a>
    <a class="ajaxify c-socials__item c-socials__item--xing " title="Auf Xing teilen"
       data-command='{"trackSocial": "Xing", "socialshare": {"provider": "xing", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "https://www.xing-share.com/app/user?op=share;sc_p=xing-share;url=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dxing"}}'>
      <span class="c-socials__icon c-socials__icon--xing">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--xing"></i></span>
      </span>
      <span class="c-socials__text">Xing</span>
    </a>
    <a class="ajaxify c-socials__item c-socials__item--whatsapp u-desktop-hidden hidden-md-up" title="Per Whatsapp teilen"
       data-command='{"trackSocial": "Whatsapp", "socialshare": {"provider": "whatsapp", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "whatsapp://send?text=Versicherungsbranche: Axa%20will%20offenbar%20US-Konkurrenten%20XL%20Group%20kaufen https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dwhatsapp"}}'>
      <span class="c-socials__icon c-socials__icon--whatsapp">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--whatsapp"></i></span>
      </span>
      <span class="c-socials__text">Whatsapp</span>
    </a>
    <a class="ajaxify c-socials__item c-socials__item--mail " title="Per Mail teilen"
       data-command='{"trackSocial": "Mail", "socialshare": {"provider": "mail", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "mailto: ?subject=Versicherungsbranche: Axa%20will%20offenbar%20US-Konkurrenten%20XL%20Group%20kaufen - WirtschaftsWoche&body=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dmail"}}'>
      <span class="c-socials__icon c-socials__icon--mail">
        <span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--mail"></i></span>
      </span>
      <span class="c-socials__text">Mail</span>
    </a>
  </div>
</div>
Code:
remove_tags = [
    dict(name='div', attrs={'class': ['c-socialshare__headline', 'c-socials__item', 'c-pagination u-flex ajaxify', 'u-font-bold']}),
]
Thanks folks! Hegi. |
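As a side note, the rpartition idea in the one_page_version sketch above does exactly the wanted rewrite. Here it is as a standalone function; the full article path is an assumption for illustration, since the posted URL is truncated:

```python
def one_page_url(url):
    # Split at the LAST '.' so only the '.html' extension is separated off,
    # then re-join with '-all.' spliced in before it.
    main, sep, tail = url.rpartition('.')
    return main + '-all.' + tail

print(one_page_url('https://www.wiwo.de/erfolg/management/21022938.html'))
# -> https://www.wiwo.de/erfolg/management/21022938-all.html
```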
03-04-2018, 01:34 PM | #25 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You should name the method print_version(), not one_page_version().
Your remove_tags should have Code:
dict(attrs={'class': lambda x: x and 'c-socialshare' in x.split()}), |
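For what it's worth, the lambda in that suggestion can be checked on its own: it splits the class attribute string on whitespace, so it matches the exact class name 'c-socialshare' regardless of trailing spaces or neighbouring classes. A standalone sketch, with sample attribute values taken from the markup above:

```python
# The class matcher from the remove_tags suggestion above
matcher = lambda x: x and 'c-socialshare' in x.split()

print(matcher('c-socialshare u-margin-xxl '))           # True: name is in the list
print(matcher('c-socialshare__headline u-margin-xl'))   # False: different class name
print(matcher(None))                                    # falsy: tag has no class attribute
```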
03-13-2018, 04:20 PM | #26 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Thanks Divingduck & Kovid,
... I'm getting better at this one. And slowly things are turning out as neat as I wanted. However, this one thing is bugging me: In the source I have e.g. (look out for the bold tags): Code:
<h2 class="c-headline c-headline--article u-margin-m"><span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Nikolai Gluschkow</span> Russischer Geschäftsmann tot in London entdeckt </h2>
<div class="c-metadata u-margin-xl ">
  <div> </div>
  <time datetime="2018-03-13T19:38:19+01:00">13. März 2018</time>
  <span>, aktualisiert <time datetime="2018-03-13T19:40:57+01:00">13. März 2018, 19:40 Uhr</time> </span>
  <span class="c-metadata__source"> | Quelle: <a href="http://www.handelsblatt.com" target="_blank">Handelsblatt Online</a></span>
</div>
[...]
<div class="o-article__content-element o-article__content-element--richtext">
  <div class="u-richtext ajaxify" data-command='{"richtext": {}}'>
    <p><span class="hcf-location-mark">London</span>Ein mit dem 2013 verstorbenen Oligarchen Boris Beresowski befreundeter russischer Geschäftsmann ist in London tot aufgefunden worden. Nikolai Gluschkow sei nicht mehr am Leben, sagte Anwalt Andrej Borowkow am Dienstag russischen Medien. Er wisse aber nichts über die Umstände und den Zeitpunkt des Todes des 68-Jährigen.</p>
  </div>
</div>
Code:
preprocess_regexps = [
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
    (re.compile(r'(<span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + ': ' + match.group(2)),
]
Any hints as to what I'm doing wrong here? Thanks a lot Hegi |
03-17-2018, 10:56 AM | #27 |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Hey folks,
... sorry, but I'm somehow stuck on this. Picking this tag just does not work with the regex: Code:
<span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Nikolai Gluschkow</span>
I somehow got the impression that this long list of class names with spaces is the problem ... but I have no clue how to go about it. Thanks a lot in advance. Hegi. |
03-18-2018, 06:53 AM | #28 |
Wizard
Posts: 1,161
Karma: 1404241
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
Sorry for my late answer, I was on a trip.
Can't really help. I use class names with spaces too (see the remove_tags statement in my last file). That works for my recipe. You can check with debug what exactly the recipe is doing. Set some print statements in your recipe and pipe all output into a log file. Maybe this will help you find out what is going wrong. I have noticed for WiWo that there are also some classes with spaces at the end of the string and/or combinations with more than one space in the class names. |
03-18-2018, 02:00 PM | #29 | |
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
|
Hi Divingduck,
... now this is really interesting. I retrieved the recipe with debugging info via the CLI as follows: Code:
ebook-convert ~/.config/calibre/custom_recipes/WirtschaftsWoche\ Online_1014.recipe .mobi \
    --mobi-file-type=new --output-profile=kindle_pw --debug-pipeline calibre-debug
Code:
<h2 class="c-headline c-headline--article u-margin-m"><span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Wandel kostet Milliarden</span> SUV und China sollen Audi wieder nach vorne bringen </h2> Code:
<h2 class="c-headline"><span class="c-overline">Wandel kostet Milliarden</span> SUV und China sollen Audi wieder nach vorne bringen </h2>
This leads me to changing the preprocess_regexps as follows: Code:
preprocess_regexps = [
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
    (re.compile(r'(<span class="c-overline">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + ': ' + match.group(2)),
]
Code:
<h2 class="c-headline"><span class="c-overline">Wandel kostet Milliarden</span> SUV und China sollen Audi wieder nach vorne bringen </h2>
However, I'm not sure what you mean by Quote:
Thanks again, anyway. Hegi. |
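To sanity-check the shortened regex outside calibre, one can run it through re.sub against the normalized markup from the debug output. A minimal sketch, assuming (as shown above) that calibre has already shortened the class list to c-overline:

```python
import re

# Normalized markup as it appears in calibre's debug output (see above)
html = ('<h2 class="c-headline"><span class="c-overline">Wandel kostet Milliarden</span> '
        'SUV und China sollen Audi wieder nach vorne bringen </h2>')

pattern = re.compile(r'(<span class="c-overline">[^<]*)(</span>)',
                     re.DOTALL | re.IGNORECASE)
result = pattern.sub(lambda m: m.group(1) + ': ' + m.group(2), html)

# The overline text now ends with ': ' just before the closing tag
print(result)
```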
|
03-18-2018, 05:18 PM | #30 |
Wizard
Posts: 1,161
Karma: 1404241
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
This is a batch script I use for analyzing my recipes. I put the recipe and the script in one directory (it first cleans up the old run). After the recipe is finished you will find the console output in a log file and the different conversion stages in the folder debug.
Code:
REM * Remove old debug directory
rmdir debug. /s /q
REM * Delete old recipe and log file
del WirtschaftsWoche.epub
del WirtschaftsWoche.log
REM * Run new recipe in debug mode
ebook-convert WirtschaftsWoche.recipe .epub -vv --debug-pipeline debug > WirtschaftsWoche.log
For checking a variable you can put a print statement like this in your recipe:
print '*** my_variable_main --->:', my_variable_main
This is helpful for checking whether the content you want to modify includes what you expect, or whether a selection is found at all. |
|