Old 04-16-2015, 02:20 PM   #16
hegi
Enthusiast
Posts: 26
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Wirtschaftswoche title image broken for a while

Hey ho,

the code snippet to get the title image for Wirtschaftswoche is currently:

Code:
    cover_source          = 'http://www.wiwo-shop.de/wirtschaftswoche/wirtschaftswoche-emagazin-p1952.html'
[...]

    def get_cover_url(self):
       cover_source_soup = self.index_to_soup(self.cover_source)
       preview_image_div = cover_source_soup.find(attrs={'class':'container vorschau'})
       return 'http://www.wiwo-shop.de'+preview_image_div.a.img['src']
However, they have changed the site quite a bit. I now have two options for fetching this image. Either I take it from this bit of HTML:

(from: https://kaufhaus.handelsblatt.com/do...zin-p1952.html)
Code:
			<div class="carousel-inner">
				<figure class="active item">
					<img src="https://kdww.cekom.de/images/lrn/spacer.gif" style="background: transparent url(https://kdww.cekom.de/images//WW_titel_16-w454-h298-ar.jpg) center center no-repeat;" title="WirtschaftsWoche eMagazin" alt="WirtschaftsWoche eMagazin">
									</figure>
							</div>
This would be: https://kdww.cekom.de/images//WW_titel_16-w454-h298-ar.jpg

or alternatively I take this:

(from: http://www.wiwo.de)
Code:
<div data-vr-zone="Das Aktuelle Heft" class="hcf-mcol-box"><div class="hcf-content hcf-mcol-box-content hcf-decorated-box"><div class="hcf-morewiwo-content" data-vr-contentbox=""><div class="hcf-wiwo-image"><a title="Wirtschaftswoche" target="_blank" href="http://abo.wiwo.de/"><img border="0" alt="Wirtschaftswoche" src="http://www.wiwo.de/images/wirtschaftswoche-cover-16-2015/10019036/46-formatOriginal.gif"/></a></div><div class="hcf-recent-wiwo"><h4 class="hcf-teaser-text">WirtschaftsWoche 16 vom 13.4.2015</h4>
Then I would need this: http://www.wiwo.de/images/wirtschaft...atOriginal.gif


I would prefer the first source (slightly better resolution). However, in both cases I fail to make the necessary adaptations to the soup section. Does anyone have a hot tip for me?

Thanks

Hegi.
Old 04-16-2015, 11:55 PM   #17
kovidgoyal
creator of calibre
Posts: 33,406
Karma: 10205094
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Something like

Code:
style = soup.find('img', alt='WirtschaftsWoche eMagazin', style=True)['style']
self.cover_url = style.partition('(')[-1].rpartition(')')[0]
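
A minimal sketch of what these two lines do, run on the style attribute quoted earlier in the thread. The helper name url_from_style is made up for illustration; in the recipe the string would come from soup.find(...)['style'] instead of being hard-coded.

```python
# Style attribute copied from the <img> tag hegi quoted from the shop page.
style = ("background: transparent url(https://kdww.cekom.de/images//"
         "WW_titel_16-w454-h298-ar.jpg) center center no-repeat;")


def url_from_style(style):
    """Return the text between the first '(' and the last ')' of a style string.

    partition('(')[-1] keeps everything after the first '(',
    rpartition(')')[0] then keeps everything before the last ')'.
    If the string contains no '(', the result is an empty string.
    """
    return style.partition('(')[-1].rpartition(')')[0]


cover_url = url_from_style(style)
print(cover_url)
```

Because the result is empty when no url(...) is present, a recipe could check for an empty string before returning it as the cover URL.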
Old 04-17-2015, 01:52 PM   #18
hegi
Dear Kovid,

thanks for your quick reply. However, it appears I still don't get it right.

I took your snippet and made from it this:

Code:
    cover_source          = 'https://kaufhaus.handelsblatt.com/downloads/wirtschaftswoche-emagazin-p1952.html'

    def get_cover_url(self):
        cover_source_soup = self.index_to_soup(self.cover_source)
        style = soup.find('img', alt='WirtschaftsWoche eMagazin', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url
But when I look into the log of the recipe I get:

Quote:
[Cover could not be downloaded]: global name 'soup' is not defined
Traceback (most recent call last):
File "site-packages/calibre/web/feeds/news.py", line 1263, in _download_cover
File "<string>", line 49, in get_cover_url
NameError: global name 'soup' is not defined
Line 49 is your "style = soup.find ..." line.

As far as my limited understanding of these issues goes, I have to embed this code somehow into a "def ... return ..." section. But maybe your suggestion was meant in a different way?

Thanks again.

Hegi.
Old 04-17-2015, 09:49 PM   #19
kovidgoyal
Change

cover_source_soup = self.index_to_soup(self.cover_source)

to

soup = self.index_to_soup(self.cover_source)
Old 04-18-2015, 05:03 PM   #20
hegi
revised WirtschaftsWoche Online Recipe

Thanks again Kovid!

Although I am a great enthusiast for Indian food, this soup stuff always seems to get me.

Here's the revised recipe to be used for further updates:

Code:
__license__   = 'GPL v3'
__copyright__ = '2013, Armin Geller'

'''
Fetch WirtschaftsWoche Online
'''
import re
#import time
from calibre.web.feeds.news import BasicNewsRecipe
class WirtschaftsWocheOnline(BasicNewsRecipe):
    title                 = u'WirtschaftsWoche Online'
    __author__            = 'Armin Geller' # Update AGE 2013-01-05; Modified by Hegi 2013-04-28
    description           = u'Wirtschaftswoche Online - basierend auf den RSS-Feeds von Wiwo.de'
    tags 	                = 'Nachrichten, Blog, Wirtschaft'
    publisher             = 'Verlagsgruppe Handelsblatt GmbH / Redaktion WirtschaftsWoche Online'
    category              = 'business, economy, news, Germany' 
    publication_type      = 'weekly magazine'
    language              = 'de_DE'
    oldest_article        = 7
    max_articles_per_feed = 100
    simultaneous_downloads= 20
    
    auto_cleanup          = False
    no_stylesheets        = True
    remove_javascript     = True
    remove_empty_feeds    = True

    # don't duplicate articles from "Schlagzeilen" / "Exklusiv" to other rubrics 
    ignore_duplicate_articles = {'title', 'url'}

    # reduce image size for a b/w or E-ink device (comment these lines out to keep full-size images):
    compress_news_images  = True
    # compress_news_images_auto_size = 16
    scale_news_images     = (400,300)
    compress_news_images_max_size = 35

    timefmt               = ' [%a, %d %b %Y]'

    conversion_options    = { 'smarten_punctuation' : True,
			'authors'		  : publisher,
			'publisher'  	  : publisher }
    encoding              = 'UTF-8'
    cover_source          = 'https://kaufhaus.handelsblatt.com/downloads/wirtschaftswoche-emagazin-p1952.html'
    masthead_url          = 'http://www.wiwo.de/images/wiwo_logo/5748610/1-formatOriginal.png'

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source) 
        style = soup.find('img', alt='WirtschaftsWoche eMagazin', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url

    # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
    # If you use .epub format you could also do this as extra_css '.hcf-location-mark:after {content: ". "}' 
    preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2))]

    extra_css      =  'h1 {font-size: 1.6em; text-align: left} \
                       h2 {font-size: 1em; font-style: italic; font-weight: normal} \
                       h3 {font-size: 1.3em;text-align: left} \
                       h4, h5, h6, a {font-size: 1em;text-align: left} \
                       .hcf-caption {font-size: 1em;text-align: left; font-style: italic} \
	             .hcf-location-mark {font-style: italic}' 

    keep_only_tags    = [
                          dict(name='div', attrs={'class':['hcf-column hcf-column1 hcf-teasercontainer hcf-maincol']}),
                          dict(name='div', attrs={'id':['contentMain']})
                        ]

    remove_tags = [
                    dict(name='div', attrs={'class':['hcf-link-block hcf-faq-open', 'hcf-article-related']})
                  ]

    feeds = [
              (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'), 
              (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
              #(u'Themen', u'http://www.wiwo.de/contentexport/feed/rss/themen'), # AGE no print version
              (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'), 
              (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'), 
              (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'), 
              (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'), 
              (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'), 
              #(u'Green-WiWo', u'http://green.wiwo.de/feed/rss/') # AGE no print version
            ]
    def print_version(self, url):
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id

CU

Hegi.
Old 03-04-2018, 05:50 AM   #21
hegi
Wiwo.de Website Relaunch

Hey there,

there was a complete website relaunch at wiwo.de a fortnight ago and now nothing really works anymore .... Grrrrrr!

At least the feeds appear to be the same:
Code:
feeds = [
              (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'), 
              (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
              #(u'Themen', u'http://www.wiwo.de/contentexport/feed/rss/themen'), # AGE no print version
              (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'), 
              (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'), 
              (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'), 
              (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'), 
              (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'), 
              #(u'Green-WiWo', u'http://green.wiwo.de/feed/rss/') # AGE no print version
            ]
So I do get the feeds grouped by category, but when clicking on them I just get a link-list of articles that does not work.

However, the major change seems to be that there are no longer "print versions" of the pages to use for extraction. The code snippet that handled this so far was:
Code:
    def print_version(self, url):
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id
My first hands-on attempt, simply removing the "+ '/v_detail_tab_print/'" part, did not make things any better.

Any suggestions or hints are most welcome.

Thanks

Hegi.
Old 03-04-2018, 06:02 AM   #22
kovidgoyal
Remove the entire print_version function.
Old 03-04-2018, 06:31 AM   #23
Divingduck
Wizard
Posts: 1,116
Karma: 1404167
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
Hi hegi,

this is my current version. Take what you need for your recipe.

DD
Attached Files
File Type: zip WirtschaftsWoche_AGe_V4.2.zip (1.4 KB, 20 views)
Old 03-04-2018, 11:48 AM   #24
hegi
Thanks Kovid,

I wasn't sure I could remove the whole lot.

I'm a fair bit further down the road, but unfortunately not yet there ... Here are some of the issues I'm battling with:

From you I got the code snippet that adds a "." after "hcf-location-mark" class:
Code:
preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2))]
I wanted to extend this to also add a ":" after the "c-overline--article" class, but obviously it does not work the way I tried. Why?
Code:
    preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2)),
                        (re.compile(r'(<span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + ': ' + match.group(2))]
Then there is something that is similar to the print-version thing, but different. Some articles are "multi-page" articles. In that case there is a separate single-page version with "-all" inserted before the ".html".
I just checked a few articles. The good news is that I can request every article as its "-all" version.

E.g. the article being https://www.wiwo.de/erfolg/managemen.../21022938.html

So I'd need something analogous to the print-version function to add this to the URL, turning it into ".../21022938-all.html".

My best guess would be something like:
Code:
    def one_page_version(self, url):
        main, sep, tail = url.rpartition('.')
        return main + '-all.' + tail
But it does not work, so any help is appreciated.

And then there is an annoying bit. I want to get rid of the embedded social media stuff. The code looks like this

Code:
<div class="o-article__element">
<div class="c-socialshare u-margin-xxl ">
<h3 class="c-socialshare__headline u-margin-xl u-font-bold u-font-m-sm">
Diesen Artikel teilen:
</h3>
<a class="ajaxify c-socials__item c-socials__item--facebook "
title="Auf Facebook teilen"
data-command='{"trackSocial": "Facebook", "socialshare": {"provider": "facebook", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "https://www.facebook.com/sharer/sharer.php?display=popup&amp;u=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dfb"}}'>
<span class="c-socials__icon c-socials__icon--facebook">
<span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--facebook"></i></span>
</span>
<span class="c-socials__text">Facebook</span>
</a>
<a class="ajaxify c-socials__item c-socials__item--twitter "
title="Auf Twitter teilen"
data-command='{"trackSocial": "Twitter", "socialshare": {"provider": "twitter", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "https://twitter.com/intent/tweet?url=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dtwitter&amp;text=Versicherungsbranche: Axa%20will%20offenbar%20US-Konkurrenten%20XL%20Group%20kaufen&amp;hashtags="}}'>
<span class="c-socials__icon c-socials__icon--twitter">
<span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--twitter"></i></span>
</span>
<span class="c-socials__text">Twitter</span>
</a>
<a class="ajaxify c-socials__item c-socials__item--xing "
title="Auf Xing teilen"
data-command='{"trackSocial": "Xing", "socialshare": {"provider": "xing", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "https://www.xing-share.com/app/user?op=share;sc_p=xing-share;url=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dxing"}}'>
<span class="c-socials__icon c-socials__icon--xing">
<span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--xing"></i></span>
</span>
<span class="c-socials__text">Xing</span>
</a>
<a class="ajaxify c-socials__item c-socials__item--whatsapp u-desktop-hidden hidden-md-up"
title="Per Whatsapp teilen"
data-command='{"trackSocial": "Whatsapp", "socialshare": {"provider": "whatsapp", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "whatsapp://send?text=Versicherungsbranche: Axa%20will%20offenbar%20US-Konkurrenten%20XL%20Group%20kaufen https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dwhatsapp"}}'>
<span class="c-socials__icon c-socials__icon--whatsapp">
<span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--whatsapp"></i></span>
</span>
<span class="c-socials__text">Whatsapp</span>
</a>
<a class="ajaxify c-socials__item c-socials__item--mail "
title="Per Mail teilen"
data-command='{"trackSocial": "Mail", "socialshare": {"provider": "mail", "url": "https://www.wiwo.de/21030160.html", "content": "", "providerUrl": "mailto: ?subject=Versicherungsbranche: Axa%20will%20offenbar%20US-Konkurrenten%20XL%20Group%20kaufen - WirtschaftsWoche&amp;body=https%3A%2F%2Fwww.wiwo.de%2F21030160.html%3Fshare%3Dmail"}}'>
<span class="c-socials__icon c-socials__icon--mail">
<span class="c-socials__icon--background"><i class="c-icon c-icon__socials c-icon--mail"></i></span>
</span>
<span class="c-socials__text">Mail</span>
</a>
</div>
</div>
But the remove_tags in my recipe don't catch it:
Code:
    remove_tags = [
                    dict(name='div', attrs={'class':['c-socialshare__headline', 'c-socials__item', 'c-pagination u-flex ajaxify', 'u-font-bold']})
                  ]
If I manage these issues, we are almost there for a new version of the recipe to share ...

Thanks folks!

Hegi.
Old 03-04-2018, 01:34 PM   #25
kovidgoyal
You should name the method print_version() not one_page_version()
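
A sketch of how hegi's rpartition idea could look under the method name that calibre calls for each article URL. The function is shown standalone so it can be run without calibre; in the recipe the same body would sit inside def print_version(self, url). The example URL is hypothetical, since the real article URL is truncated in the post above.

```python
def print_version(url):
    """Turn '.../21022938.html' into '.../21022938-all.html'.

    rpartition('.') splits at the last dot, so only the file extension
    is separated from the rest of the URL.
    """
    main, sep, tail = url.rpartition('.')
    return main + '-all.' + tail


# Hypothetical article URL, used only to show the transformation.
example = 'https://www.example.com/section/some-article/21022938.html'
single_page = print_version(example)
print(single_page)
```

Note that this assumes every article URL ends in a file extension such as ".html"; a URL without any dot after the host name would be split in the wrong place.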

Your remove_tags should have

Code:
dict(attrs={'class': lambda x: x and 'c-socialshare' in x.split()}),
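
A small sketch of how this matcher behaves on class strings taken from the HTML hegi posted. It assumes the callable is handed the raw class attribute as one string, or None when a tag has no class attribute, which is what the "x and ..." guard suggests. The variable names are made up for illustration.

```python
# The matcher from the dict above, given a name so it can be tried out.
matcher = lambda x: x and 'c-socialshare' in x.split()

# Class attribute values copied from the social-share markup in the thread.
samples = [
    'c-socialshare u-margin-xxl ',   # outer <div>, note the trailing space
    'c-socialshare__headline u-margin-xl u-font-bold u-font-m-sm',  # <h3>
    'o-article__element',            # wrapper <div> around the share box
    None,                            # a tag without any class attribute
]

for value in samples:
    # split() breaks on any run of whitespace, so trailing or doubled
    # spaces in the attribute do not matter.
    print(repr(value), '->', bool(matcher(value)))
```

Only the first sample matches. Since the dict has no name key, the tag type does not matter, and removing that one outer div would also drop the headline and all share links nested inside it.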
Old 03-13-2018, 04:20 PM   #26
hegi
Thanks Divingduck & Kovid,

... I'm getting better at this one, and slowly things are turning out as neat as I wanted. However, this one thing is bugging me:

In the source I have e.g. (look out for the bold tags):
Code:
<h2 class="c-headline c-headline--article u-margin-m"><span
class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Nikolai Gluschkow</span> Russischer Geschäftsmann tot in London entdeckt
</h2>
<div class="c-metadata u-margin-xl ">
<div>
</div>
<time
datetime="2018-03-13T19:38:19+01:00">13. März 2018</time>
<span>, aktualisiert
<time datetime="2018-03-13T19:40:57+01:00">13. März 2018, 19:40 Uhr</time>
</span>
<span class="c-metadata__source"> | Quelle: <a href="http://www.handelsblatt.com"
target="_blank">Handelsblatt Online</a></span>
</div>

[...]

<div class="o-article__content-element o-article__content-element--richtext">
<div class="u-richtext ajaxify"
data-command='{"richtext": {}}'>
<p><span class="hcf-location-mark">London</span>Ein mit dem 2013 verstorbenen Oligarchen Boris Beresowski befreundeter russischer Geschäftsmann ist in London tot aufgefunden worden. Nikolai Gluschkow sei nicht mehr am Leben, sagte Anwalt Andrej Borowkow am Dienstag russischen Medien. Er wisse aber nichts über die Umstände und den Zeitpunkt des Todes des 68-Jährigen.</p> </div>
</div>
But I do not understand why I can't add a ": " after the name "Nikolai Gluschkow". As stated above, my code is derived from the "hcf-location-mark" bit and I just don't understand why it's not working that way:

Code:
    preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2)),
                        (re.compile(r'(<span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + ': ' + match.group(2))]
... just in case the colon (":") is the problem, I also tried the HTML entity "& #058; " instead (without the space, otherwise it won't show up here), but still to no avail ...

Any hints, as to what I'm doing wrong here?

Thanks a lot

Hegi
Old 03-17-2018, 10:56 AM   #27
hegi
Hey folks,

... sorry, but I'm somehow stuck on this.

Matching this tag with the regex just does not work:
Code:
<span
class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Nikolai Gluschkow</span>
... I tried reducing it to just c-overline or c-overline--article, but it still does not match. I also threw the hcf-location-mark expression out for a while, but that did not change anything.

I somehow got the impression that this long list of classes separated by spaces is the problem ... but I have no clue how to go about it.

Thanks a lot in advance.

Hegi.
Old 03-18-2018, 06:53 AM   #28
Divingduck
Sorry for my late answer, I was on a trip.

Can't really help. I use classes with spaces in their names too (see the remove_tags statement in my last file), and that works for my recipe.
You can check with debug what exactly the recipe is doing. Put some print statements in your recipe and pipe all the output into a log file. Maybe this will help you find out what is going wrong. I have noticed that on WiWo some class attributes also have spaces at the end of the string and/or more than one space between class names.
Old 03-18-2018, 02:00 PM   #29
hegi
Hi Divingduck,

... now this is really interesting. I retrieved the recipe with debugging info via the CLI as follows:

Code:
ebook-convert ~/.config/calibre/custom_recipes/WirtschaftsWoche\ Online_1014.recipe .mobi \
        --mobi-file-type=new --output-profile=kindle_pw --debug-pipeline calibre-debug
The original html-line on the website is like this:
Code:
<h2 class="c-headline c-headline--article u-margin-m"><span
class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Wandel kostet Milliarden</span> SUV und China sollen Audi wieder nach vorne bringen
</h2>
When I now dive into the debugging data, I get in the processed directory the following code:
Code:
<h2 class="c-headline"><span class="c-overline">Wandel kostet Milliarden</span> SUV und China sollen Audi wieder nach vorne bringen
</h2>
This is interesting, as the other classes are not specified in the remove_tags statement. ... OK ...

This leads me to changing the preprocess_regexps as follows:
Code:
    preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2)),
                        (re.compile(r'(<span class="c-overline">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + ': ' + match.group(2))]
But unfortunately this does not change the output in the processed directory:
Code:
<h2 class="c-headline"><span class="c-overline">Wandel kostet Milliarden</span> SUV und China sollen Audi wieder nach vorne bringen
</h2>
I just don't get why this isn't working for these tags ...
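
One possible explanation, not confirmed anywhere in this thread: in the original page source there is a line break between "<span" and "class=", while the pattern expects exactly one space there. Assuming preprocess_regexps runs on the HTML as downloaded (rather than on the cleaned-up markup seen in the processed debug directory), neither the long pattern nor the shortened one could match. The sketch below compares the original pattern with a more tolerant one; the names strict, tolerant and add_colon are made up for illustration.

```python
import re

# Headline as quoted from the website source, including the line break
# between "<span" and "class=".
raw_html = (
    '<h2 class="c-headline c-headline--article u-margin-m"><span\n'
    'class="c-overline c-overline--alternate u-uppercase u-letter-spacing '
    'u-margin-m c-overline--article">Wandel kostet Milliarden</span> '
    'SUV und China sollen Audi wieder nach vorne bringen\n</h2>'
)

# Pattern from the recipe: a single literal space after "<span".
strict = re.compile(
    r'(<span class="c-overline c-overline--alternate u-uppercase '
    r'u-letter-spacing u-margin-m c-overline--article">[^<]*)(</span>)',
    re.DOTALL | re.IGNORECASE)

# Tolerant variant: any whitespace after "<span", and any further classes
# after "c-overline" inside the attribute value.
tolerant = re.compile(
    r'(<span\s+class="c-overline[^"]*">[^<]*)(</span>)',
    re.DOTALL | re.IGNORECASE)

add_colon = lambda match: match.group(1) + ': ' + match.group(2)

strict_result = strict.sub(add_colon, raw_html)
tolerant_result = tolerant.sub(add_colon, raw_html)

print(strict_result == raw_html)   # True means the strict pattern changed nothing
print(tolerant_result)
```

If this is indeed the cause, using the tolerant pattern as the second entry of preprocess_regexps might be worth a try.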

However, I'm not sure what you mean by
Quote:
Set some print statements in your recipe and pipe all statements in a log file.
This sounds like some manual logging workaround I do not understand.

Thanks again, anyway.

Hegi.
Old 03-18-2018, 05:18 PM   #30
Divingduck
This is a batch script I use for analyzing my recipes. I put the recipe and the script in one directory (the script first cleans up the previous run). After the recipe has finished, you will find the console output in a log file and the different conversion stages in the debug folder.

Code:
REM * Remove old debug directory
rmdir /s /q debug

REM * Delete old recipe and log file
del WirtschaftsWoche.epub
del WirtschaftsWoche.log

REM * Run the recipe in debug mode and write the output to a log file
ebook-convert WirtschaftsWoche.recipe .epub -vv --debug-pipeline debug > WirtschaftsWoche.log
You can produce additional output with a print statement inside the recipe, for example printing a variable:
print '*** my_variable_main --->:', my_variable_main
This is helpful for checking whether the content you want to modify contains what you expect, or whether a selection is found at all.