Old 05-13-2019, 02:48 PM   #1
Leonatus
Guru
Posts: 662
Karma: 7077424
Join Date: Mar 2013
Location: Berlin, Germany
Device: Kobo Touch
Replacement of Replacement Character

While I'm adjusting my news download, I have one small question. In the online original, my news source uses quotation marks of this sort:
Code:
„...“
In the downloaded news they are replaced by the replacement character:
Code:
�...�
No big problem, but ... ugly.
Is it possible to edit the recipe so that it replaces the replacement characters with quotation marks (of any kind)?
The original site is encoded in ISO-8859-1, and so is the recipe. I changed the recipe's encoding to utf-8, but this didn't help.
Old 05-14-2019, 02:32 AM   #2
kovidgoyal
creator of calibre
Posts: 34,335
Karma: 10323932
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Make sure the encoding field in the recipe matches the encoding of the website and you will be fine. If you want to do search and replace in the recipe, you can use preprocess_regexps.
Old 05-14-2019, 02:49 AM   #3
Leonatus
That was the first thing I tried, despite my technical ignorance: I checked whether the encoding of the original website matched the encoding of the recipe, and to my astonishment it did. So this does not seem to be the culprit.

Could you please explain how to use preprocess_regexps step by step? (Sorry, I really am technically ignorant.)

Edit: In the meantime I noticed that in some individual articles the quotation marks are displayed correctly, although their source code looks the same as that of the other articles. Hm ... this is getting interesting.

Edit 2: There is one difference, however: in the articles with the replacement character, quotes are represented by „...“, whereas in the correctly displayed articles they are "...".

Last edited by Leonatus; 05-14-2019 at 03:22 AM.
Old 05-14-2019, 06:18 AM   #4
Leonatus
I read in calibre's documentation that preprocess_regexps should look like this:
Code:
preprocess_regexps = [
   (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL|re.IGNORECASE),
    lambda match: '</body>'),
]
Unfortunately, I have no idea how to proceed in order to replace all „ and “ with ". Could one of the pros here please give me a hint on how to do this?
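For readers following along, the mechanics of such a list can be tried outside calibre. This is a minimal sketch, assuming Python 3 and assuming that each (pattern, function) pair is applied in order through the pattern's sub() method; apply_rules and the sample HTML are hypothetical names made up for illustration and are not part of calibre.

```python
import re

# Same shape as the documentation example: a list of (compiled pattern, function) pairs.
preprocess_regexps = [
    (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL | re.IGNORECASE),
     lambda match: '</body>'),
]


def apply_rules(html, rules):
    """Hypothetical helper: run every (pattern, function) pair over the text in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Made-up page: everything from the marker comment to </body> gets cut.
sample = '<body><p>News</p><!--Article ends here--><p>Footer</p></body>'
cleaned = apply_rules(sample, preprocess_regexps)
print(cleaned)
```

Each function receives the match object and returns the text that should replace the matched span, so replacing a single character only needs a shorter pattern and a different return value.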
Old 05-14-2019, 02:14 PM   #5
siebert
Developer
Posts: 144
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad Air 2 WiFi / Moto Z3 Play (Android)
Untested:

Code:
preprocess_regexps = [
   (re.compile(r'[„“]'),
    lambda match: '"'),
]
Old 05-14-2019, 02:56 PM   #6
Leonatus
Thank you, but it doesn't work. The replacement characters still appear.
Old 05-14-2019, 03:05 PM   #7
siebert
I don't think I've ever used Unicode in regular expressions. Did you just copy my code, or did you try replacing the „“ characters in it with the ones copied from the source webpage?

Otherwise this variant might work better:

Code:
preprocess_regexps = [
   (re.compile(r'„|“'),
    lambda match: '"'),
]
Or you could post the whole recipe here, so I can test it.
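Whether the two patterns behave differently can be checked in isolation. This is a minimal sketch, assuming Python 3, where both the pattern and the text are Unicode strings; the sample sentence is made up. Under Python 2 a pattern literal without the u prefix would be a byte string, so the outcome could differ there.

```python
import re

# Made-up sentence that really contains the German quotation marks.
text = 'Er sagte: „Guten Tag“.'

# Character class and alternation are two spellings of the same match.
with_class = re.compile('[„“]').sub('"', text)
with_alternation = re.compile('„|“').sub('"', text)

print(with_class)
print(with_alternation == with_class)
```

If both variants give the same result on text that really contains „ and “, the pattern is probably not the problem; the characters that actually reach the pattern are.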
Old 05-14-2019, 03:14 PM   #8
Leonatus
The variant didn't work either. I had simply copied and pasted the code from your post; the characters reproduced in #1 were originally copied from the website and from calibre's e-book viewer, respectively (the display is the same as on my reader).
This is the original recipe:
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1295262156(BasicNewsRecipe):
    title = u'kath.net'
    __author__ = 'Bobus'
    description = u'Katholische Nachrichten'
    oldest_article = 7
    language = 'de'
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'iso-8859-1'

    feeds = [(u'kath.net', u'https://www.kath.net/2005/xml/index.xml')]

    def print_version(self, url):
        return url + "/print/yes"

    def get_browser(self, *a, **kwargs):
        kwargs['verify_ssl_certificates'] = False
        return BasicNewsRecipe.get_browser(self, *a, **kwargs)

    extra_css = 'td.textb {font-size: medium;}'
Thank you for testing!
Old 05-14-2019, 04:12 PM   #9
siebert
Sorry, all the things I googled and tried didn't work. I'm running out of ideas.
Old 05-15-2019, 12:49 AM   #10
kovidgoyal
You need to replace the replacement character, not the quote, since the quote will already have been replaced by the replacement character by the time preprocess_regexps runs.
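A sketch of what that could look like, assuming Python 3 and assuming the character in the downloaded text really is U+FFFD, the Unicode replacement character; the sample string is made up. If the text instead holds some other unprintable character that the viewer merely draws as �, this pattern will not match.

```python
import re

# Match the Unicode replacement character U+FFFD instead of the quotation marks.
preprocess_regexps = [
    (re.compile('\ufffd'), lambda match: '"'),
]

# Made-up text in which the quotes have already been turned into U+FFFD.
sample = 'Er sagte: \ufffdGuten Tag\ufffd.'
for pattern, func in preprocess_regexps:
    sample = pattern.sub(func, sample)
print(sample)
```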
Old 05-15-2019, 02:06 AM   #11
Leonatus
Quote:
Originally Posted by kovidgoyal View Post
You need to replace the replacement character, not the quote, since the quote will already have been replaced by the replacement character by the time preprocess_regexps runs.
Hm, that was my thought too, but it didn't work either, at least not when following siebert's suggestion. Anyway, thanks for the help!
Old 05-15-2019, 10:21 AM   #12
Leonatus
Should I perhaps escape the replacement character, and how do I do this?
Old 05-15-2019, 10:51 AM   #13
theducks
Well trained by Cats
Posts: 23,078
Karma: 24012262
Join Date: Aug 2009
Location: (The original) Silicon Valley, USA
Device: K4NT, Galaxy Tab A, Kobo Aura2
Quote:
Originally Posted by Leonatus View Post
Should I perhaps escape the replacement character, and how do I do this?
The backslash is the 'escape'. \\ allows the \ to be the target.
In theory you could escape any character: \e\s\c\a\p\e
(If in doubt, I escape the symbols I search for. Not all of them really need to be escaped.)
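In Python's re module, which the recipe snippets in this thread use, a literal can be escaped with re.escape instead of by hand. This is a small sketch, assuming Python 3.7 or later; the search strings are made up. Note that a backslash before a letter is not a plain escape there: \s matches whitespace, and unknown letter escapes such as \e are rejected as errors, so hand-escaping is best kept to symbols.

```python
import re

# Made-up literal containing several regex metacharacters.
literal = 'price (net) 3.50?'
pattern = re.compile(re.escape(literal))

# The escaped pattern matches the literal text and nothing looser.
print(bool(pattern.fullmatch(literal)))
print(bool(pattern.fullmatch('price (net) 3x50?')))

# A backslash before a letter changes the meaning rather than escaping it.
print(bool(re.fullmatch(r'\s', ' ')))

# Unknown letter escapes are errors in the pattern syntax.
try:
    re.compile(r'\e')
    rejected = False
except re.error as exc:
    rejected = True
    print('rejected:', exc)
```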
Old 05-15-2019, 11:11 AM   #14
Leonatus
Quote:
Originally Posted by theducks View Post
The backslash is the 'escape'. \\ allows the \ to be the target.
In theory you could escape any character: \e\s\c\a\p\e
(If in doubt, I escape the symbols I search for. Not all of them really need to be escaped.)
I did this, but to no avail. My thought now is that perhaps I should search for the ISO 8859-1 code of the replacement character, but this is well beyond my abilities.
Edit: In the Wikipedia article "Specials (Unicode block)" I found this: "... It has become increasingly common for software to interpret invalid UTF-8 by guessing the bytes are in another byte-based encoding such as ISO-8859-1."
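One way to see where U+FFFD can come from is to decode bytes with the wrong codec. This is a minimal sketch, assuming Python 3 and assuming the page stores the quotes as the single bytes 0x84 and 0x93 (an assumption, since the thread does not show the raw bytes); whether calibre decodes pages exactly this way is also an assumption.

```python
# Hypothetical raw bytes: a word wrapped in single-byte quotation marks.
raw = b'\x84Zitat\x93'

# 0x84 and 0x93 are not valid on their own in UTF-8, so a lenient decoder
# substitutes U+FFFD for each of them.
as_utf8 = raw.decode('utf-8', errors='replace')
print(as_utf8 == '\ufffdZitat\ufffd')

# A strict decoder refuses the same bytes outright.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print('strict decode failed:', exc.reason)
```

Under these assumptions, declaring the recipe as utf-8 cannot bring the quotes back, which would be consistent with that change not helping.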

Last edited by Leonatus; 05-15-2019 at 11:19 AM.
Old 05-15-2019, 07:40 PM   #15
lui1
Enthusiast
Posts: 25
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
According to Wikipedia (see the articles on ISO-8859-1 and Windows-1252), web pages and emails are commonly labeled as ISO-8859-1 when they are actually encoded in Windows-1252. Most web browsers and email clients therefore treat ISO-8859-1 as Windows-1252. This practice is so prevalent that it became part of the HTML5 specification. So any web page that claims to be encoded in ISO-8859-1 should be treated as Windows-1252, which in the recipe means:

Code:
encoding = 'windows-1252'
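The difference between the two encodings can be sketched directly. This assumes Python 3.7 or later and hypothetical bytes 0x84 and 0x93 for the quotes; the thread does not show the page's raw bytes, so treat them as an illustration only.

```python
# Hypothetical raw bytes: a word wrapped in single-byte quotation marks.
raw = b'\x84Zitat\x93'

# ISO-8859-1 maps 0x84 and 0x93 to invisible C1 control characters.
as_latin1 = raw.decode('iso-8859-1')
print([hex(ord(ch)) for ch in as_latin1 if not ch.isascii()])

# Windows-1252 maps the same bytes to the German quotation marks.
as_cp1252 = raw.decode('windows-1252')
print(as_cp1252)
```

A viewer has no glyph for U+0084 or U+0093, which may be why such characters show up as �, and why a pattern looking for „, “ or U+FFFD would not match them.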

Last edited by lui1; 05-15-2019 at 07:51 PM. Reason: fix typos
All times are GMT -4.