Old 06-18-2016, 06:45 AM   #4
roger64
Hi

I had to prepare this for another book anyway, so I am attaching a cut-down version of a French book. Before running anything, you can check that it currently contains:
- 0 narrow no-break spaces (nnbsp, \u202F)
- 371 hyphen-minus (-), plus 10 incorrect ones placed in chapter 2, for a total of 381
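
For anyone who wants to reproduce these counts outside the editor, here is a minimal sketch. It assumes the book's HTML files have already been read into strings; `count_chars` is a hypothetical helper, not part of calibre, and it counts every hyphen-minus, including any inside tags or attributes.

```python
NNBSP = "\u202f"  # narrow no-break space


def count_chars(texts):
    """Return (nnbsp_count, hyphen_minus_count) summed over an iterable of strings.

    Hypothetical helper for illustration only. It counts raw characters,
    so hyphens inside tags or attribute values are included too.
    """
    nnbsp = sum(t.count(NNBSP) for t in texts)
    hyphens = sum(t.count("-") for t in texts)
    return nnbsp, hyphens


# Two made-up fragments standing in for the book's HTML files
sample = ["<p>Est-ce vrai\u202f?</p>", "<p>par-dessus, en-tête</p>"]
print(count_chars(sample))
```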

1. - The first function inserts nnbsp around French punctuation
search:
Code:
>[^\n<]*?<
function text:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace("'","’").replace(">— ",">—@").replace(">—",">—@").replace(" !","@!").replace("!","@!").replace(" ?","@?").replace("?","@?").replace(" ;","@;").replace(";","@;").replace(" :","@:").replace(":","@:").replace("« ","«@").replace("«","«@").replace(" »","@»").replace("»","@»").replace("@@","@").replace("@","\u202f")
For me it reports 536 replacements done.
But if you then count them with a regex, the book now contains 650 nnbsp.
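
One possible reading of the gap, sketched below with the standard `re` module rather than inside calibre: a count made per regex match and a count of inserted characters are different things, because one matched text node can receive several nnbsp. The `fix` function is a simplified stand-in that only handles "?" and "!", and the sample HTML is made up. Whether calibre's figure counts every match or only some of them is an assumption this sketch does not settle.

```python
import re

NNBSP = "\u202f"  # narrow no-break space


def fix(match):
    # Simplified stand-in for the function above: only "?" and "!" are handled.
    s = match.group()
    for mark in "?!":
        s = s.replace(" " + mark, "@" + mark).replace(mark, "@" + mark)
    # Collapse doubled placeholders, then turn each placeholder into a nnbsp
    return s.replace("@@", "@").replace("@", NNBSP)


html = "<p>Quoi ? Non ! Si ! Ah ?</p><p>Rien ici.</p>"

# re.subn returns the new string and the number of matches it processed
result, n_matches = re.subn(r">[^\n<]*?<", fix, html)
n_nnbsp = result.count(NNBSP)

print(n_matches, n_nnbsp)
```

Here the number of matches is smaller than the number of nnbsp inserted, which is the same direction as 536 versus 650.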

2. - The second function removes unwanted hyphens (it relies on the French dictionary)

search:
Code:
>[^<>]+<
function text:
Code:
import regex
from calibre import replace_entities
from calibre import prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def replace_word(wmatch):
        # Try to remove the hyphen and replace the words if the resulting
        # hyphen free word is recognized by the dictionary
        without_hyphen = wmatch.group(1) + wmatch.group(2)
        if dictionaries.recognized(without_hyphen):
            return without_hyphen
        return wmatch.group()

    # Search for words split by a hyphen
    text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &amp;
    corrected = regex.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
    return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities
After running, it reports 996 replacements.
But only 15 changes were actually made: the 10 wrong hyphens in chapter 2 were corrected, along with these 5: frou-frou (2), en-tête, par-dessus, porte-manteaux. These last five should not have been changed, presumably the dictionary also accepts their joined spellings, but that is another story.
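
To see how a count of matched text nodes can be much larger than the number of nodes actually changed, here is a standalone sketch of the hyphen function above. It swaps in the standard `re` module for `regex`, and `recognized` with its `KNOWN_WORDS` set is a hypothetical stand-in for `dictionaries.recognized`; the sample HTML is made up.

```python
import re

# Hypothetical stand-in for the French dictionary; "pardessus" is included
# to show how a legitimate compound like "par-dessus" can get joined.
KNOWN_WORDS = {"bonjour", "maison", "pardessus"}


def recognized(word):
    return word.lower() in KNOWN_WORDS


def dehyphenate(text):
    """Join hyphen-split words when the joined form is a known word.

    Returns (new_text, words_joined).
    """
    joined = 0

    def replace_word(wmatch):
        nonlocal joined
        without_hyphen = wmatch.group(1) + wmatch.group(2)
        if recognized(without_hyphen):
            joined += 1
            return without_hyphen
        return wmatch.group()

    new_text = re.sub(r"(\w+)\s*-\s*(\w+)", replace_word, text)
    return new_text, joined


html = ("<p>Bon-jour</p><p>La mai- son</p><p>Rien</p>"
        "<p>porte-plume</p><p>par-dessus</p>")

# Same search expression as above: every non-empty text node is a match
text_nodes = re.findall(r">[^<>]+<", html)
results = [dehyphenate(node) for node in text_nodes]
changed_nodes = sum(1 for node, (new, _) in zip(text_nodes, results) if new != node)

print(len(text_nodes), changed_nodes)
```

Every text node is handed to the function, but only those containing a joinable word come back different, so a per-match total says little about how many words were really fixed.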
Attached Files
File Type: epub Gloriette v2.epub (426.1 KB, 233 views)

Last edited by roger64; 06-18-2016 at 09:17 AM.