MobileRead Forums - View Single Post

roger64 · 06-18-2016, 07:45 AM

Hi

I had to prepare it for another book. I am attaching a reduced version of a French book. As you can check beforehand, it now contains:
- 0 nnbsp (\u202F)
- 371 hyphen-minus (-) + 10 wrong ones placed in chapter 2, total: 381
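To check these counts yourself, something like the following should do. It is only a sketch: `count_marks` and the sample string are made up for illustration, and in practice you would feed it the text of each HTML file of the book.

```python
import re

def count_marks(text):
    # Count narrow no-break spaces (U+202F) and hyphen-minus characters.
    return {
        "nnbsp": len(re.findall("\u202f", text)),
        "hyphen_minus": len(re.findall("-", text)),
    }

# Made-up sample: two hyphen-minus, one nnbsp before the colon.
sample = "Un porte-manteaux, un en-tête\u202f: voilà."
counts = count_marks(sample)
print(counts)
```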

1. - The first function deals with nnbsp
search:

Code:

>[^\n<]*?<

function text:

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # "@" is a temporary placeholder; the last call turns every "@" into a nnbsp.
    # Note: any "@" already present in the text is converted as well.
    return (match.group()
            .replace("'", "’")
            .replace(">— ", ">—@").replace(">—", ">—@")
            .replace(" !", "@!").replace("!", "@!")
            .replace(" ?", "@?").replace("?", "@?")
            .replace(" ;", "@;").replace(";", "@;")
            .replace(" :", "@:").replace(":", "@:")
            .replace("« ", "«@").replace("«", "«@")
            .replace(" »", "@»").replace("»", "@»")
            .replace("@@", "@")
            .replace("@", "\u202f"))

For me it reports: 536 replacements done.
But if you check with a regex, there are now 650 nnbsp.
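A possible explanation for the gap, assuming the reported figure counts regex matches (the text runs between > and <) rather than characters inserted: a single match can receive several nnbsp. The sketch below applies the same replacement chain outside calibre with the standard `re` module; `add_nnbsp` and the sample string are made up for illustration.

```python
import re

NNBSP = "\u202f"

# Same replacement pairs, in the same order, as the function above.
PAIRS = [
    ("'", "’"),
    (">— ", ">—@"), (">—", ">—@"),
    (" !", "@!"), ("!", "@!"),
    (" ?", "@?"), ("?", "@?"),
    (" ;", "@;"), (";", "@;"),
    (" :", "@:"), (":", "@:"),
    ("« ", "«@"), ("«", "«@"),
    (" »", "@»"), ("»", "@»"),
    ("@@", "@"), ("@", NNBSP),
]

def add_nnbsp(match):
    text = match.group()
    for old, new in PAIRS:
        text = text.replace(old, new)
    return text

sample = "<p>« Quoi ? Non ! »</p><p>Rien ici</p>"

# subn returns the new string and the number of matches replaced.
result, n_matches = re.subn(r">[^\n<]*?<", add_nnbsp, sample)
n_nnbsp = result.count(NNBSP)
print(n_matches, n_nnbsp)
```

Here the search expression matches three times (including the empty run between the two paragraphs), but four nnbsp end up in the text, all of them in the first match.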

2. - The second function removes hyphens (coupled with the French dictionary)

search:

Code:

>[^<>]+<

function text:

Code:

import regex
from calibre import replace_entities
from calibre import prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def replace_word(wmatch):
        # Try to remove the hyphen and replace the words if the resulting
        # hyphen free word is recognized by the dictionary
        without_hyphen = wmatch.group(1) + wmatch.group(2)
        if dictionaries.recognized(without_hyphen):
            return without_hyphen
        return wmatch.group()

    # Search for words split by a hyphen
    text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &amp;
    corrected = regex.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
    return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities

After running, it reports 996 replacements.
But only 15 were actually made: the 10 wrong ones in chapter 2 were corrected, plus the following 5: frou-frou (2), en-tête, par-dessus, porte-manteaux. These last five should not have been changed, but that is another story.
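The 996 is probably the number of text runs matched by the search expression, since the function returns a string for every match whether or not anything changed; this is an assumption, not something verified in calibre itself. Below is a standalone sketch that counts only the real changes. It uses the standard `re` module instead of `regex`, and `KNOWN_WORDS` / `recognized` are hypothetical stand-ins for `dictionaries.recognized()`.

```python
import re

# Hypothetical word list standing in for the French dictionary.
KNOWN_WORDS = {"bonjour", "pardessus"}

def recognized(word):
    return word.lower() in KNOWN_WORDS

def dehyphenate(text):
    def replace_word(wmatch):
        # Join the two halves only if the result is a known word.
        joined = wmatch.group(1) + wmatch.group(2)
        return joined if recognized(joined) else wmatch.group()
    return re.sub(r"(\w+)\s*-\s*(\w+)", replace_word, text)

# Made-up text runs, as the search expression >[^<>]+< would deliver them.
nodes = [">bon-jour<", ">par-dessus<", ">rien<", ">vis-à-vis<"]

calls = 0     # how many times the function is invoked (one per match)
changed = 0   # how many matches were really modified
for node in nodes:
    calls += 1
    new_node = ">%s<" % dehyphenate(node[1:-1])
    if new_node != node:
        changed += 1

print(calls, changed)
```

With this sample the function is called four times but changes only two runs. It also shows why par-dessus was touched: the joined form is itself a dictionary word.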