Hi
I had to prepare it for another book, so I attach a reduced version of a French book. As a first check, it now contains:
- 0 nnbsp (\u202F)
- 371 hyphen-minus characters (-), plus the 10 wrong ones set in chapter 2: 381 in total
1. The first function is about nnbsp (search in function mode).
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    text = match.group()
    # Typographic apostrophe
    text = text.replace("'", "’")
    # Mark every position that needs a narrow no-break space with a temporary "@"
    text = text.replace(">— ", ">—@").replace(">—", ">—@")
    text = text.replace(" !", "@!").replace("!", "@!")
    text = text.replace(" ?", "@?").replace("?", "@?")
    text = text.replace(" ;", "@;").replace(";", "@;")
    text = text.replace(" :", "@:").replace(":", "@:")
    text = text.replace("« ", "«@").replace("«", "«@")
    text = text.replace(" »", "@»").replace("»", "@»")
    # Collapse doubled markers, then turn each marker into a nnbsp
    return text.replace("@@", "@").replace("@", "\u202f")
It reports 536 replacements done, but if you then count with a regex there are 650 nnbsp in the book.
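That mismatch is expected: the editor counts one "replacement" per regex match, while a single match can contain several punctuation marks and so end up with several nnbsp. Counting the actual characters with a regex gives the real total. A minimal sketch, assuming a hypothetical sample string in place of the book's HTML:

```python
import re

# Hypothetical sample; "text" stands in for the book's HTML (assumption)
text = "«\u202fBonjour\u202f» dit-il\u202f!"

# Count the narrow no-break spaces actually present
nnbsp_count = len(re.findall("\u202f", text))
print(nnbsp_count)  # 3
```

Here one sentence (one possible match) contributes three nnbsp, so the per-match count and the per-character count diverge.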
2. The second function is about hyphens, coupled with the French dictionary (search in function mode).
Code:
import regex
from calibre import replace_entities
from calibre import prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    def replace_word(wmatch):
        # Try to remove the hyphen and join the words if the resulting
        # hyphen-free word is recognized by the dictionary
        without_hyphen = wmatch.group(1) + wmatch.group(2)
        if dictionaries.recognized(without_hyphen):
            return without_hyphen
        return wmatch.group()

    # Search for words split by a hyphen
    text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &amp;
    corrected = regex.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
    return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities
After it ran, it reported 996 replacements, but only 15 were actually done: the 10 wrong hyphens in chapter 2 were corrected, plus the following 5 occurrences: frou-frou (×2), en-tête, par-dessus, porte-manteaux. These last five should not have been corrected, but that is another story.
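The gap between 996 and 15 has the same cause as before: the tool reports one replacement per match of the search expression, even when replace_word hands the match back unchanged because the joined word is not in the dictionary. A minimal sketch of that behavior, using the stdlib re module and a stub word set in place of calibre's dictionaries object (the word list and sample text are assumptions for illustration):

```python
import re

# Stub standing in for calibre's dictionaries.recognized() (assumption)
RECOGNIZED = {"parmi", "toujours"}

changed = []  # records the words that were actually joined

def replace_word(wmatch):
    without_hyphen = wmatch.group(1) + wmatch.group(2)
    if without_hyphen.lower() in RECOGNIZED:
        changed.append(without_hyphen)
        return without_hyphen
    return wmatch.group()  # match counted, but text unchanged

# Hypothetical sample text (assumption)
text = "par-mi les frou-frou et tou-jours"

reported = len(re.findall(r"(\w+)-(\w+)", text))   # what the tool would report
corrected = re.sub(r"(\w+)-(\w+)", replace_word, text)

print(reported, len(changed))  # 3 2
print(corrected)               # parmi les frou-frou et toujours
```

Three matches are "reported", but only two words change, because frou-frou stays hyphenated when "froufrou" is not recognized.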