Hi
Some months ago, you gave us a nice function that splits words that have been "glued" together. After a mistake of mine, I had the opportunity to use this function on a lot of words in a French EPUB. I had of course installed a French dictionary. Please read on...

The results were amazingly good and quick.
Code:
import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def fix_word(m):
        word = m.group()
        # Leave words the dictionary already knows untouched
        if dictionaries.recognized(word):
            return word
        # Try each split point; keep the first one where both halves are known words
        # (range instead of xrange, so this also runs on Python 3 builds of calibre)
        for i in range(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word

    # The search expression captures the text between two tags in group 1
    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'
Running the Calibre editor's spell checker before and after using this function, I could see that the number of words unknown to the dictionary went down from 1167 to 261. Probably two thirds of the remaining ones are proper nouns, but I still noticed that a few words had not been split (probably 50 to 70).
The cause is related to elided forms. Here are some of them. The pattern is easy to see: a word followed by a single letter and a curly apostrophe, these last two elements being characteristic of elided forms in French.
accompagnentn’auront
àl’origine
dansl’entrée
dem’expliquer
des’opposer
Etj’écrasai
ils’attendait
manueld’algèbre
What makes me hope that the function could be improved to handle elided forms is that, for all of them, the first suggestion from the Calibre editor's dictionary is the correct split.
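
To make the request more concrete, here is a rough sketch of how I imagine the elided case could be handled. It is only my guess, not the actual function: split_glued, fix_text, ELIDED_LETTERS and the recognized argument are names I made up, and recognized stands in for dictionaries.recognized so that the idea can be tried outside the editor. As far as I can tell, two things are needed: the word pattern has to include the apostrophe (otherwise \b\w+\b stops at it), and the second half has to be accepted when it is one letter plus an apostrophe plus a known word.

```python
import re

# Assumption: straight and curly apostrophes are both treated as elision marks.
APOSTROPHES = "'\u2019"
# Assumption: single letters that can be elided in French (c', d', j', l', m', n', s', t').
ELIDED_LETTERS = set("cdjlmnst")

# A "word" here may contain apostrophes, so a glued elided form is matched as one token.
WORD = re.compile(r"\w+(?:['\u2019]\w+)*")


def split_glued(word, recognized):
    """Return word with a space inserted at the first plausible split point."""
    if recognized(word):
        return word
    for i in range(1, len(word) - 1):
        a, b = word[:i], word[i:]
        if not recognized(a):
            continue
        # Plain case: both halves are known words.
        if recognized(b):
            return a + ' ' + b
        # Elided case: b is letter + apostrophe + known word, e.g. "n’auront".
        if (len(b) > 2 and b[0].lower() in ELIDED_LETTERS
                and b[1] in APOSTROPHES and recognized(b[2:])):
            return a + ' ' + b
    return word


def fix_text(text, recognized):
    """Apply split_glued to every word-like token in text."""
    return WORD.sub(lambda m: split_glued(m.group(), recognized), text)


if __name__ == "__main__":
    # Toy stand-in for the real dictionary, for demonstration only.
    toy = {"accompagnent", "auront", "manuel", "algèbre"}
    print(fix_text("accompagnentn’auront manueld’algèbre", lambda w: w.lower() in toy))
```

Inside the editor, the same test would go into fix_word, with dictionaries.recognized passed in place of the toy lookup. Two-letter elisions such as "qu’" are not covered by this sketch.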