MobileRead Forums - View Single Post

roger64 · 09-19-2016, 04:29 PM

Hi

Some months ago, you gave us a nice function that splits words "glued" together. After a mistake of mine, I had the opportunity to run this function on a large number of words in a French EPUB. I had of course installed a French dictionary. Please read on...

The results were amazingly good and quick.

Spoiler:

Using the Calibre Editor spell checker before and after running this function, I could see that the number of words unknown to the dictionary went down from 1167 to 261. Even allowing for the fact that probably two thirds of the remaining ones were "noms propres" (proper nouns), I noticed that a few words had not been split (probably 50 to 70).

The cause was related to a kind of elided form. Here are some of them. The pattern is easy to see: a word followed by one letter and one curly apostrophe, these last two elements being characteristic of elided forms in French.

accompagnentn’auront
àl’origine
dansl’entrée
dem’expliquer
des’opposer
Etj’écrasai
ils’attendait
manueld’algèbre

What makes me hope that the function could be improved to handle elided forms is that, for all of them, the first suggestion of the Calibre editor's dictionary is to split them correctly.
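One possible way to handle the forms listed above, as a minimal sketch rather than a finished fix. It works on the pattern described: a run of letters whose last letter is an eliding consonant, then an apostrophe, then another run of letters. It splits only when both the part before that consonant and the word after the apostrophe are recognized. Everything here is an assumption for illustration: the names `split_elided`, `ELIDED_LETTERS`, `GLUED` and the toy word list are hypothetical, the standard `re` module stands in for `regex`, and `recognized` is any callable that says whether a word is known.

```python
import re

# Consonants that commonly elide before a vowel in French (c', d', j', l', m', n', s', t').
# Assumption: illustrative list only; forms such as "qu'" are deliberately left alone.
ELIDED_LETTERS = "cdjlmnst"

# Letters, one apostrophe (straight or curly), then more letters.
GLUED = re.compile(r"(\w+)(['\u2019])(\w+)")


def split_elided(text, recognized):
    """Insert a space before a glued elided form, e.g. 'dansl’entrée' -> 'dans l’entrée'."""

    def fix(m):
        left, apostrophe, right = m.groups()
        head, clitic = left[:-1], left[-1]
        # Split only if something precedes the eliding letter and both
        # surrounding words are known to the dictionary.
        if (head and clitic.lower() in ELIDED_LETTERS
                and recognized(head) and recognized(right)):
            return head + " " + clitic + apostrophe + right
        return m.group()

    return GLUED.sub(fix, text)


# Toy stand-in for a French dictionary (hypothetical, for the example only).
FRENCH_WORDS = {"à", "origine", "dans", "entrée", "de", "expliquer", "et",
                "écrasai", "il", "attendait", "manuel", "algèbre",
                "accompagnent", "auront", "opposer"}


def toy_recognized(word):
    return word.lower() in FRENCH_WORDS


for sample in ["àl’origine", "Etj’écrasai", "manueld’algèbre", "l’origine"]:
    print(sample, "->", split_elided(sample, toy_recognized))
```

Inside the Calibre editor, `recognized` would presumably be `dictionaries.recognized`, but whether that call accepts these word fragments as expected is untested here. Correct forms such as "l’origine" or "aujourd’hui" are left unchanged, because nothing recognizable precedes the eliding letter.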

#1 · adjusting a function

Spoiler (the code used):

Code:
>([^<]+)<

Code:
import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, args, kwargs):

    def fix_word(m):
        word = m.group()
        # Leave words the dictionary already knows untouched.
        if dictionaries.recognized(word):
            return word
        # Try each split point that leaves at least two characters on the right.
        for i in range(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word

    # Work on the text between tags: decode entities, fix words, re-escape.
    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'
Last edited by roger64; 09-19-2016 at 04:39 PM.