MobileRead Forums - View Single Post

dicollecte · 04-19-2014, 07:58 AM

Quote:

Originally Posted by kovidgoyal

@roger64: I can probably do what you are asking, but this sort of thing really should be done by the spell check library, which is hunspell in this case. I wonder why it does not do this itself? I am worried that doing so may have some unintended side effects.

Hi,

I’m the maintainer of the French spelling dictionary. Hunspell handles apostrophes properly, but it cannot solve our issue here.

In French, apostrophes are mostly used to create elided forms, e.g.
l’animal = the animal
j’aime = I like

There are hundreds of thousands of words with elided forms in French. Hunspell recognizes them as it should.

But when a word is not recognized by the dictionary, there is no reason to list it several times, once for each elided form.

Take articulateur as an example. This word is not in the dictionary. When Calibre encounters it as l’articulateur, d’articulateur and articulateur, it treats them as three different words, although it is the same word with or without an elided prefix.

That’s why these elided prefixes should be stripped when counting unrecognized words.

Here is a simple algorithm in Python which explains what would be useful for the French language:

Code:

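import re

# Elided prefix followed by a straight or typographic apostrophe (needed for FR locale)
rElidedPrefix = re.compile(u"(?i)^(l|d|m|t|s|j|c|ç|lorsqu|puisqu|quoiqu|qu)['’]")
dict_of_unknown_words = {}

# list_of_all_words_in_text and Hunspell.isvalid() are placeholders for the
# application's own word list and spell-check call
for word in list_of_all_words_in_text:
    if not Hunspell.isvalid(word):
        # Strip the elided prefix so that all forms are counted as one word (needed for FR locale)
        word = rElidedPrefix.sub("", word)
        dict_of_unknown_words[word] = dict_of_unknown_words.get(word, 0) + 1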
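
To try the idea without Hunspell installed, here is a minimal self-contained sketch of the same approach. The names KNOWN_WORDS, is_valid and count_unknown_words are made up for this illustration and are not part of Hunspell or Calibre; the small word set only stands in for a real dictionary lookup.

```python
import re
from collections import Counter

# Same prefix pattern as in the algorithm above.
ELIDED_PREFIX = re.compile(r"(?i)^(l|d|m|t|s|j|c|ç|lorsqu|puisqu|quoiqu|qu)['’]")

# Assumption: a tiny set of known words stands in for the real dictionary.
KNOWN_WORDS = {"animal", "l’animal", "aime", "j’aime"}


def is_valid(word):
    """Hypothetical stand-in for the spell checker's lookup."""
    return word.lower() in KNOWN_WORDS


def count_unknown_words(words):
    """Count unrecognized words, merging elided forms into the bare word."""
    counts = Counter()
    for word in words:
        if not is_valid(word):
            # Strip the elided prefix only from words the checker rejected.
            counts[ELIDED_PREFIX.sub("", word)] += 1
    return counts


words = ["l’articulateur", "d’articulateur", "articulateur", "l’animal", "j’aime"]
unknown = count_unknown_words(words)
print(unknown)
```

With this sample list, the three forms of articulateur end up under a single entry, while l’animal and j’aime are skipped because the stand-in dictionary accepts them.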

HTH.