MobileRead Forums - View Single Post

roger64 · 04-19-2014, 01:53 AM

Hi

Dealing with French apostrophes

As you probably know, French text should normally use curly apostrophes (also called "typographic"), though many people use straight ones too, so in practice both forms need to be handled.

I gave an EPUB to the French author of Grammalecte, Olivier_R., and asked him to comment on the problem of elided forms. He gave me the following reply. I kept the original French text and put an English translation under each paragraph.

Here it is:

Deux solutions :
Two solutions:
1. La plus simple, mais pas forcément adéquate dans tous les cas : lorsqu’un mot n’est pas reconnu par le dictionnaire français, demandez à ce que l’apostrophe soit considérée comme un séparateur et ne retenir que la deuxième partie (celle qui suit l’apostrophe) dans le décompte des mots.

1. The simplest, though not necessarily adequate in every case: when a word is not recognized by the French dictionary, have the apostrophe treated as a separator ("séparateur") and keep only the second part (the one after the apostrophe) in the word count.

Code:

if "’" in word: word = word[word.find("’")+1:]
if "'" in word: word = word[word.find("'")+1:]

Ce code n’est pas vraiment optimisé. Il faut le tester. Cela ira peut-être plus vite avec une expression régulière.
Les deux lignes ne sont pas forcément nécessaires; cela dépend de la façon dont sont gérées les apostrophes, si elles sont transformées avant de faire la demande à Hunspell.

This code is not really optimized and needs testing. A regex might be faster. Both lines may not be necessary; it depends on how apostrophes are handled, and on whether they are converted before Hunspell is queried.
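A minimal sketch of solution 1 using one regex instead of two string searches, applied only when the word is not recognized. The names `strip_before_apostrophe`, `count_unknown` and `is_known` are hypothetical; `is_known` stands in for whatever dictionary lookup the program really uses (Hunspell or otherwise). Unlike the two lines above, this variant cuts at the first apostrophe only, whichever form it has.

```python
import re

# Everything up to and including the first apostrophe, curly or straight.
_UP_TO_APOSTROPHE = re.compile("^[^'’]*['’]")


def strip_before_apostrophe(word):
    """Return the part of word after its first apostrophe, or word unchanged."""
    return _UP_TO_APOSTROPHE.sub("", word, count=1)


def count_unknown(words, is_known):
    """Count words still unrecognized after dropping the elided part.

    is_known is a hypothetical callable standing in for the dictionary lookup.
    """
    unknown = 0
    for word in words:
        if is_known(word):
            continue  # recognized as-is, e.g. "aujourd'hui"
        if not is_known(strip_before_apostrophe(word)):
            unknown += 1
    return unknown
```

Checking the whole word first matters: a word such as "aujourd'hui" contains an apostrophe but should be looked up unchanged before anything is cut off.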

2. Plus exact, mais plus coûteux en ressources, j’imagine : éliminer par expression régulière tout ce qui commence par l’, d’, j’, c’, s’, t’, m’, qu’, etc. avant de regarder si le mot existe et de le comptabiliser dans les mots inconnus. (Plus coûteux en ressources, mais ça ne devrait pas être rédhibitoire.)

2. More precise, but I imagine more resource-intensive: use a regex to strip anything beginning with l’, d’, j’, c’, s’, t’, m’, qu’, etc. before checking whether the word exists and counting it among the unknown words. (More resource-intensive, but that should not be prohibitive.)

Code:

import re

rElidedPrefix = re.compile(u"(?i)^(l|d|m|t|s|j|c|ç|qu)['’]")
[...]
word = rElidedPrefix.sub("", word)

Et il faudrait spécifier au moteur de reconnaissance de ne faire ceci que pour le français.

It would be necessary to tell the engine to do this only for French.
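A rough sketch of how the French-only restriction could look, reusing the same pattern. The function name `strip_elided_prefix` and the `lang` parameter (a language code such as "fr" or "fr-FR") are assumptions for illustration; how the real engine knows the language of a word is not covered here.

```python
import re

# Same pattern as in the post; the prefix list is not exhaustive (the post says "etc.").
ELIDED_PREFIX = re.compile("(?i)^(l|d|m|t|s|j|c|ç|qu)['’]")


def strip_elided_prefix(word, lang):
    """Strip a leading elided prefix, but only when lang is French.

    lang is a hypothetical language code such as "fr", "fr-FR" or "fr_CA".
    """
    primary = lang.lower().replace("_", "-").split("-")[0]
    if primary != "fr":
        return word  # leave other languages untouched
    return ELIDED_PREFIX.sub("", word)
```

With the pattern as posted, "l’arbre" becomes "arbre" and "Qu'il" becomes "il", while "n'est" is left alone because "n" is not in the prefix list; the list would probably need extending for real use.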

Le code est générique et n’est vraisemblablement pas du tout adapté à son propre code. Je n’ai pas le temps de me plonger dans son programme. Mais avec ça, il comprendra ce qui est nécessaire pour le français.

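This is generic code and is probably not adapted at all to your own code. He does not have time to dig into your program, but he hopes that with this you will understand what is needed for French.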

Last edited by roger64; 04-19-2014 at 10:36 AM.