View Single Post
Old 09-21-2016, 03:33 AM   #6
roger64
Wizard
roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.roger64 ought to be getting tired of karma fortunes by now.
 
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Hi

and here is a new version of this function taking into account the elided forms (at least for French language) thanks to Olivier, the author of -opensource- Grammalecte.

Spoiler:

Code:
import regex
from calibre import replace_entities, prepare_string_for_xml
 
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    def fix_word(m):
        word = m.group()
        if dictionaries.recognized(word):
            return word
        for i in xrange(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        m = regex.match(r"(\w+)((?:[dlnmts]|qu(?:oi|el)qu|puisqu|lorsqu|jusqu|qu)[’'`]\w+)", word)
        if m:
            return m.group(1) + " " + m.group(2)
        return word
    text = replace_entities(match.group(1))
    text = regex.sub(r"\b\w(?:[\w’'`-]*\w|\w+)\b", fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'

or here: http://pastebin.com/quGQQzcN

Last edited by roger64; 09-21-2016 at 06:10 AM. Reason: pastebin
roger64 is offline   Reply With Quote