You want to check whether your acronym is at the end of a paragraph (A.B.</p>), at the end of a sentence (A.B. Then something), or followed by a word in lowercase.
I suggest a regex function, used with this regex:
Code:
((?:\p{Lu}\.){2,})(?:(<(?:/p|/div|br/|/blockquote)>)|( \p{Lu})|( \p{Ll}))
\p{Lu} is an uppercase letter, \p{Ll} is a lowercase letter, and (?:...) is a non-capturing group.
((?:\p{Lu}\.){2,}) captures in match.group(1) every acronym with at least 2 letters (change the 2 to 3 if you only want acronyms of 3 letters or more).
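To see what each group captures, here is a small sketch you can run outside the editor. It assumes Python's standard re module, which does not understand \p{Lu} or \p{Ll}, so [A-Z] and [a-z] are swapped in as an ASCII-only stand-in (accented capitals will not match here). The name show_groups is just a helper made up for the illustration.

```python
import re

# ASCII-only stand-in for the regex above (assumption for this sketch):
# [A-Z] replaces \p{Lu}, [a-z] replaces \p{Ll}.
PATTERN = re.compile(
    r'((?:[A-Z]\.){2,})(?:(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( [a-z]))'
)

def show_groups(text):
    """Return groups 1 to 4 of the first match, or None if nothing matches."""
    m = PATTERN.search(text)
    return m.groups() if m else None

# Group 2 is set when a closing tag follows the acronym.
print(show_groups('<p>Made in the U.S.A.</p>'))           # ('U.S.A.', '</p>', None, None)
# Group 3 is set when a space and a capital letter follow.
print(show_groups('He joined the F.B.I. Then he left.'))  # ('F.B.I.', None, ' T', None)
# Group 4 is set when a space and a lowercase letter follow.
print(show_groups('The U.N. said so.'))                   # ('U.N.', None, None, ' s')
```

Only one of groups 2, 3 and 4 is ever set for a given match, which is what the function relies on to decide about the period.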
The function adds a period or not, depending on what comes after your acronym. It may not cover all cases; it is up to you to check. It would be wise to clean up the book first, to avoid unexpected line breaks, a space before </p>, etc.
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    acro = match.group(1).replace('.', '')
    if end := match.group(2):    # </p> or <br/> etc.
        period = '.'
    elif end := match.group(3):  # <space>[A-Z]
        period = '.'
    elif end := match.group(4):  # <space>[a-z]
        period = ''
    else:
        end = ''
        period = ''
    return acro + period + end
See if you want to consider other cases. Note that, e.g., A.B. Lda will give AB. Lda, because the next word starts with a capital letter.
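If you want to try the logic on a few strings before running it on the book, something like the sketch below may help. It is only an approximation of the setup above: it uses Python's standard re module with [A-Z] and [a-z] in place of \p{Lu} and \p{Ll} (so ASCII letters only), and fix_acronym and convert are names invented for this test, with the editor-specific arguments left out.

```python
import re

# ASCII-only stand-in for the regex above (assumption for this sketch).
PATTERN = re.compile(
    r'((?:[A-Z]\.){2,})(?:(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( [a-z]))'
)

def fix_acronym(match):
    """Same decision as the replace function, without the editor arguments."""
    acro = match.group(1).replace('.', '')
    # A closing tag or a capitalised next word: keep one final period.
    end = match.group(2) or match.group(3)
    if end:
        return acro + '.' + end
    # A lowercase next word: no period.
    return acro + (match.group(4) or '')

def convert(text):
    """Apply the replacement to every acronym found in text."""
    return PATTERN.sub(fix_acronym, text)

print(convert('<p>Made in the U.S.A.</p>'))    # <p>Made in the USA.</p>
print(convert('The F.B.I. Then he left.'))     # The FBI. Then he left.
print(convert('The U.N. said so.'))            # The UN said so.
print(convert('A.B. Lda'))                     # AB. Lda
```

The last line shows the limitation mentioned above: a capitalised word after the acronym is always treated as the start of a new sentence.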