Old 12-22-2021, 03:30 PM   #4
lomkiri
You want to know whether your acronym sits at the end of a paragraph (A.B.</p>), at the end of a sentence (A.B. Then something), or is followed by a word in lowercase.

I suggest a regex function that works with this regex:
Code:
((?:\p{Lu}\.){2,})(?:(<(?:/p|/div|br/|/blockquote)>)|( \p{Lu})|( \p{Ll}))
\p{Lu} is an uppercase letter, \p{Ll} a lowercase letter, and (?: opens a non-capturing group.
((?:\p{Lu}\.){2,}) captures in match.group(1) every acronym with at least 2 letters (use 3 instead of 2 if you only want acronyms of 3 letters or more).
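
To make the group numbering concrete, here is a small sketch that can be run outside calibre. It is only an approximation: the standard-library `re` module does not support `\p{Lu}` / `\p{Ll}`, so `[A-Z]` and `[a-z]` stand in for them (ASCII only), and `which_groups` is a hypothetical helper name used just for this illustration.

```python
import re

# ASCII-only approximation of the regex above: [A-Z] replaces \p{Lu} and
# [a-z] replaces \p{Ll}, because the stdlib `re` module has no \p{...}.
PATTERN = re.compile(
    r'((?:[A-Z]\.){2,})'                                      # group 1: the acronym
    r'(?:(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( [a-z]))'  # groups 2, 3, 4
)

def which_groups(text):
    """Return the four captured groups of the first match, or None."""
    m = PATTERN.search(text)
    return m.groups() if m else None

print(which_groups('the U.S.A.</p>'))              # group 2 is filled (closing tag)
print(which_groups('in the U.K. Then something'))  # group 3 is filled (space + uppercase)
print(which_groups('the U.N. said'))               # group 4 is filled (space + lowercase)
print(which_groups('A. lone initial'))             # no match: fewer than 2 letters
```

Only one of groups 2, 3 and 4 is filled for a given match; the other two are None, which is what a replacement function can test for.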

The function adds a period or not, depending on what follows your acronym. It may not cover every case, so it is up to you to check. It would be wise to polish the book first, to avoid unexpected line breaks, a space before </p>, etc.
Code:
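def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Strip the periods inside the acronym: "A.B." -> "AB"
    acro = match.group(1).replace('.', '')
    if end := match.group(2):      # closing tag: </p>, </div>, <br/> or </blockquote>
        period = '.'
    elif end := match.group(3):    # <space>[A-Z]: a new sentence starts
        period = '.'
    elif end := match.group(4):    # <space>[a-z]: the sentence continues
        period = ''
    else:                          # should not happen with this regex; kept as a safeguard
        end = ''
        period = ''

    return acro + period + end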
See whether you want to handle other cases. Note that, for example, A.B. Lda becomes AB. Lda, because the next word starts with a capital letter.
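
If you want to try the idea on a few sample strings before running it on the book, the following is a minimal stand-alone sketch. It makes two assumptions: `[A-Z]` / `[a-z]` replace `\p{Lu}` / `\p{Ll}` because the standard-library `re` module does not support them (so it is ASCII only), and `apply` is a hypothetical helper, not part of calibre. The replacement logic is a condensed version of the function above.

```python
import re

# ASCII-only stand-in for the regex of this post (stdlib `re` has no \p{Lu}).
PATTERN = re.compile(
    r'((?:[A-Z]\.){2,})'
    r'(?:(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( [a-z]))'
)

def replace(match, *args, **kwargs):
    # Strip the periods inside the acronym: "A.B." -> "AB"
    acro = match.group(1).replace('.', '')
    # Whatever followed the acronym must be put back unchanged.
    end = match.group(2) or match.group(3) or match.group(4) or ''
    # Keep a final period before a closing tag or a capitalized word only.
    period = '.' if (match.group(2) or match.group(3)) else ''
    return acro + period + end

def apply(text):
    """Hypothetical helper: run the replacement over a whole string."""
    return PATTERN.sub(replace, text)

print(apply('<p>He joined the U.S.A.</p>'))  # <p>He joined the USA.</p>
print(apply('The U.K. Then something'))      # The UK. Then something
print(apply('The U.N. said so'))             # The UN said so
print(apply('A.B. Lda'))                     # AB. Lda
```

The last line reproduces the A.B. Lda case: the capital L is taken as the start of a new sentence, so the period is kept.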

Last edited by lomkiri; 12-22-2021 at 04:25 PM.