MobileRead Forums - View Single Post

lomkiri · 12-22-2021, 03:30 PM

You want to check whether your acronym is at the end of a paragraph (A.B.</p>), at the end of a sentence (A.B. Then something), or followed by a word in lowercase.

I suggest a regex function that works with this regex:

Code:

((?:\p{Lu}\.){2,})(?:(<(?:/p|/div|br/|/blockquote)>)|( \p{Lu})|( \p{Ll}))

\p{Lu} is an uppercase letter, \p{Ll} a lowercase letter, and (?: opens a non-capturing group.
((?:\p{Lu}\.){2,}) captures in match.group(1) every acronym with at least 2 letters (put 3 instead of 2 if you want to start with 3-letter acronyms).
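To see which group fires in each situation, here is a small standalone sketch. It is only an approximation: Python's built-in re module does not understand \p{Lu} and \p{Ll}, so I assume ASCII text and swap in [A-Z] and [a-z]. The helper name describe is made up for the illustration.

```python
import re

# ASCII stand-in for the regex above: [A-Z] replaces \p{Lu}, [a-z] replaces \p{Ll}
# (an assumption for this sketch; accented letters are not covered).
ACRO = re.compile(
    r"((?:[A-Z]\.){2,})(?:(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( [a-z]))"
)

def describe(text):
    """Return (acronym, number of the follow-up group that matched), or None."""
    m = ACRO.search(text)
    if m is None:
        return None
    # When the pattern matches, exactly one of groups 2, 3, 4 is not None.
    which = next(i for i in (2, 3, 4) if m.group(i) is not None)
    return m.group(1), which

print(describe("the U.S.A.</p>"))       # closing tag follows: group 2
print(describe("in the U.K. Then we"))  # space + uppercase follows: group 3
print(describe("the U.N. said"))        # space + lowercase follows: group 4
print(describe("A. Smith"))             # a single initial is not matched
```

Anything else after the acronym (a comma, a digit, a quote) matches none of the three alternatives, so the acronym is left alone.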

The function keeps a period or not, depending on what comes after your acronym. It may not cover all cases; it's up to you to check. It would be wise to polish the book first, to avoid unexpected line breaks, a space before </p>, etc.
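As a rough idea of what "polishing" could mean here, the sketch below removes whitespace sitting just before a closing tag and turns stray line breaks into single spaces. The helper name tidy is hypothetical, and I assume it is applied to ordinary paragraph text only (it would also flatten anything inside <pre>), so treat it as a starting point rather than a ready-made cleanup.

```python
import re

def tidy(html):
    """Normalize whitespace that would otherwise hide an acronym from the regex."""
    # Drop spaces, tabs, and line breaks right before a closing block tag or <br/>.
    html = re.sub(r"\s+(</(?:p|div|blockquote)>|<br/>)", r"\1", html)
    # Collapse every remaining run of whitespace (line breaks included) to one space.
    return re.sub(r"\s+", " ", html)

print(tidy("<p>Made in the U.S.A. </p>"))
print(tidy("<p>She left the U.K.\nThen she stayed.</p>"))
```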

Code:

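def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Strip the periods inside the acronym: "A.B." -> "AB"
    acro = match.group(1).replace('.', '')
    if end := match.group(2):    # </p> or <br/> etc.: end of paragraph, keep one period
        period = '.'
    elif end := match.group(3):  # <space>[A-Z]: probably end of sentence, keep one period
        period = '.'
    elif end := match.group(4):  # <space>[a-z]: middle of a sentence, no period
        period = ''
    else:                        # not reachable with this regex, kept as a safeguard
        end = ''
        period = ''

    return acro + period + end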

See if you want to consider other cases. Note that, e.g., A.B. Lda will give AB. Lda, because the next word starts with a capital letter.
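If you want to try the logic outside the editor before running it on a book, something like the sketch below should do. It makes two assumptions: the text is ASCII (so [A-Z] and [a-z] stand in for \p{Lu} and \p{Ll}, which Python's built-in re does not support), and the callback takes only the match object. The names strip_acro and convert are invented for this example.

```python
import re

# ASCII stand-in for the regex of this post (assumption: no accented letters).
PATTERN = re.compile(
    r"((?:[A-Z]\.){2,})(?:(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( [a-z]))"
)

def strip_acro(match):
    """Same decision as the replace() function, without the editor's extra arguments."""
    acro = match.group(1).replace(".", "")
    if match.group(4) is not None:          # <space>[a-z]: mid-sentence, no period
        return acro + match.group(4)
    end = match.group(2) or match.group(3)  # closing tag, or <space>[A-Z]
    return acro + "." + end

def convert(text):
    return PATTERN.sub(strip_acro, text)

samples = [
    "<p>Made in the U.S.A.</p>",
    "She left the U.K. Then she stayed.",
    "the U.N. said no",
    "A.B. Lda",  # the doubtful case: a capitalized word is treated as a new sentence
]
for s in samples:
    print(s, "->", convert(s))
```

The last sample shows the limitation mentioned above: the function cannot tell a proper noun from the start of a new sentence.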

Last edited by lomkiri; 12-22-2021 at 04:25 PM.