MobileRead Forums - View Single Post

DiapDealer · 09-16-2020, 11:36 AM

Quote:

Originally Posted by leschek

Thank you, it works partially, but it also finds parts of HTML code, such as

Code:

<a href...

and words that end with one of the searched characters when the preceding character is from a non-English alphabet, such as nás, při, etc.

I'm tackling your exceptions in reverse order.

To make \b honor Unicode codepoints, turn on the Unicode Character Properties flag by putting (*UCP) at the very start of the expression.

So the above:

Code:

\b([aiouksvz])\s

becomes:

Code:

(*UCP)\b([aiouksvz])\s

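This should exclude the 's' in 'nás' and the 'i' in 'při' from your examples. Without the flag, \b treats 'á' and 'ř' as non-word characters, so it sees a word boundary right before the final letter.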
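A rough way to see the difference is the Python sketch below. It rests on some assumptions: Python's re module is not PCRE and does not understand (*UCP), so the re.ASCII flag stands in for the "no UCP" behavior, and Python's default Unicode matching stands in for the (*UCP) behavior. The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text containing the two problem words from the question.
text = "nás a při v lese"

# Assumption: re.ASCII approximates the behavior without (*UCP),
# i.e. \b only treats [A-Za-z0-9_] as word characters.
ascii_pattern = re.compile(r"\b([aiouksvz])\s", re.ASCII)

# Assumption: Python's default str matching approximates (*UCP),
# i.e. \b treats accented letters as word characters too.
unicode_pattern = re.compile(r"\b([aiouksvz])\s")

ascii_hits = ascii_pattern.findall(text)
unicode_hits = unicode_pattern.findall(text)

print(ascii_hits)    # includes the trailing letters of 'nás' and 'při'
print(unicode_hits)  # only the standalone one-letter words
```

With Unicode-aware boundaries, only the standalone 'a' and 'v' are picked up.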

To make the expression ignore character class matches that immediately follow an (X)HTML opening angle bracket (<), you can use a negative lookbehind. Something like:

Code:

(*UCP)(?<!\<)\b([aiouksvz])\s

should ignore the 'a' and 'i' characters used in (x)html's anchor and italic tags.
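The effect of the lookbehind can be sketched in Python as well. Same caveats as any port: Python's re is not PCRE, (*UCP) is left out because Python already matches Unicode-aware by default, and the markup string is invented for the example.

```python
import re

# Hypothetical snippet of (x)html with an anchor tag and an italic tag.
html = '<p>Byl v lese a <a href="x.html">odkaz</a> <i class="n">text</i></p>'

# Without the lookbehind, the tag names 'a' and 'i' also match,
# because each one is a single letter followed by whitespace.
plain_pattern = re.compile(r"\b([aiouksvz])\s")

# With the negative lookbehind, a letter directly after '<' is skipped.
guarded_pattern = re.compile(r"(?<!<)\b([aiouksvz])\s")

plain_hits = plain_pattern.findall(html)
guarded_hits = guarded_pattern.findall(html)

print(plain_hits)
print(guarded_hits)
```

Closing tags such as </a> never match in either version, since the letter there is followed by '>' rather than whitespace.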

The (*UCP) flag and the (?<!\<) lookbehind are not capturing groups, despite the parentheses, so the character class is still group 1. The replacement you're looking for will still be something like:

Code:

\1&nbsp;
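To check that the group numbering holds up, here is a minimal Python sketch of the whole find-and-replace. It assumes Python's re in place of PCRE (so no (*UCP), since Unicode-aware matching is already the default there), and the input string is made up.

```python
import re

# Hypothetical input: two one-letter words plus an anchor tag.
text = 'Byl v lese a <a href="x.html">odkaz</a>'

# The lookbehind does not count as a group, so the letter is group 1.
pattern = re.compile(r"(?<!<)\b([aiouksvz])\s")

# Replace the whitespace after each one-letter word with &nbsp;
# while keeping the letter itself via the \1 backreference.
result = pattern.sub(r"\1&nbsp;", text)

print(pattern.groups)  # number of capturing groups in the pattern
print(result)
```

The anchor tag keeps its normal space because the lookbehind rejects the 'a' that follows '<'.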