Quote:
Originally Posted by leschek
Thank you, it works partialy, but it does find also parts of html code as and words ending with searched characters with previous character from non English alphabet as nás, při etc.
I'm tackling your exceptions in reverse order.
To make \b honor Unicode word characters, turn on the Unicode Character Properties flag by putting (*UCP) at the very start of the expression.
So the above expression becomes:
Code:
(*UCP)\b([aiouksvz])\s
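This should exclude the 's' in 'nás' and the 'i' in 'při' from your examples: with (*UCP) on, 'á' and 'ř' count as word characters, so there is no longer a word boundary in front of the letter that follows them.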
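If you want to see the effect of the boundary change outside the editor, here is a rough sketch in Python. It is only an analogy: Python's re module does not accept (*UCP), so I'm assuming its re.ASCII flag as a stand-in for PCRE's default behaviour and its normal Unicode-aware mode as a stand-in for (*UCP). The sample text is made up.

```python
import re

# Sample text (hypothetical): "nás při a dům", written with escapes so the
# accented letters are guaranteed to be single precomposed codepoints.
text = "n\u00e1s p\u0159i a d\u016fm"

# re.ASCII: only [a-zA-Z0-9_] count as word characters, so \b sees a
# boundary after 'á' and 'ř'. Stand-in for PCRE *without* (*UCP).
ascii_only = re.compile(r"\b([aiouksvz])\s", re.ASCII)

# Default str patterns in Python 3 are Unicode-aware, so 'á' and 'ř' are
# word characters. Stand-in for PCRE *with* (*UCP).
unicode_aware = re.compile(r"\b([aiouksvz])\s")

ascii_hits = ascii_only.findall(text)
unicode_hits = unicode_aware.findall(text)

print(ascii_hits)    # ['s', 'i', 'a']  (false hits inside 'nás' and 'při')
print(unicode_hits)  # ['a']            (only the standalone 'a')
```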
To make the expression ignore the character class matches that immediately follow an angled (x)html bracket (<) you can use a negative lookbehind. Something like:
Code:
(*UCP)(?<!\<)\b([aiouksvz])\s
should ignore the 'a' and 'i' characters used in (x)html's anchor and italic tags.
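Same idea for the lookbehind, again as a Python sketch rather than the exact engine you're using. The (*UCP) prefix is left out because Python doesn't support it (its patterns are Unicode-aware by default), and the HTML snippet is invented for the demonstration.

```python
import re

# Hypothetical snippet: one anchor tag plus ordinary text.
html = '<a href="x.html">v lese</a> a doma'

# Without the lookbehind, the 'a' of '<a href' is matched as well.
plain = re.compile(r"\b([aiouksvz])\s")

# (?<!<) rejects any candidate letter that comes right after '<'.
guarded = re.compile(r"(?<!<)\b([aiouksvz])\s")

plain_hits = plain.findall(html)
guarded_hits = guarded.findall(html)

print(plain_hits)    # ['a', 'v', 'a']
print(guarded_hits)  # ['v', 'a']
```

Note that the lookbehind only guards the position right after '<'. Single letters elsewhere inside a tag, such as in attribute values, are not covered by it.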
The (*UCP) flag and the (?<!\<) lookbehind are not capturing groups, despite their appearance, so the matched letter is still group 1. The replacement you're looking for will therefore still be something like: