Thread: Regex examples
Old 09-16-2020, 11:36 AM   #664
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by leschek
Thank you, it works partially, but it also finds parts of html code such as
Code:
<a href...
and words ending with the searched characters where the previous character is from a non-English alphabet, such as nás, při, etc.
I'm tackling your exceptions in reverse order.

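To make \b honor Unicode codepoints, turn on the Unicode Character Properties flag by putting (*UCP) at the very start of the expression. Without it, \b by default treats only ASCII letters, digits, and the underscore as word characters.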

So the above:
Code:
\b([aiouksvz])\s
becomes:
Code:
(*UCP)\b([aiouksvz])\s
This should exclude the 's' in your 'nás' example and the 'i' in your 'při' example, because 'á' and 'ř' now count as word characters and \b no longer matches right after them.
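If you want to sanity-check the word-boundary behavior outside your editor, here's a rough Python sketch. Big caveat: Python's re module doesn't accept (*UCP), so this is only an analogy. I'm assuming re.ASCII is a fair stand-in for the expression without (*UCP), and Python's default Unicode-aware matching a fair stand-in for the expression with it. The sample text is made up.

```python
import re

# Character class from the post: one-letter words followed by whitespace.
PATTERN = r"\b([aiouksvz])\s"

# Made-up sample containing the two problem words plus a real one-letter word.
text = "nás při a tak"

# re.ASCII: 'á' and 'ř' are NOT word characters, so \b matches right after
# them and the trailing 's' and 'i' get picked up by mistake.
ascii_hits = [m.group(1) for m in re.finditer(PATTERN, text, flags=re.ASCII)]

# Default (Unicode-aware) matching: 'á' and 'ř' ARE word characters, so only
# the standalone 'a' matches.
unicode_hits = [m.group(1) for m in re.finditer(PATTERN, text)]

print(ascii_hits)    # ['s', 'i', 'a']
print(unicode_hits)  # ['a']
```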

To make the expression ignore character class matches that immediately follow an (x)html opening angle bracket (<), you can use a negative lookbehind. Something like:
Code:
(*UCP)(?<!\<)\b([aiouksvz])\s
should ignore the 'a' and 'i' characters used in (x)html's anchor and italic tags.
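Here's a quick Python sketch of what the lookbehind buys you. It's only an illustration: the sample markup is invented, and I've left (*UCP) off because Python's re module doesn't support it (its default matching is already Unicode-aware).

```python
import re

# The expression before and after adding the negative lookbehind.
WITHOUT_LOOKBEHIND = r"\b([aiouksvz])\s"
WITH_LOOKBEHIND = r"(?<!\<)\b([aiouksvz])\s"

# Invented sample: an anchor tag and an italic tag, both with attributes,
# each wrapping text that starts with a real one-letter word.
sample = '<a href="x.html">v lese</a> <i class="c">a pak</i>'

# With one capture group, findall returns the captured letters.
before = re.findall(WITHOUT_LOOKBEHIND, sample)
after = re.findall(WITH_LOOKBEHIND, sample)

print(before)  # ['a', 'v', 'i', 'a']  (tag names 'a' and 'i' are caught)
print(after)   # ['v', 'a']            (tag names are skipped)
```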

Neither the (*UCP) flag nor the (?<!\<) lookbehind is a capturing group, despite the parentheses, so the character class is still group 1. The replacement you're looking for will still be something like:
Code:
\1&nbsp;
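To see the find and replace working together, here's a minimal Python sketch. Same caveats as any port: the sample sentence is made up, (*UCP) is dropped because Python's re module doesn't support it (its default matching is Unicode-aware anyway), and the replacement is written as a raw string so \1 refers to the first capture group.

```python
import re

# Find expression from the post, minus (*UCP) (not supported by Python's re).
FIND = r"(?<!\<)\b([aiouksvz])\s"
# Replacement from the post: the captured letter plus a non-breaking space entity.
REPLACE = r"\1&nbsp;"

# Made-up sample: an italic tag, two one-letter words, and 'při'.
text = "<i>Šel k lesu a při tom zpíval.</i>"

result = re.sub(FIND, REPLACE, text)
print(result)  # <i>Šel k&nbsp;lesu a&nbsp;při tom zpíval.</i>
```

Only the standalone 'k' and 'a' get the entity; the tag names and the 'i' at the end of 'při' are left alone.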

Last edited by DiapDealer; 09-17-2020 at 10:04 AM. Reason: Edited to correct the full expression