Regex examples - Page 49

Skydancer · 05-05-2022, 05:47 AM

Any idea on how to capture uppercase words with special diacritic characters, like Ū Ṃ Ḥ Ū etc.?
I tried the following, but it doesn't work. I want to capture uppercase words with 2 or more characters.

Code:

([[:upper:]]{2,})

BeckyEbook · 05-05-2022, 06:09 AM

(*UCP) enables Unicode properties for the expression that follows.

Use:

Code:

(*UCP)([[:upper:]]{2,})

Skydancer · 05-05-2022, 06:15 AM

@BeckyEbook, thank you!

DiapDealer · 05-05-2022, 06:55 AM

Also remember that \p{Lu} and \p{Ll} can be used to match any uppercase (and, respectively, lowercase) letter in any language without requiring the *UCP switch (in Sigil's PCRE regex engine).

\p{L} matches any letter (Unicode or otherwise) and \P{L} matches anything NOT a letter.

So (\p{Lu}{2,}) should theoretically do the same thing (not near a machine to verify syntax).

See the Unicode Categories section of https://www.regular-expressions.info/unicode.html for more categories.
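For anyone testing this idea outside Sigil: Python's built-in re module supports neither (*UCP) nor \p{Lu}, so the sketch below swaps in the standard unicodedata module to approximate what (\p{Lu}{2,}) does. The function name uppercase_runs and the sample text are made up for illustration; this is not how Sigil or PCRE work internally.

```python
import unicodedata

def uppercase_runs(text, min_len=2):
    """Return runs of min_len or more uppercase letters (Unicode category Lu).

    Rough stand-in for the PCRE pattern (\\p{Lu}{2,}); an illustrative
    sketch only, not Sigil's implementation.
    """
    # Compose base letter + combining mark pairs so that a decomposed
    # U + macron is seen as the single uppercase letter it represents.
    text = unicodedata.normalize("NFC", text)
    runs, current = [], []
    for ch in text:
        if unicodedata.category(ch) == "Lu":
            current.append(ch)
        else:
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs

# Hypothetical sample text: U+016A, U+1E42 and U+1E24 are the
# uppercase letters with diacritics mentioned in the question.
sample = "The \u016A\u1E42\u1E24 s\u016Btra and NASA, but not \u016A alone"
found = uppercase_runs(sample)
```

Single capitals such as the "T" in "The" are skipped because the run is shorter than two characters, which mirrors the {2,} quantifier.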

CubGeek · 08-18-2022, 01:51 PM

oh.... wow. 49 pages over the course of ten years?! well, this Regex newbie's got a lot of reading homework, it seems.

CubGeek · 08-18-2022, 02:48 PM

Okay, after reading the "<i>, <em> or <span> for italics" thread from 2020, and then reading the "Extended <head> chapter: NOT necessary?" 2017 thread linked therein [and paying particular attention to Tex2002ans' posting about the underlying purposes for <em> and <i> therein], I've seen the error of my ways regarding using <span> for setting italics.

The up side: I've only just started dipping my toes into the waters with converting my documents into ePub format, so I'm learning good things!
The down side: I now need to learn Regex to be able to search through the files to correct my earlier <span class="abuse">. Thanks, karma.

I've figured out that

Code:

<span class="italics">([^>]+)</span>

will catch every instance of the offending tags on both sides of the content so affected. However, I can't seem to figure out how to get the REPLACE function to leave the content alone and replace just the tags themselves.

I'm happy to do the legwork and the trial-and-error to learn what works. I guess my search skills also need an update, too, because the results I am turning up don't seem to work for me.

Can someone help point me in the right direction?

[edit] Okay, I THINK I found it, but it was hit-or-miss, because it seemed that everything was for JavaScript/C#/VB.NET/PHP/Ruby/etc.

so, it seems that some trial-and-error resulted in me learning about backreferences and capture groups. I've gotten it to work so that

Code:

<em>\g<1></em>

works. whew.

Okay, next question: is this a kludge and there's a better way? or is this correct? Thanks, y'all! [/edit]

Turtle91 · 08-18-2022, 04:32 PM

That's pretty advanced stuff!

I go pretty easy...and it seems to work so far...

find: <i>(.*?)</i>
replace: <em>\1</em>

or

find: <span class="italics">(.*?)</span>
replace: <em>\1</em>

etc.
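The question mark in (.*?) matters more than it looks. A minimal Python sketch, using a made-up sample line, compares the lazy form with the greedy (.*) when two italic runs share a line:

```python
import re

# Hypothetical sample: two separate italic runs on one line.
line = '<p><i>one</i> and <i>two</i></p>'

# Lazy: (.*?) stops at the first </i> it can reach.
lazy = re.sub(r'<i>(.*?)</i>', r'<em>\1</em>', line)

# Greedy: (.*) runs on to the LAST </i> on the line,
# swallowing the markup in between.
greedy = re.sub(r'<i>(.*)</i>', r'<em>\1</em>', line)

print(lazy)    # <p><em>one</em> and <em>two</em></p>
print(greedy)  # <p><em>one</i> and <i>two</em></p>
```

The greedy version leaves mismatched tags behind, which is why the lazy quantifier is the safer habit for this kind of search.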

CubGeek · 08-18-2022, 10:11 PM

Quote:

Originally Posted by Turtle91

That's pretty advanced stuff!

I go pretty easy...and it seems to work so far...

find: <i>(.*?)</i>
replace: <em>\1</em>

or

find: <span class="italics">(.*?)</span>
replace: <em>\1</em>

etc.

Oh, that's much simpler. Thank you! Since the stuff I'm working on has a combination of <i> for "inside voice," and "named things" as well as <em> for word emphasis, this certainly has been a learning experience!

Tex2002ans · 08-18-2022, 11:04 PM

Quote:

Originally Posted by CubGeek

Okay, after reading the "<i>, <em> or <span> for italics" thread from 2020 [...] [and paying particular attention to Tex2002ans' posting about the underlying purposes for <em> and <i> therein], I've seen the error of my ways regarding using <span> for setting italics.

The easiest way to do it is to use DiapDealer's fantastic "TagMechanic" plugin.

I explained how to install Sigil plugins in this 2021 post.

And I gave step-by-step instructions on how to use TagMechanic here:

2020: "Replacing All Tags Of One Kind With Another"

That will help mass convert your <span> -> <em> or <i>.

It will be much safer than trying to use Regular Expressions, because regex can't safely handle complicated cases of <span>s inside of <span>s.

Quote:

Originally Posted by CubGeek

I've figured out that

Code:

<span class="italics">([^>]+)</span>

Find: <span class="italics">([^<]+)</span>
Replace: <em>\1</em>

You see the parentheses you wrapped around your stuff? That's called a "Capture Group".

Explanation of the Find

Let's break it down into each piece:

<span class="italics">
(
- [^<]+
)
</span>

It's saying:

"Hey, find the italics <span>."
"You see this open parenthesis? Stick this next stuff into a group!"
- "Keep grabbing everything that's NOT a '<'."
"Closing parenthesis? Everything captured between them goes into GROUP 1!"
"Hey, find the closing </span>."

Now when you're Replacing, you can use \1 to get "Group #1".

Explanation of the Replace

<em> = "Put the opening <em>."
\1 = "Put whatever was captured in GROUP 1 here."
</em> = "Put the closing </em>."
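The same find/replace pair can be tried outside Sigil. Here is a minimal sketch with Python's re module; the sample strings are invented, and Python's regex flavor is close to but not identical to Sigil's PCRE:

```python
import re

FIND = r'<span class="italics">([^<]+)</span>'
REPLACE = r'<em>\1</em>'

# Hypothetical sample paragraph.
sample = '<p>She <span class="italics">really</span> meant it.</p>'
converted = re.sub(FIND, REPLACE, sample)

# \g<1> is another spelling of the same backreference in Python.
converted_g = re.sub(FIND, r'<em>\g<1></em>', sample)

# [^<]+ refuses to cross another tag, so a span that contains
# nested markup is simply not matched and stays as it was.
nested = '<span class="italics">a <b>bold</b> word</span>'
untouched = re.sub(FIND, REPLACE, nested)

print(converted)  # <p>She <em>really</em> meant it.</p>
```

Skipping the nested case is the cautious behaviour here: those spans stay in the file for a manual look instead of being rewritten incorrectly.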

- - -

Side Note: If you have more complicated regex, you can get up to 9 capture groups!

\1, \2, \3, [...], \9

But at that point, it's probably smarter to split your search/replaces into smaller pieces.

- - -

Side Note #2: If you want some more Regex tricks, I just wrote a post a few months ago here:

2022: "Need help with regex"

which linked to some of my other posts over the years. I break down + color-coordinate many of the ones I use.

Quote:

Originally Posted by Turtle91

I go pretty easy...and it seems to work so far...

find: <i>(.*?)</i>
replace: <em>\1</em>

or

find: <span class="italics">(.*?)</span>
replace: <em>\1</em>

Yep, this type of stuff works too.

Easier/Safer to use Tag Mechanic though. :P

Quote:

Originally Posted by CubGeek

Since the stuff I'm working on has a combination of <i> for "inside voice," and "named things" as well as <em> for word emphasis, this certainly has been a learning experience!

And I don't know if you caught this topic:

2021: "Italics and Bold"

where I explained differences between + even further.

CubGeek · 08-19-2022, 11:43 AM

Quote:

Originally Posted by Tex2002ans

The easiest way to do it is to use DiapDealer's fantastic "TagMechanic" plugin.

I explained how to install Sigil plugins in this 2021 post.

And I gave step-by-step instructions on how to use TagMechanic here:

2020: "Replacing All Tags Of One Kind With Another"

Cheers, that'll help significantly. Luckily, the few things I'm crafting are small enough, and I'm doing them slow enough, that there isn't much "spaghettification" of the code, or the whole <span>ception of nested <span>s thing that I've seen when I peeked inside a couple of my purchased or calibre-converted books.

Quote:

You see the parentheses you wrapped around your stuff? That's called a "Capture Group".

Yup! Note my edit above where I learned about Capture Groups and backreferences and...

However, I like your explanation better.

Much more user friendly.

Quote:

Side Note #2: If you want some more Regex tricks, I just wrote a post a few months ago here:

2022: "Need help with regex"

which linked to some of my other posts over the years. I break down + color-coordinate many of the ones I use.

bookmarked!

Quote:

And I don't know if you caught this topic:

2021: "Italics and Bold"

where I explained differences between + even further.

Oh, I did. *twitch*

I'm sure I was mumbling about em's and i's and strong's and b's (oh my!) in my sleep to the annoyance of my cats

Tex2002ans · 08-19-2022, 02:09 PM

Quote:

Originally Posted by CubGeek

Cheers, that'll help significantly. Luckily, the few things I'm crafting are small enough, and I'm doing them slow enough, that there isn't much "spaghettification" of the code, or the whole <span>ception of nested <span>s thing that I've seen when I peeked inside a couple of my purchased or calibre-converted books.

It usually happens around footnotes and all sorts of other complicated nesting:

Code:

<p class="normal"><span class="normal">This is an <span class="italics">example</span>.<sup><span class="tiny">1</span></sup></span></p>

Let's say you were trying to correct (or remove) that outside <span>.

Regular Expressions would get completely confused with the 3 different <span>s, where TagMechanic would be able to figure out which connects with which one.

Of course, with clean code, this wouldn't be a problem, but in real life there's always these crazy examples that creep up... and it comes to bite you in the butt later when you already accidentally did a "Replace All" 3 hours ago!
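To make the failure concrete, here is a small Python sketch run against the example line above. The first part shows a lazy regex pairing the outer <span> with the wrong closing tag. The second part is a hypothetical helper, unwrap_outer_span, that counts nesting depth to find the right one. It only illustrates the idea of matching tags up, and is not a description of how TagMechanic is actually implemented.

```python
import re

html = ('<p class="normal"><span class="normal">This is an '
        '<span class="italics">example</span>.<sup>'
        '<span class="tiny">1</span></sup></span></p>')

# A lazy regex pairs the OUTER opening tag with the FIRST </span>,
# which actually belongs to the inner italics span.
naive = re.search(r'<span class="normal">(.*?)</span>', html).group(1)

SPAN_TAG = re.compile(r'<span\b[^>]*>|</span>')

def unwrap_outer_span(text, opening):
    """Remove one <span> and its true matching </span> by counting depth.

    Hypothetical helper for illustration only.
    """
    start = text.find(opening)
    if start == -1:
        return text
    depth = 0
    for m in SPAN_TAG.finditer(text, start):
        if m.group().startswith('</'):
            depth -= 1
            if depth == 0:  # this close matches our opening tag
                inner = text[start + len(opening):m.start()]
                return text[:start] + inner + text[m.end():]
        else:
            depth += 1
    return text  # unbalanced markup: leave untouched

fixed = unwrap_outer_span(html, '<span class="normal">')
```

Here naive ends at "example", in the middle of the italics span, while fixed keeps both inner spans intact and drops only the outer pair.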

Quote:

Originally Posted by CubGeek

Yup! Note my edit above where I learned about Capture Groups and backreferences and...

However, I like your explanation better.

Much more user friendly.

You can also use those in FINDs as well!

For example, one of the tricks I use is:

Double Word Check

Find: (\b[a-z]+) (\1\b)
Replace: \1

This grabs a lowercase word + looks for it again:

Did you see the reactor reactor?
What are you doing in that that area?
If only they had had enough power to use the ultrasound machine for each pregnancy, he would have detected the problem earlier and been able to plan the C-section.

How does it work?

It uses a few tricks:

\b = a "word boundary". (Beginning of word)
[a-z] = lowercase letters 'a' through 'z'.
+ = ONE OR MORE of previous thing.

Shove all that in GROUP 1.

\1 = Look for GROUP 1 again.
\b = a "word boundary". (End of word)

Shove all that in GROUP 2.

Now, when you replace, you're only replacing with GROUP 1, meaning that duplicated word never makes it:

Did you see the reactor?
What are you doing in that area?

- - -

Usage Note: You do have to be careful of false positives though, so NEVER do a "Replace All".

Always do a one-by-one check.

There shouldn't ever be too many "doubles" within your book, but they're an extremely common typo that's very hard to catch. (Usually the human brain just skips right over them.)
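The one-by-one check can also be scripted. The Python sketch below uses the same pattern to list each hit with a little context, then fixes a single hit only when asked. The helper names find_doubles and fix_one are made up for this illustration.

```python
import re

DOUBLE_WORD = re.compile(r'(\b[a-z]+) (\1\b)')

def find_doubles(text, context=15):
    """Return (offset, snippet) pairs for each doubled word, for review."""
    hits = []
    for m in DOUBLE_WORD.finditer(text):
        lo = max(0, m.start() - context)
        hi = min(len(text), m.end() + context)
        hits.append((m.start(), text[lo:hi]))
    return hits

def fix_one(text, offset):
    """Collapse the doubled word starting at offset, if there is one."""
    m = DOUBLE_WORD.match(text, offset)
    if m is None:
        return text
    return text[:m.start()] + m.group(1) + text[m.end():]

typo = "Did you see the reactor reactor?"
hits = find_doubles(typo)
repaired = fix_one(typo, hits[0][0])

# A legitimate double is reported too, which is exactly why
# each hit needs a human decision before it is changed.
legit_hits = find_doubles("If only they had had enough power")
```

Because fix_one touches only the offset it is given, a legitimate "had had" stays in place unless someone deliberately picks it.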

- - -

Quote:

Originally Posted by CubGeek

Oh, I did. *twitch*

I'm sure I was mumbling about em's and i's and strong's and b's (oh my!) in my sleep

Me too. Took me many years to finally get it boiled down.

Glad to see someone benefited from all those in-depth discussions.

JSWolf · 08-19-2022, 02:25 PM

Use <i> and <b> and forget <em> and <strong> ever existed.

DiapDealer · 08-19-2022, 02:35 PM

Drop it Jon. Your preferences are not really relevant to the conversation at hand.

CubGeek · 08-19-2022, 04:28 PM

Quote:

Originally Posted by JSWolf

Use <i> and <b> and forget <em> and <strong> ever existed.

After reading threads that spanned (ha! <span>ned!) 5+ years, and seeing you spouting the same thing about <i> and <b> and <em> and <strong> (regardless of being educated better), I'll at least give you credit for consistency. But that's all. Thanks for your input.

CubGeek · 08-19-2022, 04:30 PM

Quote:

Originally Posted by Tex2002ans

Glad to see someone benefited from all those in-depth discussions.

Having spent 5 years working for a boss who was blind and who used a screen-reader, I hope that I now have a better empathy for the difficulties she encountered than before my time there. Not only with official communications, but with webpage navigation, with poorly-implemented accessibility "functions," and also with the simple pleasure of "reading" a book over her lunch break.

So, if my learning how to properly show varying types of emphasis helps convey nuances to someone who's relying on a screen-reader or similar (on the very infinitesimal chance they access something that I put together), then it was time well spent.

