Thread: Regex examples
07-25-2019, 05:25 PM   #588
Tex2002ans
Quote:
Originally Posted by Vroni
I have a rather long regex with more than 9 groups. How do I refer to group number 10?

I'd be interested to know what you're trying to do that requires 10 or more groups.

Quote:
Originally Posted by Doitsu
Sigil uses the PCRE library, which supports named subpatterns.

Fascinating. I had no idea about named subpatterns.

But once you reach 10 groups, it's probably best to break the Regex down into smaller, more understandable chunks.
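
For anyone landing here with the same question, here is a minimal sketch in Python's re module. That is an assumption for illustration only: it is not Sigil, and the replacement syntax Sigil accepts for group 10 isn't covered in this post. It shows two ways to avoid ambiguity once you pass 9 groups: the explicit \g<10> form and named groups. The (?P<name>...) spelling is Python's, which PCRE also accepts; the patterns and sample strings are made up.

```python
import re

# Eleven one-letter groups, just so that a group numbered 10 exists.
pattern = re.compile(r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)")

# \g<10> always means "group 10", never "group 1 followed by a literal 0".
by_number = pattern.sub(r"\g<10>", "abcdefghijk")

# Named groups remove the need to count parentheses at all.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
by_name = date.sub(r"\g<day>/\g<month>/\g<year>", "2019-07-25")

print(by_number)
print(by_name)
```

Named groups also make a long Regex easier to split into chunks later, since each piece already says what it captures.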

Quote:
Originally Posted by roger64
About small caps.

On a big bibliographical file, I have about 300 names of authors written with caps like ZOLA and LA BRUYÈRE and BALZAC and so on.

I would like to write a regex that would allow me to write each name with small caps this way:

Code:
<span class="smcp">La Bruyère</span>

This is similar to one I use (it writes class="smallcaps", so swap in "smcp" to match your markup):

Smallcaps Unicode:

Search: (*UCP)([[:upper:]])([[:upper:]]{2,})
Replace: <span class="smallcaps">\1\L\2\E</span>

Definitely don't use Replace All with this one, as there can be many false positives (any run of three or more capitals matches, so acronyms and Roman numerals get caught too).

What each part is doing, in plain English:

(*UCP) = Tells PCRE to be "unicode aware", so classes like [[:upper:]] also match accented characters, like È.

([[:upper:]]) = Grabs the first uppercase character. (Becomes Group 1)

([[:upper:]]{2,}) = Grabs the next 2 or more uppercase characters. (Becomes Group 2)
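
To see why the Unicode awareness matters, here's a rough Python sketch. Python's re module has no (*UCP) or [[:upper:]], so the Latin-1 class in UPPER is my stand-in (an assumption; it only covers Western European capitals, not everything PCRE would match).

```python
import re

# Assumption: Latin-1 capitals stand in for PCRE's (*UCP) + [[:upper:]].
UPPER = "A-ZÀ-ÖØ-Þ"

sample = "LA BRUYÈRE"

# ASCII-only class: the run breaks at È, leaving only the fragment "BRUY"
# long enough to match ("LA" and "RE" are too short).
ascii_only = re.findall(r"[A-Z]{3,}", sample)

# Accent-aware class: the whole name is matched as one run.
accent_aware = re.findall(rf"[{UPPER}]{{3,}}", sample)

print(ascii_only)
print(accent_aware)
```

Without the accent-aware class, the name would be split in the middle and the span would wrap only part of it.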

* * *

Side Note: It won't work on two-letter ALL CAPS words, but that's easily adjusted by changing the {2,} into a {1,}:

Search: (*UCP)([[:upper:]])([[:upper:]]{1,})

(I do this in a later pass, because there are probably edge cases, like "DE", which should become "de" rather than "De".)

* * *

This Replace:

<span class="smallcaps">\1\L\2\E</span>

puts Group 1 (the first uppercase letter) back unchanged. Then \L lowercases everything after that point (Group 2), until \E switches the lowercasing off again.

In your example text:

Code:
<p>ZOLA and LA BRUYÈRE and BALZAC</p>
After running the Regex:

Code:
<p><span class="smallcaps">Zola</span> and LA <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>
Then you could run the {1,} variant to catch the "LA":

Code:
<p><span class="smallcaps">Zola</span> and <span class="smallcaps">La</span> <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>
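
If you'd rather try the two passes outside Sigil first, here is a minimal Python sketch of the same idea. Python's re has no (*UCP), [[:upper:]] or \L...\E, so this swaps in a Latin-1 uppercase class and a callback that lowercases Group 2. The smallcaps function and its min_rest parameter are hypothetical names of mine, not anything from Sigil or PCRE.

```python
import re

# Assumption: Latin-1 capitals stand in for PCRE's (*UCP) + [[:upper:]].
UPPER = "A-ZÀ-ÖØ-Þ"


def smallcaps(text, min_rest=2):
    """Wrap ALL CAPS runs in a span: keep the first letter, lowercase the rest."""
    # Group 1 = first capital, Group 2 = at least min_rest further capitals.
    pattern = re.compile(rf"([{UPPER}])([{UPPER}]{{{min_rest},}})")

    def repl(match):
        # Stands in for the \1\L\2\E replacement.
        first, rest = match.group(1), match.group(2)
        return f'<span class="smallcaps">{first}{rest.lower()}</span>'

    return pattern.sub(repl, text)


source = "<p>ZOLA and LA BRUYÈRE and BALZAC</p>"
first_pass = smallcaps(source)                    # the {2,} version
second_pass = smallcaps(first_pass, min_rest=1)   # the {1,} variant

print(first_pass)
print(second_pass)
```

Unlike stepping through matches by hand, this replaces everything at once, so the false-positive warning above applies with full force.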
From there, you could do a search for:

Search: </span> <span class="smallcaps">
Replace: (a single space)

That would merge "La Bruyère" into a single span.
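
The merge step has no special Regex characters in it, so as a quick sanity check it can be sketched in Python with a plain string replace (Python is my choice here, not part of the Sigil workflow; the input is the two-pass result from above):

```python
# Result of the two smallcaps passes on the example paragraph.
after_both_passes = (
    '<p><span class="smallcaps">Zola</span> and '
    '<span class="smallcaps">La</span> <span class="smallcaps">Bruyère</span> and '
    '<span class="smallcaps">Balzac</span></p>'
)

# A closing span, one space, and an opening span collapse into a single space.
merged = after_both_passes.replace('</span> <span class="smallcaps">', " ")

print(merged)
```

Only spans separated by exactly one space get merged, which is why "Zola" and "La Bruyère" stay separate (there is an "and" between them).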

And you would probably have to handle middle initials:

Search: </span> ([[:upper:]]\.) <span class="smallcaps">
Replace: \1

Note: Make sure to put a space before and after the \1 in this Replace. Add (*UCP) at the front of the Search if any of the initials are accented.
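
A small Python sketch of the middle-initial merge, under the same assumptions as before: a Latin-1 class stands in for [[:upper:]], and the sample name is invented for illustration.

```python
import re

# Assumption: Latin-1 capitals stand in for PCRE's (*UCP) + [[:upper:]].
UPPER = "A-ZÀ-ÖØ-Þ"

# A lone initial plus a period, sitting between two smallcaps spans.
initial = re.compile(rf'</span> ([{UPPER}]\.) <span class="smallcaps">')

# Hypothetical sample: what "JOHN Q. PUBLIC" looks like after the first pass.
text = '<span class="smallcaps">John</span> Q. <span class="smallcaps">Public</span>'

# The replacement is the initial with a space on each side.
joined = initial.sub(r" \1 ", text)

print(joined)
```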

Last edited by Tex2002ans; 07-25-2019 at 05:28 PM.