MobileRead Forums - View Single Post

Tex2002ans · 07-25-2019, 05:25 PM

Quote:

Originally Posted by Vroni

I have a rather long regex with more than 9 groups. How do I refer to group number 10?

I'd be interested to know what you're trying to do that requires 10 or more groups.

Quote:

Originally Posted by Doitsu

Sigil uses the PCRE library, which supports named subpatterns.

Fascinating. I had no idea about named subpatterns.

But once you reach 10 groups, it's probably best to break the Regex down into smaller, more understandable chunks.
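
To make the group-10 question and the named-subpattern idea concrete, here is a minimal sketch in Python's `re` module, not Sigil's PCRE. The patterns and function names are invented for the example, and Python's replacement syntax (`\g<10>`, `\g<name>`) is not necessarily what Sigil's Replace field accepts.

```python
import re

# Hypothetical pattern with ten capture groups, one letter each.
TEN_GROUPS = re.compile(r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)")

def pick_tenth(text):
    # \g<10> is unambiguous: group 10, not group 1 followed by a literal "0".
    return TEN_GROUPS.sub(r"[\g<10>]", text)

# Hypothetical named-group pattern: names replace counting parentheses.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

def month_first(text):
    # Refer to the groups by name instead of by number.
    return DATE.sub(r"\g<month>/\g<year>", text)

print(pick_tenth("abcdefghij"))  # [j]
print(month_first("2019-07"))    # 07/2019
```

Named groups keep working if you later add or remove a pair of parentheses, which is the main reason to prefer them in long expressions.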

Quote:

Originally Posted by roger64

About small caps.

On a big bibliographical file, I have about 300 names of authors written with caps like ZOLA and LA BRUYÈRE and BALZAC and so on.

I would like to write a regex that would allow me to write each name with small caps this way:

Code:

<span class="smcp">La Bruyère</span>

This is similar to one I use:

Smallcaps Unicode:

Search: (*UCP)([[:upper:]])([[:upper:]]{2,})
Replace: <span class="smallcaps">\1\L\2\E</span>

Definitely don't Replace All while using this one, as there can be many false positives.

What each part is doing, in plain English:

(*UCP) = This tells PCRE to be Unicode-aware, so that classes like [[:upper:]] also match accented characters such as È.

[[:upper:]] = Grabs the first uppercase character. (Becomes Group 1)

[[:upper:]]{2,} = Grabs the next 2 or more uppercase characters. (Becomes Group 2)
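
To show why the Unicode-aware part matters, here is a small Python sketch. Python's `re` has neither (*UCP) nor [[:upper:]], so a hand-written Latin-1 uppercase class stands in for them. That class and the function name are assumptions for the example only.

```python
import re

# Plain ASCII capitals only.
ASCII_UPPER = re.compile(r"[A-Z]+")
# Assumption: Latin-1 capitals as a stand-in for a Unicode-aware [[:upper:]].
LATIN1_UPPER = re.compile(r"[A-ZÀ-ÖØ-Þ]+")

def upper_runs(pattern, text):
    # Return every run of uppercase letters the pattern can see.
    return pattern.findall(text)

print(upper_runs(ASCII_UPPER, "LA BRUYÈRE"))   # ['LA', 'BRUY', 'RE']
print(upper_runs(LATIN1_UPPER, "LA BRUYÈRE"))  # ['LA', 'BRUYÈRE']
```

The ASCII-only class breaks the name in two at the È, which is the failure the Unicode-aware search avoids.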

* * *

Side Note: It won't work on two-letter ALL CAPS words, but that can easily be adjusted by changing the {2,} into a {1,}:

Search: (*UCP)([[:upper:]])([[:upper:]]{1,})

(I do this in a later pass, because there are probably edge cases, like "DE" -> "de".)

* * *

The core of the Replace:

\1\L\2\E

puts Group 1 (the first uppercase letter) back unchanged, then lowercases everything between \L and \E, which here is Group 2.

In your example text:

Code:

<p>ZOLA and LA BRUYÈRE and BALZAC</p>

After running the Regex:

Code:

<p><span class="smallcaps">Zola</span> and LA <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>

Then you could run the {1,} variant to catch the "LA":

Code:

<p><span class="smallcaps">Zola</span> and <span class="smallcaps">La</span> <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>
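
The two passes above can be imitated outside Sigil with a rough Python sketch. Python's `re` has no (*UCP), [[:upper:]] or \L...\E, so this assumes a Latin-1 uppercase class and uses a callback for the lowercasing. It also assumes the Replace wraps the result in the smallcaps span shown in the output. The names `UPPER` and `smallcaps_pass` are made up for the example.

```python
import re

# Assumption: Latin-1 capitals stand in for PCRE's (*UCP) + [[:upper:]].
UPPER = "A-ZÀ-ÖØ-Þ"

def smallcaps_pass(html, min_rest=2):
    """Wrap ALL CAPS words in a span; min_rest mirrors the {2,} or {1,} quantifier."""
    pattern = re.compile("([%s])([%s]{%d,})" % (UPPER, UPPER, min_rest))

    def wrap(match):
        # Group 1 stays as-is; Group 2 is lowercased, like \L\2\E.
        return '<span class="smallcaps">%s%s</span>' % (
            match.group(1), match.group(2).lower())

    return pattern.sub(wrap, html)

first = smallcaps_pass("<p>ZOLA and LA BRUYÈRE and BALZAC</p>")
second = smallcaps_pass(first, min_rest=1)  # the {1,} variant
print(first)
print(second)
```

Unlike stepping through matches in Sigil, `sub` replaces everything at once, so this is closer to Replace All and carries the same false-positive risk.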

From there, you could search for the boundary between two adjacent spans:

Search: </span> <span class="smallcaps">
Replace: JUST-PUT-A-SPACE-HERE

that would merge "La Bruyère" into a single span.
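
A minimal Python sketch of that merge step, assuming the Search is the closing tag, one space, and the opening smallcaps tag. The function name is hypothetical.

```python
import re

def merge_adjacent_spans(html):
    # A closing tag, one space, and an opening smallcaps tag collapse to one space.
    return re.sub(r'</span> <span class="smallcaps">', " ", html)

merged = merge_adjacent_spans(
    '<span class="smallcaps">La</span> <span class="smallcaps">Bruyère</span>')
print(merged)  # <span class="smallcaps">La Bruyère</span>
```

Note that this also joins two separate names that happen to sit next to each other, and it does not check which class the closing tag belongs to, so it is worth reviewing the matches.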

And you would probably have to handle Middle Initials:

Search: </span> ([[:upper:]]\.) <span class="smallcaps">
Replace: \1

Note: Make sure to put a space before and after the \1 in this Replace.
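
Here is a rough Python sketch of the middle-initial step. It assumes the Search is meant to cover the closing tag before the initial and the opening tag after it, and that the Replace is the initial with a space on each side. The Latin-1 uppercase class, the function name and the sample name are all invented for the example.

```python
import re

# Assumption: Latin-1 capitals stand in for [[:upper:]].
UPPER = "A-ZÀ-ÖØ-Þ"

def merge_middle_initial(html):
    # Match: closing tag, space, one capital plus a period, space, opening tag.
    pattern = r'</span> ([%s]\.) <span class="smallcaps">' % UPPER
    # Keep only the initial, with a space before and after it.
    return re.sub(pattern, r" \1 ", html)

sample = '<span class="smallcaps">Jean</span> B. <span class="smallcaps">Poquelin</span>'
print(merge_middle_initial(sample))
# <span class="smallcaps">Jean B. Poquelin</span>
```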