Quote:
Originally Posted by Vroni
I have a rather long regex with more than 9 groups. How do I refer to group number 10?
|
I'd be interested to hear what you're trying to do that requires 10 or more groups.
Quote:
Originally Posted by Doitsu
Sigil uses the PCRE library, which supports named subpatterns.
|
Fascinating. I had no idea about named subpatterns.
But once you reach 10 groups, it's probably best to break the Regex down into smaller, more understandable chunks.
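For anyone curious what named subpatterns look like in practice, here is a rough sketch in Python rather than Sigil. The (?P<name>...) form for defining a named group is accepted by both PCRE and Python, but the \g<...> references in the replacement are Python's syntax, and I'm not assuming Sigil's Replace field works the same way. The date pattern and the ten-letter string are made-up examples.

```python
import re

# Hypothetical example: reorder a date using named groups instead of \1, \2, \3.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
reordered = date_pattern.sub(r"\g<day>/\g<month>/\g<year>", "2014-03-09")
print(reordered)  # 09/03/2014

# Ten numbered groups: in Python, \g<10> refers unambiguously to group 10.
ten_groups = re.compile(r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)")
tenth_then_first = ten_groups.sub(r"\g<10>\g<1>", "abcdefghij")
print(tenth_then_first)  # ja
```

With names, the replacement no longer depends on counting parentheses, which is most of the pain once a pattern grows past nine groups.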
Quote:
Originally Posted by roger64
About small caps.
On a big bibliographical file, I have about 300 names of authors written with caps like ZOLA and LA BRUYÈRE and BALZAC and so on.
I would like to write a regex that would allow me to write each name with small caps this way:
Code:
<span class="smcp">La Bruyère</span>
|
This is similar to one I use:
Smallcaps Unicode:
Search: (*UCP)([[:upper:]])([[:upper:]]{2,})
Replace: <span class="smallcaps">\1\L\2\E</span>
Definitely don't Replace All while using this one, as there can be many false positives.
What each part is doing, in plain English:
(*UCP) = Tells PCRE to use Unicode properties for character classes, so [[:upper:]] also matches accented capitals like È. It has to go at the very start of the pattern.
[[:upper:]] = Grabs the first uppercase character. (Becomes Group 1)
[[:upper:]]{2,} = Grabs the next 2 or more uppercase characters. (Becomes Group 2)
* * *
Side Note: Because a match needs at least three capitals (one for Group 1 plus two or more for Group 2), it skips two-letter ALL CAPS words. That is easily adjusted by changing the {2,} to {1,}:
Search: (*UCP)([[:upper:]])([[:upper:]]{1,})
(I do this in a later pass, because there are probably edge cases, like "DE" -> "de".)
* * *
This Replace:
<span class="smallcaps">\1\L\2\E</span>
puts Group 1 (the first uppercase letter) back unchanged. The \L then lowercases everything that follows it, which here is Group 2, until \E switches the lowercasing off again.
In your example text:
Code:
<p>ZOLA and LA BRUYÈRE and BALZAC</p>
After running the Regex:
Code:
<p><span class="smallcaps">Zola</span> and LA <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>
then you could run the {1,} variant to catch the "LA":
Code:
<p><span class="smallcaps">Zola</span> and <span class="smallcaps">La</span> <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>
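If you want to try the two passes outside Sigil, here is a minimal Python sketch. Python's re module has no (*UCP), no [[:upper:]] and no \L...\E, so this is an emulation and not the same regex: it uses str.isupper() to find runs of capitals and itertools.groupby to collect them. The function name smallcaps and its min_len parameter are my own inventions for illustration, and str.isupper() may not agree with PCRE's [[:upper:]] on every exotic character.

```python
from itertools import groupby


def smallcaps(text: str, min_len: int = 3, css_class: str = "smallcaps") -> str:
    """Wrap every run of min_len or more uppercase letters in a span."""
    pieces = []
    # groupby splits the text into alternating runs of uppercase / other characters.
    for is_upper, run in groupby(text, key=str.isupper):
        word = "".join(run)
        if is_upper and len(word) >= min_len:
            # Same effect as \1\L\2\E: keep the first letter, lowercase the rest.
            word = f'<span class="{css_class}">{word[0]}{word[1:].lower()}</span>'
        pieces.append(word)
    return "".join(pieces)


source = "<p>ZOLA and LA BRUYÈRE and BALZAC</p>"
first_pass = smallcaps(source)                   # like {2,}: three or more capitals
second_pass = smallcaps(first_pass, min_len=2)   # like {1,}: two or more capitals
print(first_pass)
print(second_pass)
```

The two printed lines should match the two Code blocks above. Like the regex, this has no idea what is a name and what is an acronym, so the warning about false positives applies just the same.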
From there, you could do a search for:
Search: </span> <span class="smallcaps">
Replace: JUST-PUT-A-SPACE-HERE
that would merge "La Bruyère" into a single span.
And you would probably have to handle Middle Initials:
Search: </span> ([[:upper:]]\.) <span class="smallcaps">
Replace: \1
Note: Make sure to put a space before and after the \1 in the Replace field, so the initial stays separated from both names.
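The two cleanup searches can also be sketched in Python, again only as an illustration of the idea. The function name merge_spans is hypothetical, the "John F. Kennedy" string is a made-up test case, and since Python's re has no [[:upper:]], the sketch matches any single word character followed by a period and then checks it with str.isupper().

```python
import re

OPEN_TAG = '<span class="smallcaps">'
CLOSE_TAG = "</span>"

# One word character plus a period, sitting between two smallcaps spans.
INITIAL = re.compile(re.escape(CLOSE_TAG) + r" (\w\.) " + re.escape(OPEN_TAG))


def merge_spans(html: str) -> str:
    """Join neighbouring smallcaps spans into a single span."""
    # First search: two spans separated by a single space become one span.
    html = html.replace(f"{CLOSE_TAG} {OPEN_TAG}", " ")

    # Second search: pull a middle initial such as "F." into the merged span.
    def keep_initial(match: re.Match) -> str:
        initial = match.group(1)
        if initial[0].isupper():
            return f" {initial} "  # the spaces around \1 from the note above
        return match.group(0)      # not an uppercase initial: leave untouched

    return INITIAL.sub(keep_initial, html)


two_words = '<span class="smallcaps">La</span> <span class="smallcaps">Bruyère</span>'
with_initial = '<span class="smallcaps">John</span> F. <span class="smallcaps">Kennedy</span>'
print(merge_spans(two_words))
print(merge_spans(with_initial))
```

Like the Sigil search, the first step does not check which class the closing </span> belongs to, so it is worth stepping through the matches one at a time on a real file.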