Regex examples - Page 40

Doitsu · 07-19-2019, 04:19 PM

Quote:

Originally Posted by Vroni

I have a rather long regex with more than 9 groups. How do I refer to group number 10?

Sigil uses the PCRE library, which supports named subpatterns.

For example, if your text contains:

Code:

123456789ABCDEF

and you search for:

Code:

(.)(.)(.)(.)(.)(.)(.)(.)(.)(?<a>.)(?<b>.)(?<c>.)(?<d>.)(?<e>.)(?<f>.)

and replace it with:

Code:

\g{f}\g{e}\g{d}\g{c}\g{b}\g{a}

you'll end up with:

Code:

FEDCBA
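For anyone trying the same idea outside Sigil, here is a minimal sketch using Python's built-in re module. This is an assumption-laden translation, not Sigil's PCRE: Python spells named groups (?P<name>...) and refers to them in the replacement as \g<name>. Named groups are still counted, so \g<10> also reaches the tenth group.

```python
import re

text = "123456789ABCDEF"

# Nine plain groups followed by six named groups (Python syntax: (?P<name>...)).
pattern = (
    r"(.)(.)(.)(.)(.)(.)(.)(.)(.)"
    r"(?P<a>.)(?P<b>.)(?P<c>.)(?P<d>.)(?P<e>.)(?P<f>.)"
)

# Refer to the groups by name ...
by_name = re.sub(pattern, r"\g<f>\g<e>\g<d>\g<c>\g<b>\g<a>", text)

# ... or by number; \g<10> is unambiguous, unlike a bare \10.
by_number = re.sub(pattern, r"\g<15>\g<14>\g<13>\g<12>\g<11>\g<10>", text)

print(by_name)    # FEDCBA
print(by_number)  # FEDCBA
```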

roger64 · 07-25-2019, 04:28 PM

About small caps.

On a big bibliographical file, I have about 300 names of authors written with caps like ZOLA and LA BRUYÈRE and BALZAC and so on.

I would like to write a regex that would allow me to write each name with small caps this way:

Code:

<span class="smcp">La Bruyère</span>

I remember using such a regex some years ago, but unfortunately I have lost all those regexes.

Tex2002ans · 07-25-2019, 05:25 PM

Quote:

Originally Posted by Vroni

I have a rather long regex with more than 9 groups. How do I refer to group number 10?

I would be interested in what you're trying to do that requires more than 10 groups?

Quote:

Originally Posted by Doitsu

Sigil uses the PCRE library, which supports named subpatterns.

Fascinating. Had no idea about subpatterns.

But once you reach 10 groups, it's probably best to break the Regex down into smaller, more understandable chunks.

Quote:

Originally Posted by roger64

About small caps.

On a big bibliographical file, I have about 300 names of authors written with caps like ZOLA and LA BRUYÈRE and BALZAC and so on.

I would like to write a regex that would allow me to write each name with small caps this way:

Code:

<span class="smcp">La Bruyère</span>

This is similar to one I use:

Smallcaps Unicode:

Search: (*UCP)([[:upper:]])([[:upper:]]{2,})
Replace: <span class="smallcaps">\1\L\2\E</span>

Definitely don't Replace All while using this one, as there can be many false positives.

What each part is doing, in plain English:

(*UCP) = This tells PCRE to be "unicode aware". Allows you to get those accented characters, like È.

[[:upper:]] = Grabs the first uppercase character. (Becomes Group 1)

[[:upper:]]{2,} = Grabs the next 2 or more uppercase characters. (Becomes Group 2)

* * *

Side Note: It won't work on two letter ALL CAPS words, but that can easily be adjusted by changing the {2,} into a {1,}:

Search: (*UCP)([[:upper:]])([[:upper:]]{1,})

(I do this in a later pass, because there are probably edge cases, like "DE" -> "de".)

* * *

This Replace:

<span class="smallcaps">\1\L\2\E</span>

puts Group 1 (the first uppercase letter) back unchanged. Then \L lowercases everything that follows (here, Group 2) until \E switches the lowercasing off again.

In your example text:

Code:

<p>ZOLA and LA BRUYÈRE and BALZAC</p>

After running the Regex:

Code:

<p><span class="smallcaps">Zola</span> and LA <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>

then you could run the {1,} variant to catch the "LA":

Code:

<p><span class="smallcaps">Zola</span> and <span class="smallcaps">La</span> <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>
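The two passes above can be sketched in Python for testing outside Sigil. This is only an approximation under stated assumptions: Python's re module has no (*UCP), no [[:upper:]] and no \L...\E, so the hypothetical helper smallcaps_pass matches whole words made of letters and uses str.isupper() and str.capitalize() instead. Unlike the PCRE search, it will not match a run of capitals inside a longer mixed-case word.

```python
import re

def smallcaps_pass(html, min_len):
    """Wrap ALL-CAPS words of at least min_len letters in a smallcaps span."""
    # [^\W\d_] means "a word character that is neither a digit nor an underscore",
    # i.e. a letter; in Python 3 this includes accented letters such as È.
    word = re.compile(r"\b[^\W\d_]{%d,}\b" % min_len)

    def wrap(match):
        text = match.group(0)
        if not text.isupper():
            return text  # leave mixed-case and lowercase words alone
        # capitalize() keeps the first letter upper and lowercases the rest
        return '<span class="smallcaps">%s</span>' % text.capitalize()

    return word.sub(wrap, html)

html = "<p>ZOLA and LA BRUYÈRE and BALZAC</p>"
first = smallcaps_pass(html, 3)    # like the {2,} search: 3 or more capitals
second = smallcaps_pass(first, 2)  # like the {1,} variant: also catches "LA"
print(first)
print(second)
```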

From there, you could do a search for:

Search: </span> <span class="smallcaps">
Replace: JUST-PUT-A-SPACE-HERE

that would merge "La Bruyère" into a single span.

And you would probably have to handle Middle Initials:

Search: </span> ([[:upper:]]\.) <span class="smallcaps">
Replace: \1

Note: Make sure to insert a space before and after this Replace.
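The two clean-up searches can be sketched in Python as well. The helper name merge_spans and the sample strings are made up for illustration, and the UPPER character range is an assumption that only covers A-Z plus the Latin-1 accented capitals, since Python's re has no [[:upper:]].

```python
import re

SPAN_OPEN = '<span class="smallcaps">'
UPPER = "A-ZÀ-ÖØ-Þ"  # assumption: ASCII plus Latin-1 capitals only

def merge_spans(html):
    """Join neighbouring smallcaps spans, optionally across a middle initial."""
    # "</span> <span ...>" becomes a single space
    html = html.replace("</span> " + SPAN_OPEN, " ")
    # "</span> F. <span ...>" becomes " F. " (note the spaces around \1)
    initial = re.compile(r"</span> ([%s]\.) %s" % (UPPER, re.escape(SPAN_OPEN)))
    return initial.sub(r" \1 ", html)

name = '<span class="smallcaps">La</span> <span class="smallcaps">Bruyère</span>'
with_initial = '<span class="smallcaps">John</span> F. <span class="smallcaps">Kennedy</span>'

merged_name = merge_spans(name)
merged_initial = merge_spans(with_initial)
print(merged_name)     # <span class="smallcaps">La Bruyère</span>
print(merged_initial)  # <span class="smallcaps">John F. Kennedy</span>
```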

roger64 · 07-26-2019, 12:41 AM

@Tex2002ans

Thank you very much for your detailed answer and explanation!! Much appreciated.

It worked pretty well for all 240 items.

On the remaining part of the text, I have numerous XVI that I wished to set to:

Code:

<span class="smcp">xvi</span>

I could reuse the second part of your search term for this:

Code:

<span class="smcp">\L\1\E</span>

I was able to check it on the wonderful site regex101, which also implements a PCRE (Perl-compatible) flavour.

Tex2002ans · 07-26-2019, 03:46 AM

Since Roman Numerals only use a handful of characters, I always just use something along these lines:

Search: \b([XIV]{2,})\b
Replace: <span class="romannumeral">\1</span>

That will find you most. (You can add LCDM if you want... but I've yet to see a book really go that high.)

Note #1: To get single character roman numerals, you'll have to be much more careful. I usually split it into different searches, because 'I' is such a common English word.

So something like this:

Search: \b([XV]+)\b

would find more single Roman Numerals.

Note #2: Roman Numerals usually also depend on some further context:

King Charles V ruled from [...]
See Chapter I.
The reference is on p. v.

so I use the specifics of the book to help refine the searches:

Search: Chapter ([XIV]+)
Replace: Chapter <span class="romannumeral">\1</span>
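Both searches behave the same way in Python's re module, so they are easy to try on sample text first. A small sketch follows; the sample sentences and the helper name tag_roman are made up for illustration.

```python
import re

ROMAN = re.compile(r"\b([XIV]{2,})\b")
SPAN = r'<span class="romannumeral">\1</span>'

def tag_roman(text):
    """Wrap Roman numerals of two or more characters in a span."""
    return ROMAN.sub(SPAN, text)

general = tag_roman("Louis XIV met Henri IV; I was there.")

# Context-specific search: safe for single-character numerals such as "I".
chapter = re.sub(r"Chapter ([XIV]+)", "Chapter " + SPAN, "See Chapter I.")

print(general)
print(chapter)
```

The lone pronoun "I" in the first sentence is left alone because of the {2,}, while the "Chapter" context lets the second search pick up a single "I" safely.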

odamizu · 08-05-2019, 12:05 AM

Another regex question if you will indulge me:

What's the difference between (?U) and ?

Is there an advantage to using (?U).* rather than .*? or vice versa?

Thank you!

DNSB · 08-05-2019, 12:54 AM

Quote:

Originally Posted by odamizu

Another regex question if you will indulge me:

What's the difference between (?U) and ?

Is there an advantage to using (?U).* rather than .*? or vice versa?

Thank you!

If you are dealing with Unicode and using Python2, ?U would be useful (it enables Unicode for various options and makes ignorecase use non-ASCII matching) Likely documented elsewhere but check 7.2. re — Regular expression operations for more information. Please note that it is not the same as ? and is not used the same way -- (?u) says to treat the pattern and input as Unicode so it modifies how the input and pattern are treated but is not part of those strings.

So something like (?u)(.*?) instead of (.*?) if you want to match on Unicode.

OTOH, I vaguely remember that Python3 matches on Unicode by default, making (?u) and its equivalents (re.U, re.UNICODE) obsolete.

odamizu · 08-05-2019, 01:15 AM

Quote:

Originally Posted by DNSB

If you are dealing with Unicode and using Python2, ?U would be useful (it enables Unicode for various options and makes ignorecase use non-ASCII matching) Likely documented elsewhere but check 7.2. re — Regular expression operations for more information. Please note that it is not the same as ? and is not used the same way -- (?u) says to treat the pattern and input as Unicode so it modifies how the input and pattern are treated but is not part of those strings.

So something like (?u)(.*?) instead of (.*?) if you want to match on Unicode.

OTOH, I vaguely remember that Python3 matches on Unicode by default, making (?u) and its equivalents (re.U, re.UNICODE) obsolete.

Now I'm really confused. I thought (?U) was a minimal match thing related to making something not greedy

This is in reference to the Minimal Match checkbox in Sigil's Find/Replace widget, and also to the default Example Saved Search for promoting/demoting Headings which adds (?sU) as a prefix:

Find: (?sU)<h2([^>]*>.*)</h2>
Replace: <h1\1</h1>

I'm trying to learn if an equivalent to the above Find might be:

(?s)<h2([^>]*>.*?)</h2>

If they're not equivalent, what's the difference and/or advantage to using (?U) vs .*? here?

Thank you

DiapDealer · 08-05-2019, 09:17 AM

The question mark can be a tricky bugger. When used after a character (or character class or grouping), it essentially makes what precedes it optional (technically it's a repetition operator meaning repeat the preceding item 0 or 1 times).

When used after another repetition character like + or * its effect is to make that repetition character lazy instead of its default greedy: meaning match as little as possible.

The (?U) expression does not have anything to do with Unicode when dealing with PCRE and mode modifiers. The question mark in this case is being used to signify different modes that may be turned on/off for expressions (or parts of expressions). And in this case, yes ... (?U) means turn on ungreedy mode, which reverses the greediness/laziness of ALL repetition quantifiers: (?U)a* is lazy and (?U)a*? is greedy. In Sigil, including (?U) will also reverse the effect of checking the Minimal Match box.

So in your examples:
(?U)<h2([^>]*>.*)</h2>

Would be the same as:
<h2([^>]*?>.*?)</h2>

In this particular case, <h2([^>]*?> is essentially the same as <h2([^>]*> since the negated character-class [^>] prevents the * repetition character from extending beyond the next '>' anyway.

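(?s) is essentially the same as ticking the DotAll box in Sigil. It treats everything as a single line because the dot character will match everything (including newline characters). (?m) is not its opposite but a separate mode: it changes the special ^ and $ characters so that they match at the start and end of every line instead of only at the start and end of the whole text.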

I rarely find the need to use (?Usm) myself. The DotAll and Minimal Match check boxes in Sigil achieve the same thing. The inline modifiers can be handy if you need to turn on any of the modes for only certain portions of an expression, though.

In case you've not seen it: https://www.regular-expressions.info/tutorial.html is the best free regex resource that I've personally encountered on the internet. Pretty much everything I've picked up about regex comes from there.
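The greedy/lazy difference is easy to see on a small sample. The sketch below uses Python's re module rather than Sigil's PCRE, so there is no (?U) mode to demonstrate (Python has no ungreedy flag); it only compares .* with .*? under (?s). The sample HTML is made up.

```python
import re

html = '<h2 class="a">One</h2>\n<p>text</p>\n<h2>Two</h2>'

# ? after a character: makes that character optional
optional = re.findall(r"colou?r", "color colour")

# Greedy .* with (?s): runs from the first <h2 to the LAST </h2>
greedy = re.findall(r"(?s)<h2([^>]*>.*)</h2>", html)

# Lazy .*? with (?s): stops at the first </h2> after each <h2
lazy = re.findall(r"(?s)<h2([^>]*>.*?)</h2>", html)

# The heading promotion from the saved search, written with a lazy quantifier
promoted = re.sub(r"(?s)<h2([^>]*>.*?)</h2>", r"<h1\1</h1>", html)

print(optional)     # ['color', 'colour']
print(len(greedy))  # 1
print(lazy)         # [' class="a">One', '>Two']
print(promoted)
```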

DNSB · 08-05-2019, 12:27 PM

@DiapDealer: Thanks for the correction. I ran into the (?u) as being Unicode related playing with Python a while back and only used ? for when I wanted non-greedy. The joys of trying to apply one regex implementation's documentation to other regex implementations.

DiapDealer · 08-05-2019, 12:45 PM

Quote:

Originally Posted by DNSB

@DiapDealer: Thanks for the correction. I ran into the (?u) as being Unicode related playing with Python a while back and used ? for when I wanted ungreedy.

No problem. Easy enough assumption to make. All the short versions of the python regex flags align pretty well with the PCRE mode modifiers except re.U.

re.I (?i) ignore case.
re.S (?s) single line
re.M (?m) multiline

re.U turns on the unicode behavior of {\w , \W , \b , \B} in Python, but (?U) puts repetition characters in ungreedy mode in PCRE.

To turn on Unicode support for operators like \w \W \d \b, etc. in PCRE, you need to preface the expression with (*UCP). And that's if PCRE was compiled with Unicode support (which Sigil's PCRE clearly is).
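For the Python side of that comparison, here is a short sketch of what the flags do in Python 3's re module. It says nothing about PCRE itself: in Python 3, str patterns are Unicode-aware by default, and re.ASCII is the flag that switches that off.

```python
import re

# \w is Unicode-aware by default in Python 3 ...
unicode_words = re.findall(r"\w+", "La Bruyère")
# ... and re.ASCII restricts it to [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", "La Bruyère", re.ASCII)

# (?i) / re.I: ignore case
any_case = re.findall(r"(?i)zola", "ZOLA Zola")

# (?s) / re.S: the dot also matches a newline
dot_plain = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

# (?m) / re.M: ^ and $ match at every line, not just the whole text
first_only = re.findall(r"^\w+", "one\ntwo")
every_line = re.findall(r"(?m)^\w+", "one\ntwo")

print(unicode_words)  # ['La', 'Bruyère']
print(ascii_words)    # ['La', 'Bruy', 're']
print(any_case)       # ['ZOLA', 'Zola']
print(dot_plain is None, dot_all is not None)  # True True
print(first_only, every_line)  # ['one'] ['one', 'two']
```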

theducks · 08-05-2019, 02:07 PM

Remember that the CASE of the letter changes what it does (or which mode it sets) in many places.

odamizu · 08-06-2019, 02:04 AM

Quote:

Originally Posted by DiapDealer

The question mark can be a tricky bugger. When used after a character (or character class or grouping), it essentially makes what precedes it optional (technically it's a repetition operator meaning repeat the preceding item 0 or 1 times).

When used after another repetition character like + or * its effect is to make that repetition character lazy instead of its default greedy: meaning match as little as possible.

... (?U) means turn on ungreedy mode. Which reverses the greediness/laziness of ALL repetition quantifiers. (?U)a* is lazy and (?U)a*? is greedy. In Sigil, including (?U) will also reverse the effect of checking the Minimal Match box.

So in your examples:
(?U)<h2([^>]*>.*)</h2>

Would be the same as:
<h2([^>]*?>.*?)</h2>

In this particular case, <h2([^>]*?> is essentially the same as <h2([^>]*> since the negated character-class [^>] prevents the * repetition character from extending beyond the next '>' anyway.

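(?s) is essentially the same as ticking the DotAll box in Sigil. It treats everything as a single line because the dot character will match everything (including newline characters). (?m) is not its opposite but a separate mode: it changes the special ^ and $ characters so that they match at the start and end of every line instead of only at the start and end of the whole text.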

This is awesome! Thank you so much!

Quote:

In case you've not seen it: https://www.regular-expressions.info/tutorial.html is the best free regex resource that I've personally encountered on the internet. Pretty much everything I've picked up about regex comes from there.

Funny you should mention that site as I am a regular visitor there. However, it often confuses me. I think it's written for people who have more understanding than I have

roger64 · 08-06-2019, 02:24 AM

Quote:

Originally Posted by Tex2002ans

Smallcaps Unicode:

Search: (*UCP)([[:upper:]])([[:upper:]]{2,})
Replace: <span class="smallcaps">\1\L\2\E</span>

Definitely don't Replace All while using this one, as there can be many false positives.

What each part is doing, in plain English:

(*UCP) = This tells PCRE to be "unicode aware". Allows you to get those accented characters, like È.

[[:upper:]] = Grabs the first uppercase character. (Becomes Group 1)

[[:upper:]]{2,} = Grabs the next 2 or more uppercase characters. (Becomes Group 2)

.../...

The above regex works fine.

Maybe it's a little greedy because it also transforms words written in capitals which are included in the head like "DOCTYPE".

Is there a way to make it work strictly within body tags?

DiapDealer · 08-06-2019, 09:57 AM

Quote:

Originally Posted by roger64

Is there a way to make it work strictly within body tags?

The short answer is no.
The long answer is also no--it just takes longer to read.

Needing regex to stay between certain tags, or to only include stuff between tags, means you've gone beyond what plain regex can do for you. You've moved into the realm of parsing and algorithms. The Function Mode of the Calibre Editor's Search and Replace feature comes to mind.
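To give an idea of what "parsing and algorithms" can look like, here is a rough standalone Python sketch. It is not Calibre's Function Mode API and not a real HTML parser: the hypothetical function smallcaps_in_body finds the body, splits it into tags and text, and runs a smallcaps substitution on the text pieces only. It assumes well-formed markup with no '>' inside attribute values.

```python
import re

TAG = re.compile(r"(<[^>]*>)")               # capturing group: split() keeps the tags
CAPS_WORD = re.compile(r"\b[^\W\d_]{3,}\b")  # whole words of 3 or more letters

def _smallcaps(match):
    word = match.group(0)
    if not word.isupper():
        return word
    return '<span class="smallcaps">%s</span>' % word.capitalize()

def smallcaps_in_body(document):
    """Apply the smallcaps substitution to text nodes inside <body> only."""
    body = re.search(r"(?is)<body\b[^>]*>(.*)</body>", document)
    if body is None:
        return document  # no body found: change nothing
    start, end = body.span(1)
    pieces = TAG.split(document[start:end])
    # After split(): even indexes are text, odd indexes are tags.
    for i in range(0, len(pieces), 2):
        pieces[i] = CAPS_WORD.sub(_smallcaps, pieces[i])
    return document[:start] + "".join(pieces) + document[end:]

doc = (
    "<!DOCTYPE html>\n<html><head><title>ZOLA</title></head>"
    "<body><p>ZOLA and BALZAC</p></body></html>"
)
result = smallcaps_in_body(doc)
print(result)
```

With this sample, "DOCTYPE" and the title stay untouched, and only the two names inside the paragraph are wrapped.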

