MobileRead Forums - View Single Post

DiapDealer · 08-05-2019, 09:17 AM

The question mark can be a tricky bugger. When used after a character (or character class or grouping), it essentially makes what precedes it optional (technically, it's a repetition operator meaning "match the preceding item 0 or 1 times").

When used after another repetition character like + or *, its effect is to make that repetition lazy instead of the default greedy, meaning it matches as little as possible.
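
To make the two roles of the question mark concrete, here is a minimal sketch. It uses Python's re module purely as a stand-in (Sigil uses PCRE, but these two behaviours are the same in both); the sample strings and variable names are made up for illustration.

```python
import re

# Role 1: '?' after a single item makes that item optional (0 or 1 times).
optional = re.compile(r"colou?r")
both_spellings = [m.group() for m in optional.finditer("color colour")]

# Role 2: '?' after another quantifier makes that quantifier lazy.
text = "<b>one</b> and <b>two</b>"
greedy = re.search(r"<b>.+</b>", text).group()   # grabs as much as possible
lazy = re.search(r"<b>.+?</b>", text).group()    # stops at the first </b>

print(both_spellings)  # ['color', 'colour']
print(greedy)          # <b>one</b> and <b>two</b>
print(lazy)            # <b>one</b>
```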

The (?U) expression does not have anything to do with Unicode when dealing with PCRE and mode modifiers. The question mark in this case is being used to signify different modes that may be turned on/off for expressions (or parts of expressions). And in this case, yes ... (?U) means turn on ungreedy mode, which reverses the greediness/laziness of ALL repetition quantifiers: (?U)a* is lazy and (?U)a*? is greedy. In Sigil, including (?U) will also reverse the effect of checking the Minimal Match box.

So in your examples:
(?U)<h2([^>]*>.*)</h2>

Would be the same as:
<h2([^>]*?>.*?)</h2>

In this particular case, <h2([^>]*?> is essentially the same as <h2([^>]*> since the negated character class [^>] prevents the * repetition character from extending beyond the next '>' anyway.
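
A quick way to check that claim is to run both spellings against the same text. This is only a sketch in Python: as far as I know Python's re module has no ungreedy (?U) mode, so it compares the explicitly lazy forms instead, and the sample HTML is invented.

```python
import re

# Hypothetical sample: two headings on one line, each with an attribute.
html = '<h2 class="a">First</h2><p>x</p><h2 id="b">Second</h2>'

# Every quantifier lazy (what (?U) would give you in PCRE).
explicit_lazy = re.findall(r"<h2([^>]*?>.*?)</h2>", html)

# Only the dot is lazy; [^>]* stays greedy but still stops at the first '>'.
class_greedy = re.findall(r"<h2([^>]*>.*?)</h2>", html)

# Everything greedy: the dot runs on to the LAST </h2>.
all_greedy = re.findall(r"<h2([^>]*>.*)</h2>", html)

print(explicit_lazy)  # [' class="a">First', ' id="b">Second']
print(class_greedy)   # same two matches
print(all_greedy)     # one long match spanning both headings
```

So the laziness that actually matters in this pattern is the one on .* and not the one on [^>]*.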

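(?s) is essentially the same as ticking the dotAll box in Sigil. It treats everything as a single line because the dot character will match everything (including newline characters). It is often mentioned alongside (?m), but the two are not opposites: (?s) only changes what the dot matches, while (?m) (multiline mode) changes the special ^ and $ characters so that they match at the start and end of every line instead of only at the start and end of the whole text. To switch dotAll back off, use (?-s).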
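
Here is a small sketch of what each of the two modes changes, again using Python's re module as a stand-in for PCRE (both spell these two modifiers the same way). The sample strings are made up.

```python
import re

# (?s): the dot also matches newline characters.
para = "<p>first\nsecond</p>"
without_s = re.search(r"<p>.*</p>", para)      # dot stops at the newline
with_s = re.search(r"(?s)<p>.*</p>", para)     # dot crosses the newline

# (?m): ^ and $ match at the start and end of every line.
lines = "alpha\nbeta"
without_m = re.findall(r"^\w+$", lines)        # whole text is not one word
with_m = re.findall(r"(?m)^\w+$", lines)       # each line is checked on its own

print(without_s)       # None
print(with_s.group())  # the whole paragraph, newline included
print(without_m)       # []
print(with_m)          # ['alpha', 'beta']
```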

I rarely find the need to use (?Usm) myself. The dotAll and Minimal Match check boxes in Sigil achieve the same thing as (?s) and (?U). The inline modifiers can be handy if you need to turn on any of the modes for only certain portions of an expression, though.
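
As a rough illustration of switching a mode on for just part of an expression, the sketch below uses the scoped form (?s:...). It is written for Python's re module (3.6 or later); PCRE accepts the same scoped syntax, but treat the exact behaviour in Sigil as something to verify yourself. The pattern and sample text are hypothetical.

```python
import re

# dotAll applies only inside (?s:...); the second dot keeps its default.
pattern = re.compile(r"<h2>(?s:(.*?))</h2>\n<p>(.*)</p>")

wrapped_heading = "<h2>Part\nOne</h2>\n<p>body</p>"
m = pattern.search(wrapped_heading)
heading, body = m.group(1), m.group(2)

# A newline inside the paragraph is NOT matched, because that dot
# sits outside the scoped group.
wrapped_body = "<h2>Part\nOne</h2>\n<p>bo\ndy</p>"
no_match = pattern.search(wrapped_body)

print(repr(heading))  # 'Part\nOne'
print(body)           # body
print(no_match)       # None
```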

In case you've not seen it: https://www.regular-expressions.info/tutorial.html is the best free regex resource that I've personally encountered on the internet. Pretty much everything I've picked up about regex comes from there.

Last edited by DiapDealer; 08-05-2019 at 09:27 AM.