07-19-2019, 04:19 PM | #586 | |
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
For example, if your text contains:
Code:
123456789ABCDEF
and you search for:
Code:
(.)(.)(.)(.)(.)(.)(.)(.)(.)(?<a>.)(?<b>.)(?<c>.)(?<d>.)(?<e>.)(?<f>.)
and replace with:
Code:
\g{f}\g{e}\g{d}\g{c}\g{b}\g{a}
the result is:
Code:
FEDCBA
|
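For anyone who wants to try the named-group swap outside Sigil, here is a minimal sketch in Python. It assumes Python's own `re` syntax, which differs from the PCRE syntax quoted above: groups are written `(?P<name>...)` and referenced in the replacement as `\g<name>`. The variable names are just for illustration.

```python
import re

text = "123456789ABCDEF"

# Nine unnamed groups followed by six named groups (a..f).
# Python spells named groups (?P<name>...) rather than PCRE's (?<name>...).
pattern = re.compile(
    r"(.)(.)(.)(.)(.)(.)(.)(.)(.)"
    r"(?P<a>.)(?P<b>.)(?P<c>.)(?P<d>.)(?P<e>.)(?P<f>.)"
)

# Python references named groups in the replacement as \g<name>.
result = pattern.sub(r"\g<f>\g<e>\g<d>\g<c>\g<b>\g<a>", text)
print(result)  # FEDCBA
```

The named groups make the replacement readable without having to count up to group 15.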
|
07-25-2019, 04:28 PM | #587 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
About small caps.
In a big bibliographical file, I have about 300 author names written in caps, like ZOLA, LA BRUYÈRE, BALZAC and so on. I would like to write a regex that would let me set each name in small caps this way:
Code:
<span class="smcp">La Bruyère</span>
|
07-25-2019, 05:25 PM | #588 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Fascinating. Had no idea about subpatterns. But once you reach 10 groups, it's probably best to break the Regex down into smaller, more understandable chunks. Quote:
Smallcaps Unicode:
Search: (*UCP)([[:upper:]])([[:upper:]]{2,})
Replace: <span class="smallcaps">\1\L\2\E</span>
Definitely don't Replace All while using this one, as there can be many false positives.
What each part is doing, in plain English:
(*UCP) = Tells PCRE to be "Unicode aware". This lets you match accented characters, like È.
[[:upper:]] = Grabs the first uppercase character. (Becomes Group 1)
[[:upper:]]{2,} = Grabs the next 2 or more uppercase characters. (Becomes Group 2)
* * *
Side Note: It won't work on two-letter ALL CAPS words, but that can easily be adjusted by changing the {2,} into a {1,}:
Search: (*UCP)([[:upper:]])([[:upper:]]{1,})
(I do this in a later pass, because there are probably edge cases, like "DE" -> "de".)
* * *
This Replace:
<span class="smallcaps">\1\L\2\E</span>
puts Group 1 (the first uppercase letter) back unchanged. The \L then lowercases everything that follows (Group 2) until \E switches the lowercasing off again.
In your example text:
Code:
<p>ZOLA and LA BRUYÈRE and BALZAC</p>
the first search (with {2,}) gives:
Code:
<p><span class="smallcaps">Zola</span> and LA <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>
and the later pass (with {1,}) gives:
Code:
<p><span class="smallcaps">Zola</span> and <span class="smallcaps">La</span> <span class="smallcaps">Bruyère</span> and <span class="smallcaps">Balzac</span></p>
Then:
Search: </span> <span class="smallcaps">
Replace: JUST-PUT-A-SPACE-HERE
would merge "La Bruyère" into a single span.
And you would probably have to handle Middle Initials:
Search: </span> ([[:upper:]]\.) <span class="smallcaps">
Replace: \1
Note: Make sure to insert a space before and after the \1 in this Replace.
Last edited by Tex2002ans; 07-25-2019 at 05:28 PM. |
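The passes and the merge step above can also be sketched in Python, with some caveats. Python's `re` module has no `[[:upper:]]` class and no `\L...\E` case conversion in replacements, so this sketch swaps in a different technique: it matches runs of letters with `[^\W\d_]`, checks `str.isupper()` in a replacement function, and lowercases there. The function names `to_smallcaps` and `merge_adjacent` and the `min_len` parameter are made up for this illustration.

```python
import re


def to_smallcaps(text, min_len=3, css_class="smallcaps"):
    """Wrap ALL-CAPS words of at least min_len letters in a span,
    keeping the first letter and lowercasing the rest."""
    # [^\W\d_] means "a letter"; Python 3 str patterns are Unicode-aware,
    # so accented capitals such as È are included.
    word_re = re.compile(rf"[^\W\d_]{{{min_len},}}")

    def repl(match):
        word = match.group(0)
        if not word.isupper():
            return word  # leave mixed-case and lowercase words alone
        return f'<span class="{css_class}">{word[0]}{word[1:].lower()}</span>'

    return word_re.sub(repl, text)


def merge_adjacent(text, css_class="smallcaps"):
    """Merge two spans separated by a single space into one span."""
    return text.replace(f'</span> <span class="{css_class}">', " ")


sample = "<p>ZOLA and LA BRUYÈRE and BALZAC</p>"
first_pass = to_smallcaps(sample)                 # 3+ letters, like {2,}
merged = merge_adjacent(to_smallcaps(sample, min_len=2))  # 2+ letters, like {1,}
print(first_pass)
print(merged)
```

Unlike the PCRE search, this only converts words that are entirely uppercase, so a string like "ABCdef" is left alone. As with the original, it is worth reviewing the matches rather than trusting a blanket replace.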
||
07-26-2019, 12:41 AM | #589 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
@Tex2002ans
Thank you very much for your detailed answer and explanation!! Much appreciated. It worked pretty well for all 240 items. On the remaining part of the text, I have numerous XVI that I wished to set to:
Code:
<span class="smcp">xvi</span>
Code:
<span class="smcp">\L\1\E</span>
Last edited by roger64; 07-26-2019 at 02:27 AM. Reason: success |
07-26-2019, 03:46 AM | #590 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Since Roman numerals only use a handful of characters, I always just use something along these lines:
Search: \b([XIV]{2,})\b
Replace: <span class="romannumeral">\1</span>
That will find most of them. (You can add LCDM if you want... but I've yet to see a book really go that high.)
Note #1: To get single-character Roman numerals, you'll have to be much more careful. I usually split it into different searches, because 'I' is such a common English word. So something like this:
Search: \b([XV]+)\b
would also find the single-character numerals V and X while skipping 'I'.
Note #2: Roman numerals usually also depend on some further context, so I use the specifics of the book to help refine the searches:
Search: Chapter ([XIV]+)
Replace: Chapter <span class="romannumeral">\1</span>
Last edited by Tex2002ans; 07-26-2019 at 03:53 AM. |
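A quick way to see what these two searches catch is to run them in Python, where `\b`, character classes and `\1` behave the same way for this pattern. This is only a sketch; the helper names `wrap_numerals` and `wrap_chapter_numerals` and the sample sentences are invented for the example.

```python
import re

SPAN = r'<span class="romannumeral">\1</span>'

# Two or more of X, I, V as a whole word: skips the pronoun "I".
NUMERAL = re.compile(r"\b([XIV]{2,})\b")

# Context-based search: any numeral, even a single letter, after "Chapter ".
CHAPTER = re.compile(r"Chapter ([XIV]+)")


def wrap_numerals(text):
    return NUMERAL.sub(SPAN, text)


def wrap_chapter_numerals(text):
    return CHAPTER.sub("Chapter " + SPAN, text)


general = wrap_numerals("In chapter XVI, I met Louis XIV.")
chapter = wrap_chapter_numerals("Chapter V: I arrive")
print(general)
print(chapter)
```

The first search leaves the lone "I" untouched, while the context-based one can safely pick up a single "V" because the word "Chapter" narrows the match.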
08-05-2019, 12:05 AM | #591 |
just an egg
Posts: 1,586
Karma: 4300000
Join Date: Mar 2015
Device: Kindle, iOS
|
Another regex question if you will indulge me:
What's the difference between (?U) and ? Is there an advantage to using (?U).* rather than .*? or vice versa? Thank you! |
08-05-2019, 12:54 AM | #592 | |
Bibliophagist
Posts: 35,401
Karma: 145435140
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Quote:
So something like (?u)(.*?) instead of (.*?) if you want to match on Unicode. OTOH, I vaguely remember that Python 3 matches on Unicode by default, making (?u) and its equivalents (re.U, re.UNICODE) obsolete. |
|
08-05-2019, 01:15 AM | #593 | |
just an egg
Posts: 1,586
Karma: 4300000
Join Date: Mar 2015
Device: Kindle, iOS
|
Quote:
This is in reference to the Minimal Match checkbox in Sigil's Find/Replace widget, and also to the default Example Saved Search for promoting/demoting Headings, which adds (?sU) as a prefix:
Find: (?sU)<h2([^>]*>.*)</h2>
Replace: <h1\1</h1>
I'm trying to learn whether an equivalent to the above Find might be:
(?s)<h2([^>]*>.*?)</h2>
If they're not equivalent, what's the difference and/or advantage of using (?U) vs .*? here? Thank you
Last edited by odamizu; 08-06-2019 at 01:06 AM. Reason: correction: example saved search, not default |
|
08-05-2019, 09:17 AM | #594 |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
The question mark can be a tricky bugger. When used after a character (or character class or grouping), it essentially makes what precedes it optional (technically it's a repetition operator meaning "repeat the preceding item 0 or 1 times").
When used after another repetition character like + or *, its effect is to make that repetition lazy instead of its default greedy, meaning it matches as little as possible.
The (?U) expression does not have anything to do with Unicode when dealing with PCRE and mode modifiers. The question mark in this case is being used to signify different modes that may be turned on/off for expressions (or parts of expressions). And in this case, yes ... (?U) means turn on ungreedy mode, which reverses the greediness/laziness of ALL repetition quantifiers: (?U)a* is lazy and (?U)a*? is greedy. In Sigil, including (?U) will also reverse the effect of checking the Minimal Match box.
So in your examples:
(?U)<h2([^>]*>.*)</h2>
would be the same as:
<h2([^>]*?>.*?)</h2>
In this particular case, <h2([^>]*?> is essentially the same as <h2([^>]*> since the negated character class [^>] prevents the * repetition character from extending beyond the next '>' anyway.
(?s) is essentially the same as ticking the dotAll box in Sigil. It treats everything as a single line because the dot character will match everything (including newline characters). (?m) is often presented as its opposite, but the two are independent: (?s) changes what the dot matches, while (?m) changes where the special ^ and $ characters match (at the start and end of each line rather than only of the whole text).
I rarely find the need to use (?Usm) myself. The dotAll and Minimal Match check boxes in Sigil achieve the same thing. The inline modifiers can be handy if you need to turn on any of the modes for only certain portions of an expression, though.
In case you've not seen it: https://www.regular-expressions.info/tutorial.html is the best free regex resource that I've personally encountered on the internet. Pretty much everything I've picked up about regex comes from there.
Last edited by DiapDealer; 08-05-2019 at 09:27 AM. |
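To make the greedy/lazy difference concrete, here is a small sketch in Python using the heading example from this thread. Python's `re` module has no PCRE-style (?U) ungreedy switch, so only the explicit `.*` and `.*?` forms are compared; the sample HTML is made up.

```python
import re

html = "<h2 class='a'>One</h2>\n<p>text</p>\n<h2>Two</h2>"

# Greedy: .* runs to the LAST </h2> in the text (dot matches newlines via (?s)).
greedy = re.compile(r"(?s)<h2([^>]*>.*)</h2>")

# Lazy: .*? stops at the FIRST </h2> after each opening tag.
lazy = re.compile(r"(?s)<h2([^>]*>.*?)</h2>")

greedy_result = greedy.sub(r"<h1\1</h1>", html)
lazy_result = lazy.sub(r"<h1\1</h1>", html)

print(greedy_result)
print(lazy_result)
```

The greedy version makes a single match spanning both headings, so only the outermost tags are changed and the inner `</h2>` and `<h2>` are left behind. The lazy version converts each heading separately, which is what the saved search is after.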
08-05-2019, 12:27 PM | #595 |
Bibliophagist
Posts: 35,401
Karma: 145435140
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
@DiapDealer: Thanks for the correction. I ran into (?u) as a Unicode flag while playing with Python a while back, and only used ? when I wanted non-greedy matching. The joys of trying to apply one regex implementation's documentation to other regex implementations.
Last edited by DNSB; 08-05-2019 at 12:39 PM. |
08-05-2019, 12:45 PM | #596 | |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
re.I (?i) ignore case.
re.S (?s) single line
re.M (?m) multiline
re.U turns on the Unicode behavior of \w, \W, \b and \B in Python, but (?U) puts repetition characters in ungreedy mode in PCRE. To turn on Unicode support for operators like \w \W \d \b, etc. in PCRE, you need to preface the expression with (*UCP). And that's only if PCRE was compiled with Unicode support (which Sigil's PCRE clearly is).
Last edited by DiapDealer; 08-05-2019 at 01:02 PM. |
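On the Python side of this, a short sketch shows why re.U is effectively redundant in Python 3: `str` patterns are already Unicode-aware, and it is `re.ASCII` that switches the behavior off. The sample name is taken from earlier in the thread.

```python
import re

name = "LA BRUYÈRE"

# Default in Python 3: \w is Unicode-aware, so È counts as a word character.
unicode_words = re.findall(r"\w+", name)

# re.UNICODE is accepted but changes nothing for str patterns.
explicit_unicode = re.findall(r"\w+", name, flags=re.UNICODE)

# re.ASCII restricts \w to [a-zA-Z0-9_], so È splits the word.
ascii_words = re.findall(r"\w+", name, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

This is roughly the reverse of PCRE, where the ASCII-only behavior is the default and (*UCP) opts in to Unicode.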
|
08-05-2019, 02:07 PM | #597 |
Well trained by Cats
Posts: 29,801
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Remember that the CASE of the letter changes what it does (or which mode it sets) in many places.
|
08-06-2019, 02:04 AM | #598 | ||
just an egg
Posts: 1,586
Karma: 4300000
Join Date: Mar 2015
Device: Kindle, iOS
|
Quote:
Quote:
|
||
08-06-2019, 02:24 AM | #599 | |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
About small caps
Quote:
Maybe it's a little greedy because it also transforms words written in capitals which are included in the head like "DOCTYPE". Is there a way to make it work strictly within body tags? |
|
08-06-2019, 09:57 AM | #600 |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
The short answer is no.
The long answer is also no--it just takes longer to read. Needing regex to stay between certain tags, or to only include stuff between tags, means you've gone beyond what plain regex can do for you. You've moved into the realm of parsing and algorithms. The Function Mode of the Calibre Editor's Search and Replace feature comes to mind. Last edited by DiapDealer; 08-06-2019 at 10:00 AM. |
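As a rough idea of what "parsing and algorithms" could look like here, the following standalone Python sketch applies the small-caps conversion only to text that sits inside the body. It is not Calibre's Function Mode API, just plain Python under some assumptions: tags are split off with a simple `<[^>]*>` pattern (which would be fooled by a '>' inside a comment or attribute), and the function name `smallcaps_in_body` is invented for the example.

```python
import re

TAG_SPLIT = re.compile(r"(<[^>]*>)")      # capture group keeps the tags in the split
CAPS_WORD = re.compile(r"[^\W\d_]{3,}")   # runs of 3 or more letters


def _wrap(match):
    word = match.group(0)
    if not word.isupper():
        return word
    return f'<span class="smallcaps">{word[0]}{word[1:].lower()}</span>'


def smallcaps_in_body(html):
    """Convert ALL-CAPS words to small-caps spans, but only in text
    found between <body ...> and </body>."""
    parts = TAG_SPLIT.split(html)
    in_body = False
    out = []
    # With one capture group, split() puts text at even indexes and tags at odd ones.
    for index, part in enumerate(parts):
        if index % 2 == 1:
            lowered = part.lower()
            if lowered.startswith("<body"):
                in_body = True
            elif lowered.startswith("</body"):
                in_body = False
            out.append(part)          # tags are never modified
        elif in_body:
            out.append(CAPS_WORD.sub(_wrap, part))
        else:
            out.append(part)          # head content, DOCTYPE text, etc.
    return "".join(out)


page = ("<!DOCTYPE html><html><head><title>ZOLA</title></head>"
        "<body><p>ZOLA and BALZAC</p></body></html>")
converted = smallcaps_in_body(page)
print(converted)
```

Because tags are passed through untouched and text outside the body is skipped, "DOCTYPE" and the title are left alone while the names in the paragraph are converted.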
|