#1 |
Zealot
Posts: 107
Karma: 1000
Join Date: Mar 2011
Device: Kindle
|
More than 9 back references in regex
Hello!
I'm processing a book where I need more than 9 back references. Starting with \10, Sigil just treats that as literal replacement text instead of a reference to the 10th part of the search expression (please forgive me if I'm not using the proper terms). I found online that some regex systems can use $10, $11, etc., but that doesn't seem to work either. At this point my only option seems to be to combine the book into a single text file and work with it in Notepad++. Not the worst option in the world, but I thought I would ask here first. Thanks!!
#2 | |
Grand Sorcerer
Posts: 5,725
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
For example, if your book contains the string:
Code:
abcdefghijkl
Find:
Code:
(?<a>a)(?<b>b)(?<c>c)(?<d>d)(?<e>e)(?<f>f)(?<g>g)(?<h>h)(?<i>i)(?<j>j)(?<attr>k)(?<l>l)
Replace:
Code:
\g{f}\g{e}\g{d}\g{c}\g{b}\g{a}\g{j}\g{l}\g{attr}
Result:
Code:
fedcbajlk
Last edited by Doitsu; 03-23-2021 at 05:24 AM.
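For anyone who wants to try the same named-group idea outside Sigil, here is a minimal sketch using Python's re module. Note that the syntax is not identical to the PCRE form shown in this post: Python spells named groups (?P<name>...) and refers to them in the replacement as \g<name>. The function name reorder is made up for illustration.

```python
import re

# Twelve named groups, one per letter, mirroring the Find expression in this post.
# Python writes named groups as (?P<name>...) rather than PCRE's (?<name>...).
NAMES = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "attr", "l"]
LETTERS = "abcdefghijkl"
FIND = "".join(f"(?P<{name}>{ch})" for name, ch in zip(NAMES, LETTERS))

# Same group order as the \g{...} expression in this post, in Python's \g<name> form.
ORDER = ["f", "e", "d", "c", "b", "a", "j", "l", "attr"]
REPLACE = "".join(rf"\g<{name}>" for name in ORDER)


def reorder(text: str) -> str:
    """Apply the named-group replacement to every match in text (hypothetical helper)."""
    return re.sub(FIND, REPLACE, text)


print(reorder("abcdefghijkl"))  # fedcbajlk
```

Because the groups are referenced by name, the replacement stays readable no matter how many groups the pattern has, and groups that are not mentioned (g, h, i here) are simply dropped.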
|
#3 |
A Hairy Wizard
Posts: 3,346
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
You could also process the book with multiple passes.
|
#4 |
Grand Sorcerer
Posts: 28,557
Karma: 204127028
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
All available documentation I can find suggests that PCRE should easily support both \1 through \9 and \10 through \99 backreferences, but Sigil's bundled PCRE clearly does not. It does, however, seem to allow the \g{n} backreference syntax, which can exceed the 9-backreference limit.
String: <p>0123456789abc</p>
Find: (\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)([a-z])([a-z])([a-z])
Replace: \1\2\3\4\5\6\7\8\9\g{10}\g{11}\g{12}\g{13}
The bottom line seems to be that anything other than a single digit (0-9) after the backslash is ambiguous. It could be a backreference, or it could be a character code (or an octal number). For completely unambiguous double-digit backreferences, always use the \g{nn} syntax.
From Sigil's src/PCRE/SPCRE.cpp:
Code:
// The maximum number of catpures that we will allow.
const int PCRE_MAX_CAPTURE_GROUPS = 30;
Last edited by DiapDealer; 03-23-2021 at 01:26 PM.
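The 13-group example from this post can be tried outside Sigil as well. The sketch below uses Python's re module, where the unambiguous numbered form is written \g<n> instead of \g{n}; that difference and the helper name reverse_groups are the only things not taken from the post. The group order is reversed so the effect of groups 10 through 13 is visible in the output.

```python
import re

# Thirteen capture groups: ten digits followed by three lowercase letters,
# the same shape as the Find expression in this post.
FIND = r"(\d)" * 10 + r"([a-z])" * 3

# \g<n> is Python's unambiguous numbered reference (the PCRE/Sigil spelling is \g{n}).
# Groups are emitted from 13 down to 1 so the two-digit references are easy to spot.
REPLACE = "".join(rf"\g<{n}>" for n in range(13, 0, -1))


def reverse_groups(text: str) -> str:
    """Replace each 13-group match with its groups in reverse order (hypothetical helper)."""
    return re.sub(FIND, REPLACE, text)


print(reverse_groups("<p>0123456789abc</p>"))  # <p>cba9876543210</p>
```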
#5 | |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Usually when you reach beyond 9 capture groups in a single regex... there's some sort of underlying issue that can be solved more efficiently. |
|
#6 | |
Zealot
Posts: 107
Karma: 1000
Join Date: Mar 2011
Device: Kindle
|
Quote:
I have an interlinear text with verses where the six lines of text are interspersed, and each line has an index number.
<p> index1 originalline1 translationline1 index2 originalline2 translationline2 index3 originalline3 translationline3 index4 originalline4 translationline4 index5 originalline5 translationline5 index6 originalline6 translationline6</p>
And what I need is
<p> index1 originalline1 index2 originalline2 index3 originalline3 index4 originalline4 index5 originalline5 index6 originalline6 index1 translationline1 index2 translationline2 index3 translationline3 index4 translationline4 index5 translationline5 index6 translationline6</p>
But labeling the segments works just fine. Thanks everyone for all the help!!!
|
#7 | ||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
2. Are the index numbers chronological?
3. Are the different languages marked up in any way? (How can you tell which one is language1 + language2?)
Step 1 would be infinitely easier if your stuff is actually marked up with proper lang:
Quote:
Note #2: What I'm thinking is a multi-pass approach. Something like:
Step 1: Split the original+translated and temporarily tag the index:
Code:
Orig:index1 originalline1
Tran:index1 translationline1
Orig:index2 originalline2
Tran:index2 translationline2
[...]
Orig:index6 originalline6
Tran:index6 translationline6
Code:
Orig:index1 originalline1
Orig:index2 originalline2
[...]
Orig:index6 originalline6
Tran:index1 translationline1
Tran:index2 translationline2
[...]
Tran:index6 translationline6
... But it all depends on your actual text...
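The two passes described in this post could be sketched in Python roughly as follows. This is only an illustration under strong assumptions: the sample paragraph is invented, every field (index, original, translation) is assumed to be a single token without spaces, and the names TRIPLE and regroup are hypothetical. Real verse text would need a pattern that matches however the segments are actually delimited.

```python
import re

# Invented sample in the shape described in the thread:
# index, original, translation, repeated inside one <p>.
SAMPLE = "<p> 1 alpha one 2 beta two 3 gamma three</p>"

# One small three-group pattern reused for every triple, instead of one regex
# with 18 groups. ASSUMPTION: each field is a single whitespace-free token.
TRIPLE = re.compile(r"(?P<index>\S+)\s+(?P<orig>\S+)\s+(?P<tran>\S+)")


def regroup(paragraph: str) -> str:
    """Move all originals before all translations, keeping each index (hypothetical helper)."""
    wrapper = re.fullmatch(r"<p>\s*(.*?)\s*</p>", paragraph, flags=re.S)
    if wrapper is None:
        return paragraph  # not a single <p>...</p>; leave it untouched
    # Pass 1: split each triple and tag both halves with their index.
    tagged = []
    for m in TRIPLE.finditer(wrapper.group(1)):
        tagged.append(("Orig", m["index"], m["orig"]))
        tagged.append(("Tran", m["index"], m["tran"]))
    # Pass 2: all Orig entries first, then all Tran entries, each in original order.
    ordered = [t for t in tagged if t[0] == "Orig"] + [t for t in tagged if t[0] == "Tran"]
    return "<p> " + " ".join(f"{index} {text}" for _, index, text in ordered) + "</p>"


print(regroup(SAMPLE))  # <p> 1 alpha 2 beta 3 gamma 1 one 2 two 3 three</p>
```

The point of the design is that the pattern never needs more than three groups, however many index/original/translation triples a paragraph holds.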
||