More than 9 back references in regex

BKh · 03-23-2021, 04:19 AM

Hello!

I'm processing a book where I need to have more than 9 back references. Starting with \10, Sigil just treats that as the text to replace with instead of the 10th part of the search expression (please forgive me if I'm not using the proper terms).

I found online that some regex systems can use $10, $11, etc., but that doesn't seem to work either.

At this point my only option seems to be to combine the book into a single text file and work with it in Notepad++. Not the worst option in the world, but I thought I would ask here first.

Thanks!!

Doitsu · 03-23-2021, 05:20 AM

Quote:

Originally Posted by BKh

I found online that some regex systems can use $10, $11, etc., but that doesn't seem to work either.

Sigil uses the PCRE2 engine for regex searches, which supports named subpatterns.

For example, if your book contains the string:

Code:

abcdefghijkl

you could use the following expression to capture each letter:

Find:

Code:

(?<a>a)(?<b>b)(?<c>c)(?<d>d)(?<e>e)(?<f>f)(?<g>g)(?<h>h)(?<i>i)(?<j>j)(?<attr>k)(?<l>l)

Replace:

Code:

\g{f}\g{e}\g{d}\g{c}\g{b}\g{a}\g{j}\g{l}\g{attr}

You'll end up with:

Code:

fedcbajlk
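For anyone who wants to try the same idea outside Sigil, here is a minimal sketch using Python's built-in `re` module. Note that this is not Sigil's PCRE2 engine, so the syntax differs: Python writes named groups as `(?P<name>...)` and references them in the replacement as `\g<name>` rather than `\g{name}`. The helper name `reorder` is made up for illustration.

```python
import re

# Python's re syntax, NOT Sigil's PCRE2 syntax:
#   (?P<name>...) instead of (?<name>...), and \g<name> instead of \g{name}.
FIND = re.compile(
    r"(?P<a>a)(?P<b>b)(?P<c>c)(?P<d>d)(?P<e>e)(?P<f>f)"
    r"(?P<g>g)(?P<h>h)(?P<i>i)(?P<j>j)(?P<attr>k)(?P<l>l)"
)
# Groups g, h and i are captured but not used, so they are dropped.
REPLACE = r"\g<f>\g<e>\g<d>\g<c>\g<b>\g<a>\g<j>\g<l>\g<attr>"


def reorder(text: str) -> str:
    """Hypothetical helper: apply the named-group replacement to text."""
    return FIND.sub(REPLACE, text)


print(reorder("abcdefghijkl"))  # fedcbajlk
```

Named groups avoid the numbering question entirely, since neither the pattern nor the replacement relies on a two-digit index.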

Turtle91 · 03-23-2021, 08:03 AM

You could also process the book with multiple passes.

DiapDealer · 03-23-2021, 09:59 AM

All available documentation I can find suggests that PCRE should be able to easily support both \1 through \9 and \10 through \99 backreferences, but clearly Sigil's bundled PCRE does not. But it seems the PCRE bundled with Sigil DOES allow for the \g{n} backreference syntax which can exceed the 9 backreference limit.

String: 0123456789abc
Find: (\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)([a-z])([a-z])([a-z])
Replace: \1\2\3\4\5\6\7\8\9\g{10}\g{11}\g{12}\g{13}

The bottom line seems to be that anything other than a single digit (0-9) after the backslash is ambiguous. It could be a backreference, or it could be a character code (or an octal number). For completely unambiguous double-digit backreferences, always use the \g{nn} syntax.
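The same ambiguity shows up in other engines. As a rough illustration only, the sketch below uses Python's `re` module (not Sigil's PCRE), where the unambiguous form is spelled `\g<n>` instead of `\g{n}`. It reuses the thirteen-group Find expression above; the helper name `reverse_groups` is hypothetical.

```python
import re

# Thirteen numbered groups, mirroring the Find expression above.
FIND = re.compile(r"(\d)" * 10 + r"([a-z])" * 3)


def reverse_groups(text: str) -> str:
    """Hypothetical helper: emit the 13 captures in reverse order."""
    # \g<n> is Python's spelling of an unambiguous numbered reference.
    replacement = "".join(rf"\g<{n}>" for n in range(13, 0, -1))
    return FIND.sub(replacement, text)


print(reverse_groups("0123456789abc"))  # cba9876543210

# "Group 1 followed by a literal 0" has to be written \g<1>0.
print(re.sub(r"(a)(b)", r"\g<1>0", "ab"))  # a0

# A bare \10 is read as "group 10", which this two-group pattern lacks,
# so Python raises an error instead of guessing.
try:
    re.sub(r"(a)(b)", r"\10", "ab")
except re.error as exc:
    print("re.error:", exc)
```

How a bare `\10` is interpreted varies by engine, which is why the explicit form is the safer habit.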

From Sigil's src/PCRE/SPCRE.cpp:

Code:

// The maximum number of catpures that we will allow.
const int PCRE_MAX_CAPTURE_GROUPS = 30;

So the number of backreferences will also be capped at 30 (provided 30 groups were, in fact, captured), whether accessed by name or by number via the \g{} syntax.

Tex2002ans · 03-23-2021, 03:58 PM

Quote:

Originally Posted by BKh

I'm processing a book where I need to have more than 9 back references. Starting with \10, Sigil just treats that as the text to replace with instead of the 10th part of the search expression [...].

What is the problem you're trying to solve? Can you give some before/after code examples?

Usually when you reach beyond 9 capture groups in a single regex... there's some sort of underlying issue that can be solved more efficiently.

BKh · 03-24-2021, 10:58 PM

Quote:

Originally Posted by Tex2002ans

What is the problem you're trying to solve? Can you give some before/after code examples?

Usually when you reach beyond 9 capture groups in a single regex... there's some sort of underlying issue that can be solved more efficiently.

Absolutely! And if my multiple attempts to learn Python had ever succeeded I wouldn't be in this situation.

I have an interlinear text with verses where the six lines of text are interspersed, and each line has an index number.


index1 originalline1 translationline1
index2 originalline2 translationline2
index3 originalline3 translationline3
index4 originalline4 translationline4
index5 originalline5 translationline5
index6 originalline6 translationline6

And what I need is


index1 originalline1
index2 originalline2
index3 originalline3
index4 originalline4
index5 originalline5
index6 originalline6

index1 translationline1
index2 translationline2
index3 translationline3
index4 translationline4
index5 translationline5
index6 translationline6

But labeling the segments works just fine.

Thanks everyone for all the help!!!

Tex2002ans · 03-25-2021, 03:02 AM

Quote:

Originally Posted by BKh

I have an interlinear text with verses where the six lines of text are interspersed, and each line has an index number.

Code:

<p>index1 originalline1 translationline1
index2 originalline2 translationline2
index3 originalline3 translationline3
index4 originalline4 translationline4
index5 originalline5 translationline5
index6 originalline6 translationline6</p>

1. Can you show a few actual examples out of your book?

2. Are the index numbers chronological?

3. Are the different languages marked up in any way? (How can you tell which one is language1 + language2?)

Step 1 would be infinitely easier if your stuff is actually marked up with proper lang:

Quote:

100 This is an example verse Este es un verso de ejemplo
101 [...] [...]

Note: To format your code nicer, you can put your code in between [CODE][/CODE] tags. In MobileRead's "Advanced Editor", it also looks like a # button.

Note #2: What I'm thinking is a multi-pass approach. Something like:

Step 1: Split the original+translated and temporarily tag the index:

Code:

Orig:index1 originalline1
Tran:index1 translationline1
Orig:index2 originalline2
Tran:index2 translationline2
[...]
Orig:index6 originalline6
Tran:index6 translationline6

Step 2: Swap any "Tran:" line if an "Orig:" is below:

Code:

Orig:index1 originalline1
Orig:index2 originalline2
[...]
Orig:index6 originalline6
Tran:index1 translationline1
Tran:index2 translationline2
[...]
Tran:index6 translationline6

Step 3: Remove the "Orig:" + "Tran:".

... But it all depends on your actual text...
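If a script is an option, the three steps above could be sketched in Python roughly like this. It rests on a big assumption that is not in the thread: that each line is `index<TAB>original<TAB>translation`. The real text will probably need a different rule for telling the two languages apart. The function name `split_interlinear` is made up, and Step 2 is done by filtering on the temporary tag rather than by repeated swapping.

```python
import re


def split_interlinear(verse: str) -> str:
    """Hypothetical helper following the three steps above.

    ASSUMPTION: each line is "index<TAB>original<TAB>translation".
    Lines that do not match are left untagged and dropped by this sketch.
    """
    # Step 1: split each line into a tagged "Orig:" and "Tran:" line.
    tagged = re.sub(
        r"^([^\t\n]+)\t([^\t\n]+)\t([^\t\n]+)$",
        r"Orig:\1 \2\nTran:\1 \3",
        verse,
        flags=re.MULTILINE,
    )
    lines = tagged.splitlines()

    # Step 2: gather all "Orig:" lines ahead of all "Tran:" lines,
    # keeping the index order within each group.
    originals = [line for line in lines if line.startswith("Orig:")]
    translations = [line for line in lines if line.startswith("Tran:")]

    # Step 3: remove the temporary tags; a blank line separates the blocks.
    tag_len = len("Orig:")  # "Tran:" has the same length
    stripped = [line[tag_len:] for line in originals]
    stripped.append("")
    stripped.extend(line[tag_len:] for line in translations)
    return "\n".join(stripped)


sample = "1\talpha\tone\n2\tbeta\ttwo"
print(split_interlinear(sample))
```

With the sample above, the two original lines come out first, then a blank line, then the two translation lines, each still carrying its index.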
