MobileRead Forums > E-Book Software > Sigil

Old 03-23-2021, 04:19 AM   #1
BKh
Zealot
Posts: 107
Karma: 1000
Join Date: Mar 2011
Device: Kindle
More than 9 back references in regex

Hello!

I'm processing a book where I need to have more than 9 back references. Starting with \10, Sigil just treats that as the text to replace with instead of part of the 10th part of the search expression (please forgive me if I'm not using the proper terms).

I found online that some regex systems can use $10, $11, etc., but that doesn't seem to work either.

At this point my only option seems to be to combine the book into a single text file and work with it in Notepad++. Not the worst option in the world, but I thought I would ask here first.

Thanks!!
Old 03-23-2021, 05:20 AM   #2
Doitsu
Grand Sorcerer
Posts: 5,725
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by BKh
I found online that some regex systems can use $10, $11, etc., but that doesn't seem to work either.
Sigil uses the PCRE2 engine for regex searches, which supports named subpatterns.

For example, if your book contains the string:

Code:
abcdefghijkl
you could use the following expression to capture each letter:

Find:
Code:
(?<a>a)(?<b>b)(?<c>c)(?<d>d)(?<e>e)(?<f>f)(?<g>g)(?<h>h)(?<i>i)(?<j>j)(?<attr>k)(?<l>l)
Replace:
Code:
\g{f}\g{e}\g{d}\g{c}\g{b}\g{a}\g{j}\g{l}\g{attr}
You'll end up with:

Code:
fedcbajlk
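
For readers who want to try the idea outside Sigil, here is a minimal sketch of the same named-group reordering in Python. It is an illustration only: Python's `re` module spells named groups `(?P<name>...)` and references them as `\g<name>`, which differs from the `(?<name>...)` and `\g{name}` forms shown above for Sigil. The variable names are made up for this example.

```python
import re

# The same twelve letters and group names as the example above
# (the group for "k" is named "attr").
letters = "abcdefghijkl"
names = list("abcdefghij") + ["attr", "l"]

# Builds "(?P<a>a)(?P<b>b)...(?P<attr>k)(?P<l>l)" in Python's syntax.
find = "".join(f"(?P<{name}>{ch})" for name, ch in zip(names, letters))

# Python references named groups with \g<name> in the replacement.
replace = r"\g<f>\g<e>\g<d>\g<c>\g<b>\g<a>\g<j>\g<l>\g<attr>"

result = re.sub(find, replace, letters)
print(result)
```

Groups g, h and i are captured but never referenced in the replacement, so they are dropped from the result.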

Last edited by Doitsu; 03-23-2021 at 05:24 AM.
Old 03-23-2021, 08:03 AM   #3
Turtle91
A Hairy Wizard
Posts: 3,346
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
You could also process the book with multiple passes.
Old 03-23-2021, 09:59 AM   #4
DiapDealer
Grand Sorcerer
Posts: 28,557
Karma: 204127028
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
All available documentation I can find suggests that PCRE should be able to easily support both \1 through \9 and \10 through \99 backreferences, but clearly Sigil's bundled PCRE does not. But it seems the PCRE bundled with Sigil DOES allow for the \g{n} backreference syntax which can exceed the 9 backreference limit.

String: <p>0123456789abc</p>
Find: (\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)([a-z])([a-z])([a-z])
Replace: \1\2\3\4\5\6\7\8\9\g{10}\g{11}\g{12}\g{13}

The bottom line seems to be that anything other than a single digit (0-9) after the backslash is ambiguous. It could be a backreference, or it could be character code (or an octal number). For completely unambiguous double-digit backreferences, always use the \g{nn} syntax.
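
The same ambiguity can be seen in other engines. As a rough sketch, Python's `re` module handles it with `\g<n>` (its own spelling, not Sigil's `\g{n}`), which marks exactly where the group number ends. The variable names below are invented for the illustration.

```python
import re

# The sample string and pattern from above: ten digit groups, then three letter groups.
text = "<p>0123456789abc</p>"
find = r"(\d)" * 10 + r"([a-z])" * 3

# \g<10> .. \g<13> are unambiguous references to groups 10-13 (Python spelling).
reversed_tail = re.sub(find, r"\g<13>\g<12>\g<11>\g<10>", text)

# \g<1>0 means "group 1, then a literal 0"; a bare \10 would ask for group 10.
group_then_zero = re.sub(r"(\d)", r"\g<1>0", "7")

print(reversed_tail)
print(group_then_zero)
```

The second substitution shows why the braces (or angle brackets) matter: without a delimiter there is no way to say "group 1 followed by the digit 0".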

From Sigil's src/PCRE/SPCRE.cpp:

Code:
// The maximum number of catpures that we will allow.
const int PCRE_MAX_CAPTURE_GROUPS = 30;
So the number of backreferences will also be capped at 30 (provided 30 groups were, in fact, captured), whether accessed by name or number via the \g{} syntax.

Last edited by DiapDealer; 03-23-2021 at 01:26 PM.
Old 03-23-2021, 03:58 PM   #5
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by BKh
I'm processing a book where I need to have more than 9 back references. Starting with \10 Sigil just treats that as the text to replace with instead of part of the 10th part of the search expression [...].
What is the problem you're trying to solve? Can you give some before/after code examples?

Usually when you reach beyond 9 capture groups in a single regex... there's some sort of underlying issue that can be solved more efficiently.
Old 03-24-2021, 10:58 PM   #6
BKh
Zealot
Quote:
Originally Posted by Tex2002ans
What is the problem you're trying to solve? Can you give some before/after code examples?

Usually when you reach beyond 9 capture groups in a single regex... there's some sort of underlying issue that can be solved more efficiently.
Absolutely! And if my multiple attempts to learn Python had ever succeeded I wouldn't be in this situation.

I have an interlinear text with verses where the six lines of text are interspersed, and each line has an index number.

<p>
index1 originalline1 translationline1
index2 originalline2 translationline2
index3 originalline3 translationline3
index4 originalline4 translationline4
index5 originalline5 translationline5
index6 originalline6 translationline6</p>

And what I need is

<p>
index1 originalline1
index2 originalline2
index3 originalline3
index4 originalline4
index5 originalline5
index6 originalline6

index1 translationline1
index2 translationline2
index3 translationline3
index4 translationline4
index5 translationline5
index6 translationline6</p>

But labeling the segments works just fine.
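
As an illustration of what labeling the segments can look like, here is a hedged Python sketch with 18 named groups (index, original, translation for each of six lines). It assumes every field is a single token with no spaces, which real verse text will not satisfy, and it uses Python's `(?P<name>...)` and `\g<name>` spelling rather than Sigil's. The sample text and group names are invented.

```python
import re

# Hypothetical sample: each field is one whitespace-free token.
text = """<p>
1 alpha one
2 beta two
3 gamma three
4 delta four
5 epsilon five
6 zeta six</p>"""

# One line is "index original translation"; n numbers the group names.
line = r"(?P<i{n}>\S+) (?P<o{n}>\S+) (?P<t{n}>\S+)"
find = r"<p>\n" + r"\n".join(line.format(n=n) for n in range(1, 7)) + r"</p>"

# Replacement: all index+original pairs, a blank line, then index+translation pairs.
originals = "\n".join(rf"\g<i{n}> \g<o{n}>" for n in range(1, 7))
translations = "\n".join(rf"\g<i{n}> \g<t{n}>" for n in range(1, 7))
replace = "<p>\n" + originals + "\n\n" + translations + "</p>"

result = re.sub(find, replace, text)
print(result)
```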

Thanks everyone for all the help!!!
Old 03-25-2021, 03:02 AM   #7
Tex2002ans
Wizard
Quote:
Originally Posted by BKh
I have an interlinear text with verses where the six lines of text are interspersed, and each line has an index number.

Code:
<p>index1 originalline1 translationline1
index2 originalline2 translationline2
index3 originalline3 translationline3
index4 originalline4 translationline4
index5 originalline5 translationline5
index6 originalline6 translationline6</p>
1. Can you show a few actual examples out of your book?

2. Are the index numbers chronological?

3. Are the different languages marked up in any way? (How can you tell which one is language1 + language2?)

Step 1 would be infinitely easier if your stuff is actually marked up with proper lang attributes:

Quote:
100 <span class="english" lang="en" xml:lang="en">This is an example verse</span> <span class="spanish" lang="es" xml:lang="es">Este es un verso de ejemplo</span>
101 <span class="english" lang="en" xml:lang="en">[...]</span> <span class="spanish" lang="es" xml:lang="es">[...]</span>
Note: To format your code nicer, you can put your code in between [CODE][/CODE] tags. In MobileRead's "Advanced Editor", this is also available as the # button.

Note #2: What I'm thinking is a multi-pass approach. Something like:

Step 1: Split the original+translated and temporarily tag the index:

Code:
Orig:index1 originalline1
Tran:index1 translationline1
Orig:index2 originalline2
Tran:index2 translationline2
[...]
Orig:index6 originalline6
Tran:index6 translationline6
Step 2: Swap any "Tran:" line if an "Orig:" is below:

Code:
Orig:index1 originalline1
Orig:index2 originalline2
[...]
Orig:index6 originalline6
Tran:index1 translationline1
Tran:index2 translationline2
[...]
Tran:index6 translationline6
Step 3: Remove the "Orig:" + "Tran:".

... But it all depends on your actual text...
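
The three steps could be sketched in Python roughly as follows. This is only an outline under a strong assumption: each line is exactly "index original translation" with one token per field. The sample text and variable names are made up, and the patterns would need adapting to the real book (and to Sigil's replacement syntax if run there).

```python
import re

# Hypothetical sample body of one <p>: "index original translation" per line.
body = """1 alpha one
2 beta two
3 gamma three"""

# Step 1: split each line into a tagged "Orig:" line and a tagged "Tran:" line.
tagged = re.sub(r"(?m)^(\S+) (\S+) (\S+)$", r"Orig:\1 \2\nTran:\1 \3", body)

# Step 2: swap any "Tran:" line that sits directly above an "Orig:" line,
# repeating the pass until nothing is left to swap.
swaps = 1
while swaps:
    tagged, swaps = re.subn(r"(?m)^(Tran:.*)\n(Orig:.*)$", r"\2\n\1", tagged)

# Step 3: remove the temporary tags.
result = re.sub(r"(?m)^(?:Orig|Tran):", "", tagged)
print(result)
```

Because only a "Tran:" line and an "Orig:" line are ever exchanged, the lines of each language keep their original order.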