Regex examples - Page 45

leschek · 09-16-2020, 07:09 AM

I hope this is the correct topic to post in.

In my language we use one-letter prepositions and conjunctions (a, i, o, u, k, s, v, z), which shouldn't be left at the end of a line. Here is an example from a book I am trying to "epubize":
"spatřil člun a v tom člunu" (translation: "he saw a boat and in that boat").
What I want is to find the letters "a" and "v" and replace the space after each of them with a no-break space, to connect them to the following word. I have this regex (which I found somewhere)

Code:

\s([aiouksvz])\s

for searching, but it finds only the first letter and then skips the second one. I tried changing the search direction in Sigil to "up", but it doesn't help. I guess there must be some problem with the regex I'm using.

I also tried this example, and again it finds only every second letter:

Code:

<p>some words a s i k v some words</p>

It seems there is some problem with the space between the letters. When I double it:

Code:

<p>some words a  s  i  k  v  some words</p>

the search works.

davidfor · 09-16-2020, 08:00 AM

I think you want:

Code:

\b([aiouksvz])\s

That will pick a single letter followed by whitespace.
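The skipping in the original pattern happens because each match consumes the whitespace on both sides, so the space after "a" is no longer available as the leading whitespace for "s". A word boundary is zero-width and consumes nothing. Here is a minimal sketch using Python's `re` module (an assumption made for illustration; Sigil uses PCRE, but both engines behave the same way on this point):

```python
import re

text = "<p>some words a s i k v some words</p>"

# Original pattern: whitespace on both sides is part of every match,
# so the space after "a" cannot also be the space before "s".
consuming = re.findall(r"\s([aiouksvz])\s", text)

# Suggested pattern: \b is zero-width, only the trailing space is consumed.
boundary = re.findall(r"\b([aiouksvz])\s", text)

print(consuming)  # ['a', 'i', 'v']
print(boundary)   # ['a', 's', 'i', 'k', 'v']
```

Doubling the spaces works for the same reason: every letter then has its own whitespace on each side.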

leschek · 09-16-2020, 08:15 AM

Thank you, it works partially, but it also finds parts of HTML code such as

Code:

<a href...

and words ending with one of the searched characters where the previous character is from a non-English alphabet, such as nás, při, etc.

DiapDealer · 09-16-2020, 12:36 PM

Quote:

Originally Posted by leschek

Thank you, it works partially, but it also finds parts of HTML code such as

Code:

<a href...

and words ending with one of the searched characters where the previous character is from a non-English alphabet, such as nás, při, etc.

I'm tackling your exceptions in reverse order.

To make \b honor Unicode code points, turn on the Unicode Character Properties flag with (*UCP).

So the above:

Code:

\b([aiouksvz])\s

becomes:

Code:

(*UCP)\b([aiouksvz])\s

This should exclude the 's' and the 'i' at the end of your 'nás' and 'při' examples.

To make the expression ignore the character class matches that immediately follow an angled (x)html bracket (<) you can use a negative lookbehind. Something like:

Code:

(*UCP)(?<!\<)\b([aiouksvz])\s

should ignore the 'a' and 'i' characters used in (x)html's anchor and italic tags.

The (*UCP) flag and the (?<!\<) lookbehind are not capturing groups, despite their appearance. So the replacement you're looking for will still be something like:

Code:

\1&nbsp;
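The two fixes can be tried outside Sigil with a rough Python sketch. Python's `re` has no `(*UCP)` verb, so this sketch assumes `re.ASCII` as a stand-in for PCRE without `(*UCP)`, and Python's default Unicode-aware matching as a stand-in for PCRE with it. The sample text is made up for the demonstration.

```python
import re

text = '<p>nás při <a href="x">člun</a> a v tom člunu</p>'

# re.ASCII imitates PCRE without (*UCP): "á" and "ř" are not word
# characters, so \b fires inside "nás" and "při".
ascii_hits = re.findall(r"\b([aiouksvz])\s", text, flags=re.ASCII)

# Python str patterns are Unicode-aware by default, comparable to (*UCP).
unicode_hits = re.findall(r"\b([aiouksvz])\s", text)

# The negative lookbehind rejects a letter that directly follows "<".
pattern = re.compile(r"(?<!<)\b([aiouksvz])\s")
fixed = pattern.sub(r"\1&nbsp;", text)

print(ascii_hits)    # ['s', 'i', 'a', 'a', 'v']
print(unicode_hits)  # ['a', 'a', 'v']
print(fixed)
```

In the last result only the standalone "a" and "v" receive `&nbsp;`, and the `<a href` tag is left untouched.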

leschek · 09-16-2020, 05:53 PM

Quote:

Originally Posted by DiapDealer

Code:

(*UCP)(?<!\<)\b[^<]([aiouksvz])\s

So the replacement you're looking for will still be something like:

Code:

\1&nbsp;

Thank you for your time and explanation, but unfortunately it's only partially working again. It ignores the HTML code (a href, i), which is great, but it doesn't find all the letters I need. For example, in the sentence "spatřil člun a v tom člunu" it should find the letters "a" and "v", but it only finds "a" and ignores "v". It also finds some two-letter words such as "na" and "do", or in English "as" and "is".

DiapDealer · 09-16-2020, 08:13 PM

Apologies... I pasted the wrong full expression. It had an extraneous (and incorrect) negative character class that I was testing out.

This is the one that works for me for all of your examples so far:

Code:

(*UCP)(?<!\<)\b([aiouksvz])\s
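The extraneous `[^<]` explains both symptoms. It consumes one character after the word boundary, so the pattern needs two characters before the whitespace. That means it matches two-letter words, and it only finds a one-letter word when it can start on the space in front of it. Here is a small Python sketch (Python's `re` is assumed as a stand-in for PCRE, and the sample sentence is extended with English words for illustration):

```python
import re

text = "spatřil člun a v tom člunu as is done"

# Wrongly pasted pattern: [^<] swallows one character after the boundary.
wrong = [m.group(0) for m in re.finditer(r"\b[^<]([aiouksvz])\s", text)]

# Corrected pattern: the lookbehind checks for "<" without consuming anything.
right = re.findall(r"(?<!<)\b([aiouksvz])\s", text)

print(wrong)  # [' a ', 'as ', 'is ']
print(right)  # ['a', 'v']
```

The match " a " uses up the space that "v" would need in front of it, which is why "v" is skipped.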

leschek · 09-17-2020, 06:05 AM

Quote:

Originally Posted by DiapDealer

This is the one that works for me for all of your examples so far:

Code:

(*UCP)(?<!\<)\b([aiouksvz])\s

Thank you very much. I tried it on a few pages and it seems it's working as expected. Awesome.

ShdwMnrch · 09-18-2020, 09:54 AM

Hello,
I need help with a regex. I have lines like these:

Code:

<p>– Wahahahaha!</p>

<p>Grasha got drunk, raged and got on the table.</p>

<p>– Wahahahaha! This is a celebration party! Drink and sing guys!</p>

The dialogue lines start with "<p>– ", but I want to wrap the dialogue in 「」 characters. I had no problem replacing "<p>–" with "<p>「", but I am having problems replacing "</p>" only when "<p>– " is present at the beginning of the line.

I have tried the regex search of:

Code:

(?<=<p>– .*)<\/p>

It should find only the "</p>" of the 1st and 3rd lines, ignoring the 2nd line. But it seems Sigil's regex doesn't support positive lookbehind, and it returns nothing. Please help with any workaround. Thanks!

BeckyEbook · 09-18-2020, 12:09 PM

Try (as long as I understand your problem correctly):

Code:

(?<=<p>– )(.+)</p>

Replace:

Code:

\1」</p>
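This search works because the lookbehind `(?<=<p>– )` has a fixed length. The variable part of the line is moved into a capture group and written back by the replacement. A minimal Python sketch of the same search and replace (Python's `re` is an assumption here; Sigil itself uses PCRE):

```python
import re

lines = [
    "<p>– Wahahahaha!</p>",
    "<p>Grasha got drunk, raged and got on the table.</p>",
    "<p>– Wahahahaha! This is a celebration party! Drink and sing guys!</p>",
]

# Fixed-length lookbehind, then capture the dialogue text before </p>.
pattern = re.compile(r"(?<=<p>– )(.+)</p>")

converted = [pattern.sub(r"\1」</p>", line) for line in lines]

for line in converted:
    print(line)
```

Only the first and third lines gain the closing 」. The second line has no "<p>– " prefix and is left as it was.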

DiapDealer · 09-18-2020, 12:18 PM

Sigil's PCRE regex engine certainly supports positive lookbehinds. It just doesn't support variable-length lookbehinds--positive or negative. It's a known limitation of the PCRE engine.

Use \K to simulate a variable-length lookbehind:

Code:

<p>–( .*?)\K<\/p>

Make sure "Minimal Match" is unchecked when using the above expression.

More on the use of \K here: https://www.regular-expressions.info/keep.html
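The fixed-width restriction can also be seen outside Sigil. Python's built-in `re` module rejects a lookbehind containing `.*` as well. It has no `\K`, so the sketch below assumes a capture group that is written back as a stand-in for `\K`. This is not the PCRE expression itself, only the same idea.

```python
import re

lines = (
    "<p>– Wahahahaha!</p>\n"
    "<p>Grasha got drunk, raged and got on the table.</p>\n"
    "<p>– Wahahahaha! This is a celebration party! Drink and sing guys!</p>"
)

# A lookbehind containing .* has no fixed width, so compiling it fails.
try:
    re.compile(r"(?<=<p>– .*)</p>")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Stand-in for \K: capture everything before </p> and put it back.
result = re.sub(r"(<p>– .*?)</p>", r"\1」</p>", lines)

print(lookbehind_ok)  # False
print(result)
```

With `\K` in PCRE the replacement would only need to be `」</p>`, because the text matched before `\K` is dropped from the reported match.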

ShdwMnrch · 09-18-2020, 12:22 PM

Quote:

Originally Posted by BeckyEbook

Try (as long as I understand your problem correctly):

Code:

(?<=<p>– )(.+)</p>

Replace:

Code:

\1」</p>

It worked. I didn't think of tinkering with the replace value. Thank you!

DiapDealer · 09-18-2020, 12:30 PM

Not sure why it looks like there's an extra space in my above expression. It seems to copy and work fine, though. *shrug*

ShdwMnrch · 09-18-2020, 12:31 PM

Quote:

Originally Posted by DiapDealer

Sigil's PCRE regex engine certainly supports positive lookbehinds. It just doesn't support variable-length lookbehinds--positive or negative. It's a known limitation of the PCRE engine.

Use \K to simulate a variable-length lookbehind:

Code:

<p>–( .*?)\K<\/p>

Make sure "Minimal Match" is unchecked when using the above expression.

More on the use of \K here: https://www.regular-expressions.info/keep.html

I see. I just started learning how to use regex, so that helps a lot. Thank you.

BillPearl · 09-18-2020, 05:06 PM

\[\s][a,i,o,u,k,s,v,zç]\[\s]

will handle the '<a ' case: it finds a space before and after the letter. You may want to run this with just one letter at a time using Replace All.

hobnail · 10-12-2020, 03:34 PM

I don't understand why this isn't working; my search string is:

<a id="Page_([xvi]+)|([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)|([\d]+)\]"></a>

When the file contains

<a id="Page_iv" class="x-ebookmaker-pageno" title="[iv]"></a>

and I click on the Find button, it highlights only

<a id="Page_i

What's wrong with my regexp?
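A likely cause is operator precedence. An unparenthesised `|` splits the whole expression, so the search string is read as three separate alternatives, and the first one ends right after `([xvi]+)`. The Python sketch below (Python's `re` is assumed as a stand-in for Sigil's PCRE) shows the effect. It also shows one possible fix that wraps each alternation in a non-capturing group; the variable names are only illustrative.

```python
import re

html = '<a id="Page_iv" class="x-ebookmaker-pageno" title="[iv]"></a>'

# Original search string: the top-level | creates three alternatives.
original = re.compile(
    r'<a id="Page_([xvi]+)|([\d]+)" class="x-ebookmaker-pageno" '
    r'title="\[([xvi]+)|([\d]+)\]"></a>'
)
partial = original.search(html).group(0)

# Possible fix: keep each alternation inside its own (?:...) group.
grouped = re.compile(
    r'<a id="Page_(?:[xvi]+|\d+)" class="x-ebookmaker-pageno" '
    r'title="\[(?:[xvi]+|\d+)\]"></a>'
)
full = grouped.search(html).group(0)

print(partial)  # <a id="Page_iv
print(full == html)  # True
```

Python's greedy `+` takes both numeral letters here. A lazy quantifier, for example with Sigil's "Minimal Match" option checked, would stop after the first "i".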

