Regex examples - Page 47

1v4n0 · 01-08-2022, 09:43 AM

Is there a regex that finds all the letters and only the letters, including accented ones and, just as an example, š č ć ž đ? Or do I have to manually add them to the range, as in [a-zèòéùàšđčćž]? I always risk leaving some aside.

Thanks.

Doitsu · 01-08-2022, 09:52 AM

Quote:

Originally Posted by 1v4n0

Is there a regex that finds all the letters and only the letters, including accented ones and, just as an example, š č ć ž đ?

\p{Ll} will find all lower case Unicode letters.
\p{Lu} will find all upper case Unicode letters.

1v4n0 · 01-08-2022, 10:55 AM

Quote:

Originally Posted by Doitsu

\p{Ll} will find all lower case Unicode letters.
\p{Lu} will find all upper case Unicode letters.

And \p{L} gives all Unicode letters, regardless of case. Thanks! I can't karma you but if I could I would
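
For anyone who wants to check what these classes cover outside Sigil, here is a rough sketch in Python. Python's built-in re module does not understand \p{L}, so the sketch looks up the Unicode general category directly instead of using a regex. The function name and the sample text are made up for illustration.

```python
import unicodedata


def letters_only(text, category_prefix="L"):
    """Keep only characters whose Unicode general category starts with
    category_prefix: "L" = any letter, "Ll" = lower case, "Lu" = upper case."""
    # Accented letters may be stored as base letter + combining mark;
    # NFC composes them into single code points where possible.
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if unicodedata.category(ch).startswith(category_prefix)
    )


sample = "Šđ 42 čćž, È!"
all_letters = letters_only(sample)          # roughly what \p{L} selects
lower_letters = letters_only(sample, "Ll")  # roughly what \p{Ll} selects
upper_letters = letters_only(sample, "Lu")  # roughly what \p{Lu} selects
```

Digits, spaces and punctuation fall into other categories, so none of the three results contain them.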

stumped · 02-06-2022, 05:06 AM

I have been using this wonderful code for years, but I confess I still don't know how it works.
Will some kind person talk me through it, symbol by symbol, please?
Purpose: remove a href (no replace needed)
</?a ?([^>]+)?>

BeckyEbook · 02-06-2022, 06:54 AM

@stumped:
If you want to thoroughly understand regular expressions, not just your example, I recommend that you have a look here:
https://regex101.com/r/CXf9WD/1

stumped · 02-06-2022, 07:29 AM

I don't use regex often enough to retain a thorough understanding. I know enough to write simple formulas for find and replace within ebooks, but this one is too dense to follow, even via the previous link.

I conceptualise it as needing to find stuff which begins <a, then delete up to and including a matching /a>.

When I blindly apply it in Sigil, it has a 100% success rate in stripping all the <a from entire books automatically, so it may be catering for some tricky edge cases?

BeckyEbook · 02-06-2022, 07:34 AM

You think well.
In short, it searches for <(possibly /)a (possibly anything)>
So it will search for all opening anchors with any existing attributes and all closing anchors.

DiapDealer · 02-06-2022, 09:06 AM

I can break it down, but it will be a little later.

DiapDealer · 02-06-2022, 11:43 AM

</?a ?([^>]+)?>

The question marks are used to mark what comes before as optional.

So </?a is saying that the slash before the 'a' tag is optional. That means it matches both "<a" and "</a".

Then comes the space, which is also made optional, meaning it will match "<a", or "<a ".

The ([^>]+)? is a little more tricky, but not terribly so. The parentheses are used to group everything before the last question mark. Meaning the whole of what's inside the parentheses is optional.

"[^>]" is a common character class when trying to parse html tags. It simply means that it will match any character that is not (^) the greater-than character (>). It's used to ensure that the expression does not get greedy and grab content beyond the ending of the current tag (>). The + is for repetition. + is one or more times, and * means 0 or more times.

The use of + in this case is why the grouping parentheses and the question mark (to make the whole thing optional) are necessary. In this particular case, the optional space character and the ([^>]+)? could be replaced with simply [^>]* (meaning match all characters (except >) zero or more times, instead of all characters (except >) one or more times... optionally).

Then match the closing > character.

</?a ?([^>]+)?>

should be synonymous with:

</?a[^>]*>

for the stripping of all opening and closing anchor tags (as well as any self-closing anchor tags of the variety: <a id="anchor_tag_1" />)

But no need to change what works. I included the slight simplification for explanatory purposes.
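
The claim that the two expressions behave the same can be checked with a quick sketch like the one below. It uses Python's re module rather than Sigil's PCRE engine, and the sample markup is invented, so treat it as an illustration rather than proof for every book.

```python
import re

ORIGINAL = re.compile(r'</?a ?([^>]+)?>')
SIMPLIFIED = re.compile(r'</?a[^>]*>')
# Not from the thread: a stricter variant that insists the tag name ends after "a".
STRICTER = re.compile(r'</?a\b[^>]*>')

sample = (
    '<p>See <a href="ch2.xhtml#n1">note 1</a>, '
    '<a id="anchor_tag_1" /> and <a>bare</a>.</p>'
)

# Replacing with an empty string strips the tags and keeps the link text.
stripped_original = ORIGINAL.sub('', sample)
stripped_simplified = SIMPLIFIED.sub('', sample)

# Both thread patterns also match other tags whose name starts with "a".
abbr = '<abbr>etc.</abbr>'
abbr_original = ORIGINAL.sub('', abbr)
abbr_stricter = STRICTER.sub('', abbr)
```

On this sample both patterns strip the opening, closing and self-closing anchors and leave the text alone. One thing to be aware of: because neither pattern requires the tag name to end after the "a", both would also strip tags such as <abbr>. If a book contains those, adding a word boundary after the "a", as in the STRICTER pattern, is one possible way to narrow the match.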

BillPearl · 02-24-2022, 10:12 AM

Perhaps this may help you on your way.
Recently had to find missing first " of a pair. Finally came up with this:

Find strings with a missing first quote
... calibre3">((?:\\"|[^"])*")</p>
^^^^^--------tags ------^^^ bracket string

Search: ...calibre3">([")(?: (?= (\\?))\2.)*?\1</p> -between tags (no space after ?: ) ?

Replace: "\1 <-- (There's a space after the '1') -adds quote to front end of \1 (the captured text)

Did not work in all cases but this got the rest of the mis-matched pairs.
3">((?:\\"|[^"])*")[. , ?] -between tag and punctuation mark(s)

Quote:

Originally Posted by meme

I'd like to see if I can collect Regular Expressions (PCRE format as introduced in Sigil 0.5.0) used for common or difficult issues, and maybe add them to the FAQ, etc. Partly so I can have a list to refer to when needed, but also to collect a lot of what's probably already been mentioned in this forum. And maybe to find out if there isn't a way to do a replacement that's needed.

For instance, is there a regex to do other types of replacement but only inside body tags?

Is there one only for the actual text - words not part of a tag name or attribute? Words that are only part of a tag name or attribute?

If you have any suggestions for the above cases, or any other useful Regex expressions please post them.

BillPearl · 02-24-2022, 10:45 AM

Perhaps this may help you on your way.
Recently had to find missing first " of a pair. Finally came up with this:

Find strings with a missing first quote
calibre3">((?:\\"|[^"])*")</p>
^^^^^--------tags ------^^^ bracket string

calibre3">"\1</p> -adds quote to front end of \1 (the captured text)
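
A small sketch of how this find and replace pair behaves, using Python's re module instead of Sigil. The three sample paragraphs are invented, and the class name calibre3 is simply the one from the post.

```python
import re

# Find: text after calibre3"> that contains no quote except one right before </p>.
FIND = re.compile(r'calibre3">((?:\\"|[^"])*")</p>')
# Replace: put the missing opening quote in front of the captured text.
REPLACE = r'calibre3">"\1</p>'

paragraphs = [
    '<p class="calibre3">I never said that."</p>',   # opening quote missing
    '<p class="calibre3">"I never said that."</p>',  # already balanced
    '<p class="calibre3">No dialogue here.</p>',     # no quotes at all
]
fixed = [FIND.sub(REPLACE, p) for p in paragraphs]
```

Only the first paragraph changes. A paragraph with further quotes in the middle (for example a speech tag followed by more dialogue) is not matched by this pattern, so those would need separate handling.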

Quote:

Originally Posted by meme

I'd like to see if I can collect Regular Expressions (PCRE format as introduced in Sigil 0.5.0) used for common or difficult issues, and maybe add them to the FAQ, etc. Partly so I can have a list to refer to when needed, but also to collect a lot of what's probably already been mentioned in this forum. And maybe to find out if there isn't a way to do a replacement that's needed.

For instance, is there a regex to do other types of replacement but only inside body tags?

Is there one only for the actual text - words not part of a tag name or attribute? Words that are only part of a tag name or attribute?

If you have any suggestions for the above cases, or any other useful Regex expressions please post them.

Ashjuk · 04-04-2022, 09:45 AM

I have a book that contains a lot of complex IDs that I am trying to replace with simple ones via a regex find and replace.

The IDs have strings like this - F8901-6c93446a08e5490e8e6b029bcac88fe9

I know how to use [a-z]+ and [0-9]+ but I don't know how to relate that to these complex strings. Unfortunately they appear to be completely random without a common start or end character.

I have tried a few examples that I found by searching on-line, but nothing seems to work for me.

All help gratefully received.

BeckyEbook · 04-04-2022, 10:05 AM

Start with something similar:

Code:

id="[A-Fa-f0-9-]+"

Ashjuk · 04-04-2022, 10:43 AM

Quote:

Originally Posted by BeckyEbook

Start with something similar:

Code:

id="[A-Fa-f0-9-]+"

Thank you so much Becky, that worked great.

I did not realise you could combine alphanumeric characters within one set of brackets.

I had been trying [A-z]+[a-z]+[0-9]+ without success.
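
To illustrate the difference, here is a minimal sketch in Python (not Sigil) that finds the hex-style IDs with the single bracket set and swaps each one for a short sequential name. The function name, the "id" prefix and the sample markup are assumptions made up for this example.

```python
import re

ID_ATTR = re.compile(r'id="([A-Fa-f0-9-]+)"')


def simplify_ids(html, prefix="id"):
    """Replace each hex-style id value with prefix + a running number.
    Returns the new markup and the old-to-new mapping."""
    mapping = {}

    def rename(match):
        old = match.group(1)
        if old not in mapping:
            mapping[old] = f"{prefix}{len(mapping) + 1}"
        return f'id="{mapping[old]}"'

    return ID_ATTR.sub(rename, html), mapping


sample = (
    '<p id="F8901-6c93446a08e5490e8e6b029bcac88fe9">One</p>'
    '<p id="intro">Two</p>'
)
new_html, mapping = simplify_ids(sample)

# The earlier attempt demands letters, then letters, then digits, in that
# order, so it cannot cover an ID where they are mixed and hyphenated.
earlier_attempt = re.fullmatch(
    r'[A-z]+[a-z]+[0-9]+', 'F8901-6c93446a08e5490e8e6b029bcac88fe9'
)
```

A single set like [A-Fa-f0-9-]+ accepts any of its characters in any order, which is why it works here. Two cautions: an ordinary ID made only of the letters a to f (such as "face") would also match, and any links that point at the old IDs would need the same renaming, which is what the returned mapping is meant to help with.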

theducks · 04-04-2022, 08:51 PM

id=".+?"
would find ID= with any ID value between quotes
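
The question mark after .+ matters here. A short Python sketch (sample tag invented) showing the lazy form next to the greedy form and a negated character class:

```python
import re

tag = '<p id="note-1" class="calibre3">text</p>'

# Lazy: stop at the first closing quote.
lazy = re.search(r'id=".+?"', tag).group(0)
# Greedy: run on to the last quote on the line.
greedy = re.search(r'id=".+"', tag).group(0)
# Negated class: anything except a quote, so it cannot overrun.
negated = re.search(r'id="[^"]+"', tag).group(0)
```

Without the ?, the match swallows the class attribute as well. The [^"]+ form is the same idea as the [^>] class explained earlier in the thread and gives the same result as the lazy version on this sample.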
