Old 04-15-2021, 07:43 AM   #1
vijer
Regex to remove html tags

I've been searching for a solution for hours, but haven't found any examples that help.

I want to search the file and remove all instances of <a id="pageXXX"></a> where XXX is the page number.

I have tried

(^<a id="page)(.*:?)("></a>)

(^<a id=\\"page)(.*:?)(\\"></a>)

(^<a id="page)([0-9]+)("></a>)

(^<a id=\\"page)([0-9]+)(\\"></a>)

What am I missing?
Old 04-15-2021, 08:00 AM   #2
vijer
Found the answer.

(<a id=\"page)(.*:?)(\"></a>)
Old 04-15-2021, 08:04 AM   #3
BeckyEbook
Code:
<a id="page\d+"></a>
I recommend the website https://regex101.com/
You will understand everything.
Old 04-15-2021, 08:56 AM   #4
DiapDealer
Quote:
Originally Posted by vijer View Post
Found the answer.

(<a id=\"page)(.*:?)(\"></a>)
What's the colon for?
Code:
(.*:?)
Are there really optional colons in the page number that need to be accounted for?

I'd probably use something like:

Code:
<a id="page\d+"></a>
or

Code:
<a id="page[^>]+"></a>
It's probably not an issue here, but .* is greedy, and even the lazy form (.*?) can cause problems when searching HTML, because either one will happily run past a > and into the next tag. [^>]* or [^>]+ are typically safer if you need to match multiple unknown characters but need to guarantee the match never spans multiple tags.
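To make the difference concrete, here is a minimal Python sketch. Python's re module stands in for Sigil's own regex engine, and the sample line (including the "pagetop" link) is made up for illustration. It compares .*, .*? and [^>]+ on a line where an unrelated link comes before a real page anchor.

```python
import re

# Hypothetical sample: a normal link whose id starts with "page",
# followed by one of the empty page anchors we actually want to match.
html = '<a id="pagetop" href="#top">Top</a> text <a id="page12"></a>'

# Greedy: .* runs to the end of the line, then backs up to the last "></a>.
greedy = re.findall(r'<a id="page.*"></a>', html)

# Lazy: .*? stops at the first "></a> it can reach, which is still
# inside the second tag, so the match spans both tags.
lazy = re.findall(r'<a id="page.*?"></a>', html)

# Negated class: [^>]+ can never cross a >, so the match stays in one tag.
safe = re.findall(r'<a id="page[^>]+"></a>', html)

print(greedy)
print(lazy)
print(safe)
```

On this sample both dot-based patterns grab the whole line, while the [^>]+ version only picks up the empty page anchor.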
Old 04-15-2021, 09:07 AM   #5
Pablo
Disregard
Old 04-15-2021, 10:16 AM   #6
JSWolf
Search: <a id="page[0-9]+"></a>
Replace: (leave empty)
Old 04-15-2021, 03:25 PM   #7
Tex2002ans
Quote:
Originally Posted by vijer View Post
I want to search the file and remove all instances of <a id="pageXXX"></a> where XXX is the page number.

I have tried

[...]

What am I missing?
Let's break it down into 3 separate pieces:

Step 1. Find the link with an id of "page":
  • <a id="page

Step 2. Find the numbers:
  • ????????

Step 3. Find the closing quote + end of the link:
  • "></a>

Steps #1 and #3 are the simple ones. You can type those in just like a normal search!

But #2 is a little tricky:

How do you search for numbers in Regex?

Instead of doing 9 separate searches for:
  • page1
  • page2
  • page3
  • [...]
  • page9

you can instead say: "Hey, after 'page', look for a number!"

This is where Regex's special symbols come into play:

Brackets [] stand for: "In this spot, look for a single character that is one of the characters listed inside."

So [0123456789] says: "Hey, look for the number 0 OR the number 1 OR the number 2 ... OR the number 9".

Brackets are also special: you can put in RANGES of characters:

Regex: page[0-9]

That says "Find the word 'page', then a number zero THROUGH nine".

But I don't just want to find a single digit... I want page numbers with lots of digits. How do I do that?

The plus sign + stands for "ONE OR MORE of the previous thing."

Regex: page[0-9]+

Now this says: "Find 'page', then find ONE OR MORE numbers zero through nine."
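If you want to see the difference the plus sign makes, here is a small Python sketch. Python's re module is used only as a stand-in for Sigil's search box, and the sample text is made up.

```python
import re

# Hypothetical sample text with one-, two- and three-digit page numbers.
text = "page7 page42 page123"

# Without the plus: "page" followed by exactly ONE digit.
single_digit = re.findall(r"page[0-9]", text)

# With the plus: "page" followed by ONE OR MORE digits.
all_digits = re.findall(r"page[0-9]+", text)

print(single_digit)  # ['page7', 'page4', 'page1']
print(all_digits)    # ['page7', 'page42', 'page123']
```

Without the +, the longer page numbers get cut off after their first digit.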

Putting It All Together

Here are the 3 pieces again:
  • Step 1: <a id="page
  • Step 2: [0-9]+
  • Step 3: "></a>

so your combined regex will be:

Search: <a id="page[0-9]+"></a>

which will match:

<a id="page1"></a>
<a id="page27"></a>
<a id="page123"></a>
<a id="page999"></a>
<a id="page123456"></a>
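The same search-and-replace-with-nothing can be sketched outside of Sigil too. This is only an illustration in Python: the function name strip_page_anchors and the sample markup are invented here, and in Sigil itself you would simply leave the Replace box empty.

```python
import re

# The combined search: opening piece, one or more digits, closing piece.
PAGE_ANCHOR = re.compile(r'<a id="page[0-9]+"></a>')

def strip_page_anchors(html: str) -> str:
    """Remove every empty page anchor by replacing it with nothing."""
    return PAGE_ANCHOR.sub("", html)

# Hypothetical sample markup; the "chapter1" anchor should survive.
sample = (
    '<p><a id="page27"></a>Chapter One<a id="chapter1"></a></p>\n'
    '<p>More text<a id="page123"></a></p>'
)

cleaned = strip_page_anchors(sample)
print(cleaned)
```

Only the anchors whose id is "page" plus digits are removed; everything else in the file is left alone.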


* * * * *

Extra: Regex's Special Symbol: \d

Just like the plus sign is a special symbol, there are also a few others.

Instead of typing "[0-9]" "[0-9]" "[0-9]" all the time, there's a shortcut for that:

\d = "Matches any single digit"

So these 2 are equivalent:
  • [0-9]
  • \d

So this says: "Find ONE OR MORE of any number zero through nine":
  • [0-9]+

and this says the same exact thing!:
  • \d+

So the searches recommended by JSWolf + BeckyEbook do the same thing:

Search: <a id="page[0-9]+"></a>
Search: <a id="page\d+"></a>
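A quick way to convince yourself that the two searches behave the same is to run both on the same text. This is a minimal Python sketch with a made-up sample string. One caveat that applies to Python specifically: on ordinary text strings \d can also match digits from other scripts unless you pass re.ASCII, so the two forms are only guaranteed identical for plain 0-9 page numbers like these.

```python
import re

# Hypothetical sample: two real page anchors and one that is not numeric.
sample = '<a id="page1"></a><a id="page27"></a><a id="pageX"></a>'

with_range = re.findall(r'<a id="page[0-9]+"></a>', sample)
with_shorthand = re.findall(r'<a id="page\d+"></a>', sample)

print(with_range)
print(with_shorthand)
print(with_range == with_shorthand)  # True
```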

Last edited by Tex2002ans; 04-15-2021 at 03:34 PM.
Old 04-15-2021, 03:33 PM   #8
JSWolf
@Tex2002ans excellent explanation about the regex used for this search.
Old 04-16-2021, 03:05 PM   #9
theducks
Quote:
Originally Posted by vijer View Post
I've been searching for a solution for hours, but haven't found any examples that help.

I want to search the file and remove all instances of <a id="pageXXX"></a> where XXX is the page number.

I have tried

(^<a id="page)(.*:?)("></a>)

(^<a id=\\"page)(.*:?)(\\"></a>)

(^<a id="page)([0-9]+)("></a>)

(^<a id=\\"page)([0-9]+)(\\"></a>)

What am I missing?
Why not use the Diaps Toolbag Plugin to remove the <a> tags?
It accepts arguments.