04-15-2021, 07:43 AM | #1 |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2017
Device: phone
|
Regex to remove html tags
I've been searching for a solution for hours, but haven't found any examples that help.
I want to search the file and remove all instances of <a id="pageXXX"></a> where XXX is the page number. I have tried:
(^<a id="page)(.*:?)("></a>)
(^<a id=\\"page)(.*:?)(\\"></a>)
(^<a id="page)([0-9]+)("></a>)
(^<a id=\\"page)([0-9]+)(\\"></a>)
What am I missing? |
04-15-2021, 08:00 AM | #2 |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2017
Device: phone
|
Found the answer.
(<a id=\"page)(.*:?)(\"></a>) |
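A likely reason the first attempts found nothing is the leading ^, which only matches at the start of a line (or of the whole text, depending on the search mode). The thread does not say where the anchors sit in the file, so the sample below is an assumption: it puts each anchor inside another tag. This is a minimal sketch using Python's standard re module, not the editor's own search dialog.

```python
import re

# Hypothetical sample: two page anchors, neither at the start of its line.
sample = (
    '<p><a id="page12"></a>Some text</p>\n'
    '<h2><a id="page13"></a>Title</h2>\n'
)

# First attempt from the thread (digits variant): anchored with ^.
anchored = re.compile(r'(^<a id="page)([0-9]+)("></a>)', re.MULTILINE)
# Same pattern with the ^ dropped.
unanchored = re.compile(r'(<a id="page)([0-9]+)("></a>)')
# The pattern posted as the answer; \" matches a plain double quote.
escaped = re.compile(r'(<a id=\"page)(.*:?)(\"></a>)')

anchored_hits = len(anchored.findall(sample))
unanchored_hits = len(unanchored.findall(sample))
escaped_hits = len(escaped.findall(sample))

print(anchored_hits, unanchored_hits, escaped_hits)
```

On this sample, escaping the quotes makes no difference to what is found; dropping the ^ does.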
04-15-2021, 08:04 AM | #3 |
Guru
Posts: 692
Karma: 2180740
Join Date: Jan 2017
Location: Poland
Device: Misc
|
Code:
<a id="page\d+"></a>
You will understand everything. |
04-15-2021, 08:56 AM | #4 |
Grand Sorcerer
Posts: 27,552
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
What's the colon for?
Code:
(.*:?)
I'd probably use something like:
Code:
<a id="page\d+"></a>
Code:
<a id="page[^>]+"></a> |
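To see why the narrower patterns are safer than (.*:?), here is a small sketch in Python's re module. The input line is a made-up example with two page anchors on one line; whether real files look like this is an assumption. The greedy .* runs to the last "></a> on the line, so everything between the two anchors is removed as well, while \d+ and [^>]+ stop at the end of each tag. The optional colon (:?) adds nothing here.

```python
import re

# Hypothetical line: two page anchors with text between them.
line = '<a id="page1"></a>Keep this text<a id="page2"></a>'

# Greedy .* swallows everything up to the LAST closing part on the line.
greedy_result = re.sub(r'<a id="page.*:?"></a>', '', line)

# Digits only: each anchor is matched separately.
digits_result = re.sub(r'<a id="page\d+"></a>', '', line)

# Anything except '>': also stays inside a single tag.
not_gt_result = re.sub(r'<a id="page[^>]+"></a>', '', line)

print(repr(greedy_result))
print(repr(digits_result))
print(repr(not_gt_result))
```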
04-15-2021, 09:07 AM | #5 |
Guru
Posts: 970
Karma: 4999999
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
|
Disregard
|
04-15-2021, 10:16 AM | #6 |
Resident Curmudgeon
Posts: 73,998
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Search: <a id="page[0-9]+"></a>
Replace: |
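The same search with an empty replacement can be sketched outside the editor with Python's re module. The function name strip_page_anchors and the sample markup are hypothetical, chosen only for illustration.

```python
import re

# Search pattern from the post above; the replacement is the empty string.
PAGE_ANCHOR = re.compile(r'<a id="page[0-9]+"></a>')

def strip_page_anchors(html: str) -> str:
    """Remove every empty <a id="pageN"></a> anchor from the text."""
    return PAGE_ANCHOR.sub('', html)

# Hypothetical sample: one page anchor and one unrelated anchor.
sample = '<p><a id="page7"></a>Chapter One</p><p><a id="toc"></a>Contents</p>'
cleaned = strip_page_anchors(sample)
print(cleaned)
```

Only the page anchor is removed; the anchor with id="toc" is left alone because it does not fit the pattern.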
04-15-2021, 03:25 PM | #7 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Step 1. Find the link with an id of "page": <a id="page
Step 2. Find the numbers.
Step 3. Find the closing quote + end of the link: "></a>
Steps #1 and #3 are simpler. You can type those in just like a normal search!
But #2 is a little tricky: How do you search for numbers in Regex?
Instead of doing a separate search for every digit, you can say: "Hey, after 'page', look for a number!"
This is where Regex's special symbols come into play:
Brackets [] stand for: "Look for a single character from this set in this spot."
So [0123456789] says: "Hey, look for the number 0 OR the number 1 OR the number 2 ... OR the number 9".
Brackets are also special: you can put in RANGES of characters:
Regex: page[0-9]
That says "Find the word 'page', then a single digit zero THROUGH nine".
But I don't just want to find a single number... I want lots of numbers. How do I do that?
The plus sign + stands for "ONE OR MORE of the previous thing."
Regex: page[0-9]+
Now this says: "Find 'page', then find ONE OR MORE numbers zero through nine."
Putting It All Together
The 3 pieces, in order, are <a id="page then [0-9]+ then "></a>
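The building blocks can be checked one at a time and then joined. This is a sketch in Python's re module rather than the editor's search box; the helper whole_match and the split into three pieces (start, numbers, end) are my reading of steps 1 to 3, not something quoted from the post.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

explicit_set = whole_match(r'page[0123456789]', 'page7')   # one digit from a listed set
digit_range = whole_match(r'page[0-9]', 'page7')           # the same set written as a range
single_only = whole_match(r'page[0-9]', 'page27')          # brackets match ONE character
one_or_more = whole_match(r'page[0-9]+', 'page27')         # + allows more digits
needs_a_digit = whole_match(r'page[0-9]+', 'page')         # + still needs at least one

# Join the three pieces into one search pattern.
start, numbers, end = '<a id="page', '[0-9]+', '"></a>'
combined = start + numbers + end

print(explicit_set, digit_range, single_only, one_or_more, needs_a_digit)
print(combined)
```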
So your combined regex will be:
Search: <a id="page[0-9]+"></a>
which will match:
<a id="page1"></a>
<a id="page27"></a>
<a id="page123"></a>
<a id="page999"></a>
<a id="page123456"></a>
* * * * *
Extra: Regex's Special Symbol: \d
Just like the plus sign is a special symbol, there are also a few others.
Instead of typing "[0-9]" "[0-9]" "[0-9]" all the time, there's a shortcut for that:
\d = "Matches any digit"
So these 2 are equivalent: [0-9] and \d
So this says: "Find ONE OR MORE of any number zero through nine": [0-9]+
and this says the same exact thing!: \d+
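For ordinary page numbers the two spellings behave the same, which can be checked with a short sketch in Python's re module. One caveat applies to Python specifically: on text strings, \d also accepts decimal digits from other scripts unless the re.ASCII flag is set. Whether the editor's regex engine treats \d the same way is not stated in this thread, so take that part as an assumption about other engines.

```python
import re

# ASCII page ids taken from the examples in the post.
page_ids = ['page1', 'page27', 'page123', 'page999', 'page123456']

same_on_ascii = all(
    bool(re.fullmatch(r'page[0-9]+', p)) == bool(re.fullmatch(r'page\d+', p))
    for p in page_ids
)

# ARABIC-INDIC DIGIT THREE: a decimal digit outside 0-9.
other_digit = '\u0663'
backslash_d = re.fullmatch(r'\d', other_digit) is not None
bracket_range = re.fullmatch(r'[0-9]', other_digit) is not None
ascii_only_d = re.fullmatch(r'\d', other_digit, flags=re.ASCII) is not None

print(same_on_ascii, backslash_d, bracket_range, ascii_only_d)
```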
So the searches recommended by JSWolf + BeckyEbook do the same thing:
Search: <a id="page[0-9]+"></a>
Search: <a id="page\d+"></a>
Last edited by Tex2002ans; 04-15-2021 at 03:34 PM. |
|
04-15-2021, 03:33 PM | #8 |
Resident Curmudgeon
Posts: 73,998
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
@Tex2002ans excellent explanation about the regex used for this search.
|
04-16-2021, 03:05 PM | #9 | |
Well trained by Cats
Posts: 29,809
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
It accepts arguments |
|
|