remove all a href....

cybmole · 08-08-2014, 03:18 PM

Has there been any discussion of, or any requests for, a tool to remove all <a href= type links from within a book (excluding the ones which have to be there in xhtml headers)?
A quick "search this forum" did not find anything.

I find these a pain to remove, as they are often entangled within spaghetti-like chapter header code, where they link back to an HTML TOC page; or they are arbitrarily added to place names, addresses etc. in a story.

In both cases, seeing blue underlined stuff is intrusive, and as I usually remove any HTML TOC page from my personal reading copies, I end up with broken links which upset my e-reader software if I tap them accidentally.

So it would be great for me if there was an editor function or a calibre conversion option that just zapped them all away.

eschwartz · 08-08-2014, 03:25 PM

You could do it with a regex, but I think DiapDealer's sample editor plugin might do a better job. Will require manually configuring the config file. Nope, doesn't seem to be configurable. He might add in link tag support, but in the meantime...

Hmmm. From a suggestion of mine in the Modify EPUB expansion discussion: https://www.mobileread.com/forums/sho...83#post2801083

Search:

Code:

<a href="[^<>]*">((?:(?!<(?:a|/a)).)*)</a>

Replace:

Code:

\1

How does that look? (In terms of working, not reading.)

Should handle nested tags. (And a byproduct is that if, for some godawful reason there are nested link tags which should NEVER happen, it'd still work. I could probably do this the short way, then, with lazy searching, but I like this masterpiece, plus I like copy-pasting previous solutions.)
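A minimal sketch of trying that search/replace pair outside the editor, assuming Python's standard `re` module and a made-up sample heading (neither is from the thread). One thing it shows: by default "." does not match a line break, so a link whose text wraps onto a second line is skipped unless a dot-all mode is switched on (`re.DOTALL` here, or a leading `(?s)` in the search expression).

```python
import re

# The search expression as posted above, unchanged.
PATTERN = r'<a href="[^<>]*">((?:(?!<(?:a|/a)).)*)</a>'

# Hypothetical heading whose link text wraps onto a second line.
sample = '<h2><a href="toc.html#c1"><b>CHAPTER I</b><br />\n  The Moon</a></h2>'

# Default mode: "." stops at the newline, so nothing matches.
one_line = re.sub(PATTERN, r'\1', sample)

# Dot-all mode: "." also crosses the newline, so the link is unwrapped.
dot_all = re.sub(PATTERN, r'\1', sample, flags=re.DOTALL)

print(one_line == sample)
print(dot_all)
```

The inner markup (`<b>`, `<br />`) survives in the dot-all result; only the `<a href=...>` and `</a>` wrapper goes.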

JSWolf · 08-08-2014, 04:41 PM

Before you remove all <a href= code, take a look to make sure that you are not removing any that matter.

A lot of them are used as filler to show where the page number is in a paper book. And then there are the ones that link back to the HTML ToC. Those can go too. And they look awful.

eschwartz · 08-08-2014, 04:45 PM

Quote:

Originally Posted by JSWolf

Before you remove all <a href= code, take a look to make sure that you are not removing any that matter.

A lot of them are used as filler to show where the page number is in a paper book. And then there are the ones that link back to the HTML ToC. Those can go too. And they look awful.

Unless I miss my guess, cybmole wants to zap EVERYTHING in the <body>. I assume he thought it out. In any event, backup copies are always good.

@cybmole, if it's in the header it would be a <link> to attach resources, not an <a> to create a clickable hyperlink.

JSWolf · 08-08-2014, 04:57 PM

Quote:

Originally Posted by eschwartz

Unless I miss my guess, cybmole wants to zap EVERYTHING in the <body>. I assume he thought it out. In any event, backup copies are always good.

But, it can be rather annoying to put in all the work to modify the eBook and then find out you goofed and have to start over.

cybmole · 08-08-2014, 05:20 PM

i know from bitter experience that i must not zap these

Code:

<link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />
<link href="../Styles/page_styles.css" rel="stylesheet" type="text/css" />

But if we talk about novels, not texts with footnotes, then I struggle to name any useful use of <a href.

Where I have previously gone wrong is in tackling the header cleanup in chunks: I've tried to zap the href bit only, with a view to then losing the <a> tags but preserving what is inside of them. But that also takes out the vital <link href=...> code.

Getting the <a href... and the closing </a> tag out of retail code like this example (that I quoted in the Sigil forum) is no fun!

Code:

<h2 class="chp"><a href="../Text/wizardandglass_con01.html#TOCC-6"><span class="chapnum"><b>CHAPTER I</b></span><br />
    B<span class="largecap">ENEATH THE</span> D<span class="largecap">EMON</span> M<span class="largecap">OON</span> (I)</a></h2>

my 1st get out of jail card is the discard option in sigil - exit without saving.
my 2nd is to go to my backup copy of calibre library, restore the backup of that book into my main library, & start over
I needed both cards, more than once, when tweaking the above code!

->eschwartz: I'll give your code a try-out on the next nasty I come across- thanks

eschwartz · 08-08-2014, 05:29 PM

Did you give my code a try? It should nuke <a href="link-location">content</a> pairs -- all of them -- while preserving any markup on the "content". If you need to fine-tune it any more, the important bit is the bit in multiple layers of parentheses. It uses the power of negative lookaheads to match any string except ones that include the excluded stuff (the tag names inside the (?!...) group).

Nuke tag sets while preserving nested instances and other markup:

Code:

<tag-to-nuke(?: optional-attribute(s)="[^<>]*")?>((?:(?!<(?:tag-to-nuke|/tag-to-nuke)).)*)</tag-to-nuke>

Replace:

Code:

\1
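The template above can be turned into a small helper. This is only a sketch under assumptions: it uses Python's `re` module, the function name `unwrap_tag` is made up for illustration, and it varies the posted pattern slightly (any attribute list is accepted, and `\b` keeps a tag name such as `a` from also matching `abbr`).

```python
import re

def unwrap_tag(html, tag):
    """Remove <tag ...>...</tag> pairs in one pass, keeping what was inside."""
    name = re.escape(tag)
    pattern = re.compile(
        rf'<{name}(?:\s[^<>]*)?>'      # opening tag, attributes optional
        rf'((?:(?!</?{name}\b).)*)'    # content that neither opens nor closes the tag
        rf'</{name}>',                 # matching close tag
        re.DOTALL,                     # let "." cross line breaks
    )
    return pattern.sub(r'\1', html)

sample = '<p><a class="c" href="x.html"><b>4</b></a> and <abbr>e.g.</abbr></p>'
print(unwrap_tag(sample, 'a'))
```

The `<b>` markup inside the link is kept and the `<abbr>` element is left alone, since only a tag named exactly `a` is matched.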

DiapDealer · 08-08-2014, 07:42 PM

Quote:

Originally Posted by eschwartz

You could do it with a regex, but I think DiapDealer's sample editor plugin might do a better job. Will require manually configuring the config file. Nope, doesn't seem to be configurable. He might add in link tag support, but in the meantime...

The tags you can change stuff TO are configurable, though only by editing the JSON settings file manually. Otherwise you're right... the original tag you're looking for is not configurable.

I've had another request along the same lines (<a> tags). I'm thinking about adding the ability to remove/modify 'a' tags, but haven't put the time in yet. I figure it's not that critical since non-nestable tags (of which the anchor tag is one [99.9 percent of the time anyway]) are pretty trivial to regex away in well-formed (x)html.

Quote:

Originally Posted by cybmole

Getting the <a href... and the closing </a> tag out of retail code like this example ( that I quoted in sigil forum) is no fun!

Code:

<h2 class="chp"><a href="../Text/wizardandglass_con01.html#TOCC-6"><span class="chapnum"><b>CHAPTER I</b></span><br />
    B<span class="largecap">ENEATH THE</span> D<span class="largecap">EMON</span> M<span class="largecap">OON</span> (I)</a></h2>

Can o' corn! I'd do something like this (in calibre only):
Search for:

Code:

</?a\M([^>]+)?>

Replace it with: nothing. nada. zip.

That should remove opening/closing 'a' tags leaving the text between them alone (as well as removing self-closing <a id="blah" /> entries).

Or something like this (in calibre OR Sigil):

Code:

</?a ?([^>]+)?>
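One caveat with the second expression is worth checking before a "replace all": because the space after `a` is optional, it also appears to match other tags whose names start with "a" (`<abbr>`, `<address>`, and so on). The sketch below, assuming Python's `re` module and an invented sample line, compares it with a stricter variant (my assumption, not from the thread) that requires whitespace or the end of the tag straight after the `a`.

```python
import re

ORIGINAL = r'</?a ?([^>]+)?>'      # as posted above
STRICTER = r'</?a(?:\s[^>]*)?>'    # assumed variant: "a", then whitespace or ">"

# Hypothetical line with a link, an <abbr> element and a self-closing anchor.
sample = ('<p><a class="x" href="toc.html">4</a> '
          '<abbr title="id est">i.e.</abbr><a id="p9" /></p>')

loose = re.sub(ORIGINAL, '', sample)    # the <abbr> tags disappear too
strict = re.sub(STRICTER, '', sample)   # only the <a> tags are removed

print(loose)
print(strict)
```

The calibre-only expression with `\M` avoids the same problem by anchoring at the end of the word `a`.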

DiapDealer · 08-08-2014, 08:44 PM

I realize my approach might seem a bit "nuclear" (removing all opening/closing/self-closing anchor tags in a document), but if you step through one at a time and make sure you don't delete an open tag but skip the close-tag (or vice-versa), it's not so bad. And it certainly leaves "link" stuff in the header alone.

Besides, I think we can all agree that the mass removal of <a> tags is fraught with peril to begin with. Even if you only focus on the seemingly innocuous ones with href attributes, that doesn't mean the "id"s of those anchors aren't the targets of nav elements in the ncx file, or spine/guide elements in the opf. In fact, that's especially likely in the case of some of those chapter-header monstrosity structures you pointed out.

So with all that in mind ... it seemed like you were in a bit of the "Damn the torpedoes! ... I've got checkpoints set" frame of mind anyway.

kovidgoyal · 08-09-2014, 12:06 AM

If you don't like blue underlined stuff, simply add a couple of CSS rules to make links not show up as blue and underlined, instead of removing them

cybmole · 08-09-2014, 01:59 AM

Quote:

Originally Posted by DiapDealer

Can o' corn! I'd do something like this (in calibre only):
Search for:

Code:

</?a\M([^>]+)?>

Replace it with: nothing. nada. zip.

That should remove opening/closing 'a' tags leaving the text between them alone (as well as removing self-closing <a id="blah" /> entries).

Or something like this (in calibre OR Sigil):

Code:

</?a ?([^>]+)?>

surely that strips the a tags but does not remove the href="whatever" bit ?

cybmole · 08-09-2014, 02:01 AM

Quote:

Originally Posted by kovidgoyal

If you don't like blue underlined stuff, simply add a couple of CSS rules to make links not show up as blue and underlined, instead of removing them

Been there, tried that.
It works for your viewer, yes, but not for ADE-based readers.
They decide that they know best and that if it's a link it's going to render as blue and underlined, no matter what you put in the CSS.

cybmole · 08-09-2014, 02:05 AM

Quote:

Originally Posted by DiapDealer

Besides, I think we can all agree that the mass removal of <a> tags is fraught with peril ...

OK but I'd still like examples of where it could make sense to want them, in a basic story-telling novel - one that does not use any footnotes.

the basic concept of a story is you begin at the beginning and read through to the end! - you don't want to be thrown off course by tapping some blue bit by accident or out of curiosity- especially if it's going to throw up some crap about needing to turn your wi-fi back on !
& I've not looked deeply into footnote coding techniques, but does best practice for those need the a href constructs ?

cybmole · 08-09-2014, 02:18 AM

Quote:

Originally Posted by eschwartz

...

Hmmm. From a suggestion of mine in the Modify EPUB expansion discussion: https://www.mobileread.com/forums/sho...83#post2801083

Search:

Code:

<a href="[^<>]*">((?:(?!<(?:a|/a)).)*)</a>

Replace:

Code:

\1

How does that look? (In terms of working, not reading.)

....

OK - so I looked at my current reads-in-progress for another test case.
your code did not fix the example below because there's a class after the <a

That's presumably always going to be the case if the book has gone through a calibre epub-to-epub conversion, because calibre will add classes to every tag.
The previous example I gave was from a completely unedited/ unconverted retail book, but my usual workflow for making a personal reading version is to load original into calibre & immediately convert it epub-to-epub , then tweak only within the resulting copy, never touch the original_epub backup.
I use the convert to add extra CSS so as to zap hyphenation & zap widows & orphans at the same time.

Code:

 <h1 class="calibre10" id="rw-h1_319849-00001"><a class="calibre7" href="../Text/9780857900135_toc.html">4</a></h1>

I'd want to reduce all that to
<h1 class="calibre10">4</h1>
the ID tag is redundant i.e. does not impact the reading experience in any way ?

this find worked ok though:

Code:

<a class="calibre\d" href="[^<>]*">((?:(?!<(?:a|/a)).)*)</a>
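Note that `calibre\d` matches a single digit only, so a link with class="calibre10" would need `calibre\d+`. A more general sketch, assuming Python's `re` module (the name `HREF_LINK` is made up for illustration), matches any `<a>` that carries an href regardless of which other attributes surround it, and leaves anchors without an href alone:

```python
import re

HREF_LINK = re.compile(
    r'<a\b[^<>]*\shref="[^<>]*"[^<>]*>'   # <a ...> with an href, any attribute order
    r'((?:(?!</?a\b).)*)'                 # link content, no nested <a>
    r'</a>',
    re.DOTALL,                            # let "." cross line breaks
)

sample = ('<h1 class="calibre10" id="rw-h1_319849-00001">'
          '<a class="calibre7" href="../Text/9780857900135_toc.html">4</a></h1>'
          '<p><a id="page12"></a>Text</p>')

result = HREF_LINK.sub(r'\1', sample)
print(result)
```

The id on the `<h1>` and the bare `<a id="page12">` anchor are deliberately left in place here; whether they are safe to drop depends on what else in the book points at them.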

DiapDealer · 08-09-2014, 02:34 AM

Quote:

Originally Posted by cybmole

surely that strips the a tags but does not remove the href="whatever" bit ?

I don't follow you. The href="whatever" bit is part of the 'a' tag. You can't strip the 'a' tags without removing the href. Just try the regex and you'll see exactly what it will remove. There's no need to wonder.

As for why you want to remove them (or why you don't think there's a 'normal' reason to want one in a typical novel), I don't really care. The fact is: nav entries in the ncx file quite often point to those anchors--as do the spine/guide elements of the opf. That makes blindly removing them quite risky, in my opinion. If you're 100% certain the nav entries and the spine/guide elements of your ebook's ncx/opf all point to HTML files directly (no URL fragments representing the ids of those 'a' tags), then of course the peril I spoke of is less. It has nothing to do with being a "basic storytelling novel", and everything to do with "that's just how some ebooks (even commercial ones) are constructed sometimes."
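The check described above (do any ncx/opf entries point at the ids of those anchors?) could be roughed out before deleting anything. This is a simplified sketch under loud assumptions: it uses Python's `re` module rather than a real XML parser, the function names and sample snippets are invented, it ignores which file a fragment belongs to, and it only looks at ids on `<a>` tags.

```python
import re

def referenced_fragments(*package_texts):
    """Collect every '#fragment' used in src="..." or href="..." attributes."""
    frags = set()
    for text in package_texts:
        frags.update(re.findall(r'(?:src|href)="[^"#]*#([^"]+)"', text))
    return frags

def anchors_in_use(html, frags):
    """Return ids of <a> tags in html that some collected fragment points at."""
    ids = re.findall(r'<a\b[^<>]*\sid="([^"]+)"', html)
    return [i for i in ids if i in frags]

# Hypothetical snippets standing in for toc.ncx, content.opf and a chapter file.
ncx = '<navPoint><content src="Text/ch01.html#TOCC-6"/></navPoint>'
opf = '<reference type="toc" title="Contents" href="Text/toc.html#start"/>'
chapter = ('<h2><a id="TOCC-6" href="toc.html">Chapter I</a></h2>'
           '<p><a id="page3"></a>Once upon a time</p>')

in_use = anchors_in_use(chapter, referenced_fragments(ncx, opf))
print(in_use)
```

Any id reported this way belongs to an anchor that navigation depends on, so that tag is one to keep (or to replace with an id on the parent element) rather than zap.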

Quote:

& I've not looked deeply into footnote coding techniques, but does best practice for those need the a href constructs ?

The short answer is "yes." Though it's not really a "best practice" thing so much as a "there's really no other way to do it" thing (in epub2 anyway).
