09-29-2012, 04:16 PM   #7
Serpentine
I'd suggest not killing too many tags in a single go. It can work well, but if you make mistakes it's often pretty costly. I'd rather use a few simpler expressions to weed out the unwanted elements, then finally do a cleaning pass for empty spans (or anything else). For example:

Anchors that seem to refer to a local filesystem (DOS style). Note: your example did not seem to have a closing a tag, so this only removes the opening tag.
Code:
<a\b[^<>]*?["/][[:alpha:]]:[/\\][^<>]*?>
Images whose alt text is next or prev:
Code:
<img[^<>]*?alt="(next|prev)"[^<>]*/>
Empty tags (this leaves tags containing nbsp alone, to be safe):
Code:
<(\w+)\b[^<>]*>\s*</\1>
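If you want to try the passes outside Sigil first, here is a minimal Python sketch adapted from the three expressions above. A few things are my assumptions rather than part of the originals: Python's re module has no [[:alpha:]] class, so [A-Za-z] stands in for it; the drive letter has to follow a quote or a slash, so that http:// is not treated as a drive; the empty-tag pass is rerun until nothing changes; and the clean helper and the sample markup are made up for illustration. Closing a tags orphaned by the first pass are not handled.

```python
import re

# Adapted from the expressions above (see the notes in the lead-in).
# re.ASCII keeps \s from matching a literal non-breaking space (U+00A0).
ANCHOR_LOCAL = re.compile(r'<a\b[^<>]*?["/][A-Za-z]:[/\\][^<>]*?>')
IMG_NAV = re.compile(r'<img[^<>]*?alt="(next|prev)"[^<>]*/>')
EMPTY_TAG = re.compile(r'<(\w+)\b[^<>]*>\s*</\1>', re.ASCII)

def clean(html):
    """Run the simple passes in order, then strip empty tags until stable."""
    html = ANCHOR_LOCAL.sub('', html)
    html = IMG_NAV.sub('', html)
    while True:
        # Removing an empty span can leave its parent empty, so repeat.
        html, count = EMPTY_TAG.subn('', html)
        if count == 0:
            return html

# Hypothetical sample markup, not taken from the thread.
sample = (
    '<p><a href="file:///C:/books/ch1.html">'
    '<img src="../Images/n.png" alt="next"/></p>'
    '<p><span class="x"> </span>Text</p>'
    '<div><span></span></div>'
    '<p>&nbsp;</p>'
)
print(clean(sample))
```

Keeping each pass as its own small expression means a mistake in one of them only costs you that pass, which is the whole point of not doing it all in one go.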
Added bonus:
The next one is better than the simple anchor example above. When you load an EPUB into Sigil, all of the text files end up in the Text directory, and links that refer to them use relative paths, like href="../Text/Blah.xhtml". This expression looks for any href or src that does not start with .. (one level up), so it also catches references to external content (sites and such, hello watermarks). It matches whole elements, including whatever is inside them, so be careful and do a find-only pass (grep) before replacing anything.
Code:
<(\w+)\b[^<>]*?(href|src)="(?!\.\.)[^"]*?"[^<>]*?(/>|>(?:(?!<\1\b).)*?</\1>)
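Because this one removes whole elements, it is worth listing what it would hit before replacing anything. A rough Python sketch of that "grep first" step follows. The preview helper and the sample markup are hypothetical, and the check against a nested tag of the same name is written here as a per-character lookahead, which is my adaptation, so that the match stops rather than running across a nested element.

```python
import re

# Adapted from the expression above: any element whose href/src value
# does not start with ".." (so: not a relative link into the book).
EXTERNAL = re.compile(
    r'<(\w+)\b[^<>]*?(href|src)="(?!\.\.)[^"]*?"[^<>]*?'
    r'(/>|>(?:(?!<\1\b).)*?</\1>)'
)

def preview(html):
    """Return (tag name, matched text) pairs without changing anything."""
    return [(m.group(1), m.group(0)) for m in EXTERNAL.finditer(html)]

# Hypothetical sample markup, not taken from the thread.
sample = (
    '<p><a href="../Text/Blah.xhtml">kept</a> '
    '<a href="http://example.com/">watermark</a> '
    '<img src="C:/tmp/logo.png" alt="logo"/></p>'
)

for tag, text in preview(sample):
    print(tag, '->', text)
```

Only once the listed matches look right would I swap finditer for sub. Note that . does not cross line breaks by default, so elements spanning several lines are skipped unless you pass re.DOTALL.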