MobileRead Forums - View Single Post

theducks · 02-24-2013, 10:46 AM

Quote:

Originally Posted by Dybbuk

I've been looking for a way in Sigil to delete everything in an epub except the stuff between <p> tags. In other words, to remove everything in a file but <p.*/p>.

It's easy to remove all the non p-tags with a regex - and wind up with plain text - but I'm stumped about how to remove all the non p-tags except the ones within paragraphs (such as <span>, <em>, etc.). I've Googled around and the consensus seems to be that regex is useless for parsing nested HTML tags. Is that really true?

Why not use Calibre to convert to TXT first?
Then convert the TXT (now cleaned of all style) to HTML/EPUB.
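
If you want to script that round trip, here is a rough sketch. It assumes Calibre's `ebook-convert` command-line tool is on your PATH; the helper names and the `_clean.epub` output name are made up for illustration. Keep in mind that going through TXT should also drop inline tags such as <em> and <span>, so it only helps if plain paragraphs are all you need.

```python
import shutil
import subprocess
from pathlib import Path


def round_trip_commands(epub_path):
    """Build the two ebook-convert calls: EPUB -> TXT, then TXT -> EPUB."""
    src = Path(epub_path)
    txt = src.with_suffix(".txt")
    # Hypothetical output name, chosen so the original file is not overwritten.
    out = src.with_name(src.stem + "_clean.epub")
    return [
        ["ebook-convert", str(src), str(txt)],
        ["ebook-convert", str(txt), str(out)],
    ]


def round_trip(epub_path):
    """Run both conversions, stopping at the first failure."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH (is Calibre installed?)")
    for cmd in round_trip_commands(epub_path):
        subprocess.run(cmd, check=True)
```

Calling `round_trip("book.epub")` would leave `book.txt` and `book_clean.epub` next to the original.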

IIRC, Notepad++ can strip HTML.
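
On the regex question: a single regex does struggle with nested tags, so if you would rather keep the inline markup inside paragraphs, a small parser is probably the safer route. Below is a minimal sketch using Python's standard `html.parser`, not a Sigil feature. The names `ParagraphKeeper` and `keep_paragraphs` are invented for this example, and it assumes reasonably well-formed XHTML, as you would normally find inside an epub.

```python
from html.parser import HTMLParser


class ParagraphKeeper(HTMLParser):
    """Collect <p> elements, including any tags nested inside them."""

    def __init__(self):
        # Leave entities such as &amp; untouched so they can be copied through.
        super().__init__(convert_charrefs=False)
        self.depth = 0   # how many <p> elements we are currently inside
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
        if self.depth:
            # Original tag text, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept only inside a paragraph.
        if self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.out.append(f"</{tag}>")
            if tag == "p":
                self.depth -= 1
                if self.depth == 0:
                    self.out.append("\n")

    def handle_data(self, data):
        if self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.out.append(f"&#{name};")


def keep_paragraphs(html):
    """Return only the <p>...</p> elements from html, one per line."""
    parser = ParagraphKeeper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Feed it the text of one XHTML file from the epub and it returns just the paragraphs, with <span>, <em> and friends left in place. Headings, divs, and anything else outside a <p> are dropped, so you would still need to wrap the result in a valid XHTML skeleton before putting it back into the book.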