02-24-2013, 10:46 AM   #2
theducks
Grand Sorcerer
Posts: 14,630
Karma: 5628865
Join Date: Aug 2009
Location: (The original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
Quote:
Originally Posted by Dybbuk
I've been looking for a way in Sigil to delete everything in an EPUB except the content between <p> tags. In other words, to remove everything in a file but <p.*/p>.

It's easy to remove all the non p-tags with a regex - and wind up with plain text - but I'm stumped about how to remove all the non p-tags except the ones within paragraphs (such as <span>, <em>, etc.). I've Googled around and the consensus seems to be that regex is useless for parsing nested HTML tags. Is that really true?
Why not use Calibre to convert to TXT first?
Then convert the TXT (now cleaned of all styling) to HTML/EPUB.

IIRC, Notepad++ can strip HTML.
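
If you would rather keep the inline tags (<span>, <em>, etc.) than round-trip through TXT, a real parser copes with the nesting that a regex struggles with. Below is a rough sketch only, not something Sigil or Calibre provides: it assumes Python's standard-library html.parser and well-formed XHTML (every <p> is closed), and the names ParagraphKeeper and keep_paragraphs are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphKeeper(HTMLParser):
    """Collect every <p>...</p> element, keeping the inline tags inside it.

    Hypothetical helper for illustration; assumes well-formed XHTML.
    """

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed
        # through unchanged instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.depth = 0        # number of <p> elements we are currently inside
        self.current = []     # pieces of the paragraph being collected
        self.paragraphs = []  # finished paragraphs, as markup strings

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
        if self.depth:
            # Keep the tag exactly as written, attributes included.
            self.current.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept only inside a paragraph.
        if self.depth:
            self.current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return  # closing tag outside any paragraph: drop it
        self.current.append(f"</{tag}>")
        if tag == "p":
            self.depth -= 1
            if self.depth == 0:
                self.paragraphs.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.current.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.current.append(f"&#{name};")


def keep_paragraphs(markup):
    """Return a list of the <p> elements found in markup, as strings."""
    parser = ParagraphKeeper()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs
```

You would have to run something like this over each XHTML file inside the EPUB and write the kept paragraphs back into a minimal page yourself; that part is left out here.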