Strip all formatting?

06-27-2013, 03:19 AM
Morning all.

I'm having some difficulty with the formatting of a couple of my epubs, so I'd like to work out a method for stripping out all of the existing formatting and then applying some basic formatting rules.
I use Calibre to manage my books, but I also have Sigil installed from an abortive attempt to reformat a malformed book (a whole series of books mashed together in one humongous file; it didn't go well!).

06-27-2013, 03:51 AM
Delete all CSS files, all <style> elements in the headers and all "style=..." attributes in other elements.
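
For anyone who would rather script that than do it by hand, here is a minimal Python sketch of the same idea, run on the text of one XHTML file at a time. The function name `strip_css` is made up for illustration, and removing the `<link rel="stylesheet">` tags is my own assumption about what should happen once the CSS files are deleted. It uses regular expressions rather than a real XML parser, so treat it as a rough tool and keep a backup of the book.

```python
import re

# <style ...>...</style> blocks, including ones that span several lines.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
# <link ...> tags that point at a stylesheet (assumption: these should go
# too, since the CSS files they reference are being deleted).
CSS_LINK = re.compile(r"<link\b[^>]*\brel\s*=\s*[\"']stylesheet[\"'][^>]*>", re.IGNORECASE)
# Inline style="..." or style='...' attributes, with the whitespace before them.
STYLE_ATTR = re.compile(r"\s+style\s*=\s*(\"[^\"]*\"|'[^']*')", re.IGNORECASE)


def strip_css(xhtml: str) -> str:
    """Return the markup with style blocks, stylesheet links and style attributes removed."""
    xhtml = STYLE_BLOCK.sub("", xhtml)
    xhtml = CSS_LINK.sub("", xhtml)
    return STYLE_ATTR.sub("", xhtml)


if __name__ == "__main__":
    sample = '<p style="font-size:12pt" class="x">Hi <em>there</em></p>'
    print(strip_css(sample))
```

Tags such as `<em>` and `<strong>` are left alone, as are `class` attributes, which do nothing once the stylesheets are gone.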

06-27-2013, 05:00 AM
Be careful not to remove any bold or italics, which quite frankly are the essence, or the soul, of a book.

Edit: I know that ABBYY FineReader will set various styles and sizes, when in fact there's just one size (and style) for the body text throughout the entire book!

It also assigns styles for bolds and italics, when in fact it should be using "<em>" and "<strong>" tags.
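
As a rough illustration of that conversion, the Python sketch below rewrites spans that carry an inline bold or italic style into "<strong>" and "<em>". It assumes the styling sits in an inline `style` attribute on a `<span>`; a FineReader export may well use CSS classes instead, in which case the check would have to look at class names. The function name `span_to_semantic` is hypothetical, and the regex only handles spans that contain no other spans.

```python
import re

# A <span ...>...</span> pair whose content contains no other span tags.
SPAN = re.compile(
    r"<span\b([^>]*)>((?:(?!</?span\b).)*?)</span\s*>",
    re.IGNORECASE | re.DOTALL,
)


def span_to_semantic(xhtml: str) -> str:
    """Turn inline-styled bold/italic spans into <strong>/<em>."""

    def repl(match: re.Match) -> str:
        attrs = match.group(1).lower()
        inner = match.group(2)
        if re.search(r"font-weight\s*:\s*bold", attrs):
            return f"<strong>{inner}</strong>"
        if re.search(r"font-style\s*:\s*italic", attrs):
            return f"<em>{inner}</em>"
        # Any other span is returned unchanged.
        return match.group(0)

    return SPAN.sub(repl, xhtml)
```

A step like this has to run before any "style=..." attributes are deleted, otherwise the information about which spans were bold or italic is already lost.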

Edit #2: In order to strip the bad FineReader formatting, I always export from FineReader as RTF, open it in Word 2010 and run my custom macro (from the signature). Then I redo the layout (using styles) in InDesign or, alternatively, in Word using Quick Styles, and then export as HTML.

06-27-2013, 05:57 AM
On a few desperate occasions, I've opened the epub's XHTML files in a browser and saved them as plain text. Since I am a command line fanatic, I used lynx (lynx -dump file.xhtml > file.txt).
To preserve some of the formatting, such as italics and bold, I first did a search and replace on the XHTML file (e.g. <em> -> {em}).
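
The same two-step trick can be sketched in Python for anyone without lynx. This is not what lynx does internally; it swaps in the standard library's `html.parser` to drop the tags, and unlike `lynx -dump` it does no layout at all. The `{/em}` closing placeholder and the names `protect_emphasis`, `TextDump` and `to_text` are my own assumptions for illustration.

```python
import re
from html.parser import HTMLParser


def protect_emphasis(xhtml: str) -> str:
    """Replace <em>, </em>, <strong>, </strong> with {em}, {/em}, etc."""
    return re.sub(r"<(/?)(em|strong)\s*>", r"{\1\2}", xhtml, flags=re.IGNORECASE)


class TextDump(HTMLParser):
    """Collects only the text content and ignores every tag."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def to_text(xhtml: str) -> str:
    """Protect the emphasis tags, then reduce the markup to plain text."""
    parser = TextDump()
    parser.feed(protect_emphasis(xhtml))
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(to_text("<p>A <em>very</em> <strong>bold</strong> move.</p>"))
```

Once the basic formatting has been reapplied, the placeholders can be turned back into tags with the reverse search and replace.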