How to eliminate 1000 fonts in epub

Ripplinger · 12-06-2011, 02:20 AM

I scanned a book, used MS Word to do all the proofreading and editing, saved it as HTML, and am now using Sigil to clean it up. I have a feeling MS Word was the wrong program to use. It first put everything into text boxes (which were a pain to get rid of), but it also added hundreds of various fonts and styles from the scan, which I would like to eliminate. I don't mind a few pages having special fonts, like the table of contents or the publishing page, but throughout the body of the book I'd like just one font.

Is there an easy way to trim all the font classes and font styles that are in the epub? I originally was just going to copy/paste it back with no formatting, but that method also killed any italics. There are a lot of them throughout the book, and it's important to me to keep them as italics without having to read line-by-line through the entire book again to find them (that would be painful, there are that many of them).

Any suggestions on how to easily do this? I did try letting Calibre convert epub to epub, and even that stylesheet was over 2000 lines long. The size of the epub would probably be cut in half as well; it's at 480KB now. The epub does work, and even the linked table of contents takes you to the correct chapter, but there are very slight changes of font and spacing even on the same page throughout the book.

Thanks for any suggestions.

FizzyWater · 12-06-2011, 04:31 AM

Did you save it as simple HTML? I may not be using the right terminology, but I know it's recommended when using Word for HTML to save it with the stripped/simple version (that strips out all the custom Office codes).

Toxaris · 12-06-2011, 04:57 AM

You need to save it as filtered HTML. But if you have everything in text boxes, something else did not exactly work right.

Anyway, those font settings can be removed quite quickly with various RegExp search and replaces.
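For anyone who wants to try the search-and-replace route on the stylesheet, here is a minimal Python sketch of one such pass. It is not necessarily the set of expressions Toxaris had in mind; the pattern and the function name `strip_font_declarations` are assumptions made for illustration. It removes only `font-family` and `font-size` declarations from CSS text and leaves `font-style` and `font-weight` alone, so italic and bold rules survive. It also leaves the `font` shorthand untouched, since that can carry italic or bold.

```python
import re

# One font-family or font-size declaration, up to and including its semicolon.
# font-style and font-weight do not match, so italic and bold rules are kept.
FONT_DECL = re.compile(
    r"(?<![-\w])font-(?:family|size)\s*:[^;}]*;?\s*",
    re.IGNORECASE,
)


def strip_font_declarations(css: str) -> str:
    """Return the CSS text with font-family and font-size declarations removed."""
    return FONT_DECL.sub("", css)


# Hypothetical rule, similar in shape to what a converter might generate.
sample = '.c5 { font-family: "Times", serif; font-style: italic; font-size: 1em }'
print(strip_font_declarations(sample))
```

This is meant for the stylesheet text only. Running it over raw HTML with inline `style="..."` attributes could eat past the closing quote when the last declaration has no semicolon, and unusual CSS (semicolons inside quoted font names, `@font-face` blocks) would need more care. Work on a copy of the file.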

DSpider · 12-06-2011, 05:36 AM

Word didn't put the text in text boxes. FineReader did, when you exported as "Exact Copy".

Word didn't add those styles either. FineReader v10 (don't know about version 11 yet) creates separate styles for bold and italic characters, and even MULTIPLE styles for the exact same bold or italic characters (instead of just using a previous style). Come to think of it, it's probably because the pages weren't perfectly flat when they were scanned, and/or the author sparsely used a different character spacing for various lines (so that they fit on the page). Hmmm... So it may not be FineReader's fault after all, for all these "ghost" styles...

However, there is a way to export to plain .txt without losing all the bold, italics and any other formatting you'd like, which would essentially eliminate all the styles. Check out this thread: https://www.mobileread.com/forums/sho...d.php?t=153764

Ripplinger · 12-06-2011, 10:32 AM

Quote:

Originally Posted by DSpider

https://www.mobileread.com/forums/sho...d.php?t=153764

Thank you! That did it in the least painful way. I'll still have to do some manual formatting and break it back up into separate chapters, but with 838 italics found over 280+ printed pages (390 pages on my reader), you saved my sanity.

Thanks for all the info about where I went wrong; it probably was a setting in FineReader, as mentioned. I did try saving as filtered HTML btw, but the 1000 fonts were still there. It didn't strip out that formatting for some reason.

Karl Murks · 12-08-2011, 05:58 AM

Quote:

Originally Posted by DSpider

Word didn't put the text in text boxes. FineReader did, when you exported as "Exact Copy".

FineReader's page formatting is pretty shitty for the most part. I've yet to see a single document I scanned where it produced good results. I suffered through too much trouble before finding out that saving to a file format that doesn't have these obnoxious textboxes is really the only way to get a decent document.

What I do most of the time is to save as HTML, then load the saved document in a text editor (Notepad++) and remove the stylesheet references by looking for all '<span class=' occurrences and replacing them until none are left. I trust this method much more than relying on word processing software to keep a clean file, and I can be certain to be rid of all the font changes FineReader inserts. After this process the only things left will be italic, bold and similar effects.
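The same idea can be scripted instead of done by hand in an editor. Below is a minimal Python sketch using only the standard library's `html.parser`; the class name `SpanUnwrapper` and the helper `unwrap_class_spans` are made up for this illustration, and this is not the exact search-and-replace sequence described above. It assumes, as the post does, that italics and bold are real `<i>`/`<b>` tags rather than span classes. It drops every `<span class=...>` wrapper together with its matching `</span>`, and re-emits everything else as it was.

```python
from html.parser import HTMLParser


class SpanUnwrapper(HTMLParser):
    """Re-emit HTML as-is, except that <span class=...> wrappers are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities like &amp; intact
        self.out = []
        self.span_stack = []  # one flag per open <span>: True if it was dropped

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            drop = any(name == "class" for name, _ in attrs)
            self.span_stack.append(drop)
            if drop:
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Drop a </span> only if its matching opening tag was dropped.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def unwrap_class_spans(html: str) -> str:
    """Return html with every <span class=...>...</span> wrapper removed."""
    parser = SpanUnwrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical fragment; the class name is invented for the example.
sample = '<p><span class="font12">He said <i>no</i>.</span></p>'
print(unwrap_class_spans(sample))
```

Tracking open spans on a stack is the main difference from a blind search and replace: a `</span>` is removed only when the span it closes was removed, so spans without a class attribute stay balanced. If the scan expresses italics through span classes instead of `<i>` tags, this would strip them too, so check a sample page first.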

Darqref · 12-09-2011, 02:27 AM

Quote:

Originally Posted by Karl Murks

FineReader's page formatting is pretty shitty for the most part. I've yet to see a single document I scanned where it produced good results. I suffered through too much trouble before finding out that saving to a file format that doesn't have these obnoxious textboxes is really the only way to get a decent document.

What I do most of the time is to save as HTML, then load the saved document in a text editor (Notepad++) and remove the stylesheet references by looking for all '<span class=' occurrences and replacing them until none are left. I trust this method much more than relying on word processing software to keep a clean file, and I can be certain to be rid of all the font changes FineReader inserts. After this process the only things left will be italic, bold and similar effects.

I've been using an older copy of Omnipage 16, with a digital camera taking pictures of the pages as my source. Omnipage has an export format called Wordpad (RTF). From experience, I know that Wordpad (an old application found in older versions of Windows, and probably still buried in the newer ones) uses a "limited" version of RTF. The text comes out much cleaner, and standard procedures like selecting all the text and making it the same font will tend to clean up the rest. Then the RTF doc can be opened by Word for further text manipulation.

When I worked at MS as a test contractor, we called the Wordpad version of RTF the "true" RTF, and the Word version "Woozle" (dunno why, that's just what everyone else called it). The Word version had extended RTF features; that was the MS way of trying to make it sorta proprietary and not truly compatible with other applications' implementations.

Even now, if you can find a version of Wordpad on your computer (look in the Accessories start menu), you should be able to have Word save as an RTF, open that RTF with Wordpad (which will ignore the features it doesn't understand), then save again from Wordpad under a different RTF filename. Voila, extra strange formatting gone, since Wordpad doesn't do styles.

Karl Murks · 12-09-2011, 07:46 AM

Well, my HTML method is definitely faster:
Save - open in Notepad - do a handful of search&replace operations - load the cleaned HTML with your word processor of choice. No more font garbage - just nice clean text.

But I have to agree about RTF. It could be such a nice format if it hadn't been overloaded with all sorts of crap which modern word processors just thoughtlessly add to the file.

GMcG · 12-12-2011, 08:37 AM

To Ripplinger:

In a word processor like Word you may 'select all' and then choose the font you want. By this you will get only one single font for the whole document and the bold and italic parts are still there.

George

DSpider · 12-12-2011, 08:52 AM

I still vote for working your way up from plain text (without losing the bolds and italics) from Word:

1. In FineReader, export to DOCX as "Formatted Text". (The post included a screenshot of FineReader 11 here for illustration; they've added icons now, so no more confusion resulting in thousands of text boxes.)

2. Use the method I mentioned earlier to save as Plain Text, which will give you a squeaky clean document with bolds and italics, and do the layout - I always use my own customized 'Quick Styles' here. Much faster.

3. Export as HTML (maybe try Filtered HTML). Then go from there.



