Old 12-06-2011, 01:20 AM   #1
Ripplinger
350 Hoarder
Posts: 3,574
Karma: 8281267
Join Date: Dec 2010
Location: Midwest USA
Device: Sony PRS-350, Kobo Glo & Glo HD, PW2
How to eliminate 1000 fonts in epub

I scanned a book, used MS Word for all the proofreading and editing, saved it as HTML, and am now using Sigil to clean it up. I have a feeling MS Word was the wrong program to use. It first put everything into text boxes (which were a pain to get rid of), and it also added hundreds of fonts and styles from the scan that I would like to eliminate. I don't mind a few pages having special fonts, like the table of contents or the publishing page, but throughout the body of the book I'd like just one font.

Is there an easy way to trim all the font classes and font styles in the epub? I was originally going to copy and paste the text back with no formatting, but that also killed the italics. There are a lot of them throughout the book, and it's important to me to keep them without reading line by line through the entire book again to find and re-italicize each one.

Any suggestions how to easily do this? I did try letting Calibre convert epub to epub, and even that stylesheet was over 2000 lines long. The size of the epub would probably be cut in half as well, it's at 480KB now. The epub does work, even linking the table of contents takes you to the correct chapter, but it has very slight changes of the font and spacing even on the same page throughout the book.

Thanks for any suggestions.
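
Before cleaning anything, it can help to measure the problem. The following is a minimal Python sketch, not a Sigil or Calibre feature, that tallies the distinct font-family values in an EPUB's stylesheets. It assumes the EPUB is an ordinary zip archive whose styles live in .css files; the function names and the file name book.epub are made up for illustration.

```python
import re
import zipfile
from collections import Counter

# Captures the value of a CSS font-family declaration, up to ';' or '}'.
FONT_DECL = re.compile(r"font-family\s*:\s*([^;}]+)", re.IGNORECASE)

def read_epub_css(epub):
    """Yield (member name, text) for every .css file inside the EPUB zip.

    `epub` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as book:
        for name in book.namelist():
            if name.lower().endswith(".css"):
                yield name, book.read(name).decode("utf-8", errors="replace")

def count_font_families(names_and_text):
    """Tally how often each font-family value appears."""
    counts = Counter()
    for _name, text in names_and_text:
        for value in FONT_DECL.findall(text):
            counts[value.strip().lower()] += 1
    return counts
```

Something like `count_font_families(read_epub_css("book.epub")).most_common(10)` would then list the ten most frequent font declarations, which gives a rough idea of how many near-duplicate styles the scan produced.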
Old 12-06-2011, 03:31 AM   #2
FizzyWater
You kids get off my lawn!
Posts: 4,220
Karma: 73492664
Join Date: Aug 2007
Location: Columbus, Ohio
Device: Oasis 2 and Libra H2O and half a dozen older models I can't let go of
Did you save it as simple HTML? I may not be using the right terminology, but I know it's recommended when using Word for HTML to save it with the stripped/simple version (that strips out all the custom Office codes).
Old 12-06-2011, 03:57 AM   #3
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
You need to save it as filtered HTML. But if you have everything in text boxes, something else did not work quite right.

Anyway, those font settings can be removed quite quickly with various RegExp search and replaces.
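
As a rough illustration of the kind of search and replace meant here, the Python sketch below strips font-family, font-size and letter-spacing declarations from HTML while leaving font-style and font-weight alone, so italics and bold survive. The choice of properties and the function name are assumptions, not something taken from this thread, and the pattern is deliberately simple: a font name wrapped in double quotes inside a style block would not be removed cleanly.

```python
import re

# One font-related CSS declaration, e.g. "font-family:Arial;" or "font-size:11.5pt".
# The lookbehind keeps prefixed properties such as mso-bidi-font-family intact.
FONT_PROP = re.compile(
    r"\s*(?<![-\w])(?:font-family|font-size|letter-spacing)\s*:[^;\"}]*;?",
    re.IGNORECASE,
)

# A style attribute left empty once its declarations are gone.
EMPTY_STYLE = re.compile(r"\s+style\s*=\s*\"\s*\"", re.IGNORECASE)

def strip_font_settings(html):
    """Remove font face/size declarations; keep italic and bold styling."""
    cleaned = FONT_PROP.sub("", html)
    return EMPTY_STYLE.sub("", cleaned)
```

The same two patterns can be typed into the regex search box of most text editors, running the second replace after the first.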
Old 12-06-2011, 04:36 AM   #4
DSpider
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
Word didn't put the text in text boxes. FineReader did, when you exported as "Exact Copy".

Word didn't add those styles either. FineReader v10 (don't know about version 11 yet) creates separate styles for bold and italic characters, and even MULTIPLE styles for the exact same bold or italic characters (instead of just using a previous style). Come to think of it, it's probably because the pages weren't perfectly flat when they were scanned, and/or the author sparsely used a different character spacing for various lines (so that they fit on the page). Hmmm... So it may not be FineReader's fault after all, for all these "ghost" styles...

However, there is a way to export to plain .txt without losing all the bold, italics and any other formatting you'd like, which would essentially eliminate all the styles. Check out this thread: https://www.mobileread.com/forums/sho...d.php?t=153764

Last edited by DSpider; 12-07-2011 at 08:17 AM.
Old 12-06-2011, 09:32 AM   #5
Ripplinger
Quote:
Originally Posted by DSpider View Post
Thank you! That did it in the least painful way. I'll still have to do some manual formatting and break it back up into separate chapters, but with 838 italics found over 280+ printed pages (390 pages on my reader), you saved my sanity.

Thanks for all the info about where I went wrong; it probably was a setting in FineReader, as mentioned. I did try saving as filtered HTML, by the way, but the 1000 fonts were still there; it didn't strip that formatting for some reason.
Old 12-08-2011, 04:58 AM   #6
Karl Murks
Member
Posts: 10
Karma: 10
Join Date: Dec 2011
Device: none
Quote:
Originally Posted by DSpider View Post
Word didn't put the text in text boxes. FineReader did, when you exported as "Exact Copy".

FineReader's page formatting is pretty shitty for the most part. I've yet to see a single document I scanned where it produced good results. I suffered through too much trouble before finding out that saving to a file format that doesn't have these obnoxious textboxes is really the only way to get a decent document.

What I do most of the time is to save as HTML, then load the saved document in a text editor (Notepad++) and remove the stylesheet references by looking for all '<span class=' occurrences and replacing them until none are left. I trust this method much more than relying on word processing software to keep a clean file and can be certain to be rid of all font changes FineReader inserts. After this process the only thing left will be italic, bold and similar effects.
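
For anyone who would rather script this than repeat the replace by hand, here is a minimal Python sketch of the same idea. It swaps the editor search and replace for the standard library's html.parser, so each dropped '<span class=...>' also loses its matching '</span>'. It assumes italics and bold are marked with their own tags (such as <i> and <b>) rather than with classed spans; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser

class SpanUnwrapper(HTMLParser):
    """Re-emit HTML unchanged, except that <span class=...> wrappers are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.span_stack = []  # one entry per open span: True if it was dropped

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            drop = any(name == "class" for name, _ in attrs)
            self.span_stack.append(drop)
            if drop:
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are passed through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack and self.span_stack.pop():
            return  # closing tag of a span we dropped
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")

def unwrap_class_spans(html):
    """Return html with every <span class=...> wrapper removed, contents kept."""
    parser = SpanUnwrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Spans without a class attribute are left alone, so this only touches the wrappers that point at the generated styles.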
Old 12-09-2011, 01:27 AM   #7
Darqref
space cadet
Posts: 333
Karma: 2999999
Join Date: Aug 2007
Location: Seattle area
Device: Rocket PRO, gen3, Pocketbook360
Quote:
Originally Posted by Karl Murks View Post
FineReader's page formatting is pretty shitty for the most part. I've yet to see a single document I scanned where it produced good results. I suffered through too much trouble before finding out that saving to a file format that doesn't have these obnoxious textboxes is really the only way to get a decent document.

What I do most of the time is to save as HTML, then load the saved document in a text editor (Notepad++) and remove the stylesheet references by looking for all '<span class=' occurrences and replacing them until none are left. I trust this method much more than relying on word processing software to keep a clean file and can be certain to be rid of all font changes FineReader inserts. After this process the only thing left will be italic, bold and similar effects.
I've been using an older copy of Omnipage 16, with my source a digital camera taking pictures of the pages. Omnipage has an export format called Wordpad(RTF). From experience, I know that Wordpad (an old application found in older versions of Windows, and probably still buried in the newer ones) uses a "limited" version of RTF. The text comes out much cleaner, and standard procedures of selecting all text and making it the same font, etc., will tend to clean up the rest. Then, the RTF doc can be opened by Word, for further text manipulation.

When I worked at MS as a test contractor, we called the Wordpad version of RTF the "true" RTF, and the Word version "Woozle" (dunno why, that's just what everyone else called it.) It had extended features of RTF, that was the MS way of trying to make it sorta proprietary and not truly compatible with other applications' implementation.

Even now, if you can find a version of Wordpad on your computer (look in the Accessories start menu), you should be able to have Word save as an RTF, open that RTF with Wordpad (which will ignore the features it doesn't understand), then save again from Wordpad under a different RTF filename. Voila, extra strange formatting gone, since Wordpad doesn't do styles.
Old 12-09-2011, 06:46 AM   #8
Karl Murks
Well, my HTML method is definitely faster:
Save - open in Notepad - do a handful of search&replace operations - load the cleaned HTML with your word processor of choice. No more font garbage - just nice clean text.

But I have to agree about RTF. It could be such a nice format if it hadn't been overloaded with all sorts of crap which modern word processors just thoughtlessly add to the file.
Old 12-12-2011, 07:37 AM   #9
GMcG
Writer
Posts: 101
Karma: 590630
Join Date: Mar 2011
Location: Munich, Germany
Device: none
To Ripplinger:

In a word processor like Word you can 'select all' and then choose the font you want. That leaves a single font for the whole document, with the bold and italic parts still in place.

George
Old 12-12-2011, 07:52 AM   #10
DSpider
I still vote for working your way up from plain text (without losing the bolds and italics) from Word:

1. Using FineReader export to DOCX as "Formatted Text". For illustrative purposes, here's a screenshot of FineReader 11 (they've added icons now, no more confusion resulting in thousands of text boxes):


2. Use the method I mentioned earlier to save as Plain Text, which will give you a squeaky clean document with bolds and italics, and do the layout - I always use my own customized 'Quick Styles' here. Much faster.

3. Export as HTML (maybe try Filtered HTML). Then go from there.

Last edited by DSpider; 12-12-2011 at 09:44 AM.