#1 |
350 Hoarder
Posts: 3,574
Karma: 8281267
Join Date: Dec 2010
Location: Midwest USA
Device: Sony PRS-350, Kobo Glo & Glo HD, PW2
|
How to eliminate 1000 fonts in epub
I scanned a book, did all the proofreading and editing in MS Word, saved it as HTML, and am now using Sigil to clean it up. I have a feeling MS Word was the wrong program to use. It first put everything into text boxes (which were a pain to get rid of), and it also added hundreds of different fonts and styles from the scan, which I would like to eliminate. I don't mind a few pages having special fonts, like the table of contents or the publishing page, but throughout the body of the book I'd like just one font.
Is there an easy way to trim all the font classes and font styles in the epub? I was originally just going to copy and paste the text back with no formatting, but that also killed the italics. There are a lot of them throughout the book, and it's important to me to keep them without having to read line by line through the entire book again to re-italicize them. Any suggestions on how to do this easily? I did try letting Calibre convert epub to epub, and even that stylesheet was over 2000 lines long. The epub would probably be cut in half in size as well; it's at 480KB now. The epub does work, and even the linked table of contents takes you to the correct chapter, but there are very slight changes of font and spacing throughout the book, even on the same page. Thanks for any suggestions.
#2 |
You kids get off my lawn!
Posts: 4,220
Karma: 73492664
Join Date: Aug 2007
Location: Columbus, Ohio
Device: Oasis 2 and Libra H2O and half a dozen older models I can't let go of
|
Did you save it as simple HTML? I may not be using the right terminology, but I know it's recommended when using Word for HTML to save it with the stripped/simple version (that strips out all the custom Office codes).
#3 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
You need to save it as filtered HTML. But if you have everything in text boxes, something else did not work exactly right.
Anyway, those font settings can be removed quite quickly with various RegExp search and replaces.
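To give an idea of what such a search and replace could look like, here is a rough Python sketch rather than something typed into Sigil's own search box. It assumes the font settings live in an ordinary CSS stylesheet (no `@font-face` blocks), and the function name and sample rules are made up for illustration. It drops every `font...` declaration except `font-style` and `font-weight`, so italics and bold survive.

```python
import re

# Any declaration whose property starts with "font" (font, font-family,
# font-size, ...) except font-style and font-weight, which carry the
# italics and bold we want to keep.
FONT_DECL = re.compile(r"\bfont(?!-style|-weight)[\w-]*\s*:[^;}]*;?\s*",
                       re.IGNORECASE)

def strip_font_declarations(css: str) -> str:
    """Remove font declarations from CSS text, keeping italic/bold ones."""
    return FONT_DECL.sub("", css)

# Hypothetical sample of the kind of rules a conversion might produce.
sample_css = (
    '.calibre7 { font-family: "Times New Roman", serif; font-size: 0.9em; '
    "font-style: italic; margin: 0 }\n"
    ".calibre8 { font: 11pt Arial; font-weight: bold }"
)

print(strip_font_declarations(sample_css))
```

Try it on a copy of the stylesheet first. Properties such as `font-variant` (small caps) are removed too, which may or may not be what you want.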
#4 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Word didn't put the text in text boxes. FineReader did, when you exported as "Exact Copy".
Word didn't add those styles either. FineReader v10 (don't know about version 11 yet) creates separate styles for bold and italic characters, and even MULTIPLE styles for the exact same bold or italic characters (instead of just reusing a previous style). Come to think of it, it's probably because the pages weren't perfectly flat when they were scanned, and/or the author occasionally used a different character spacing for various lines (so that they fit on the page). Hmmm... So it may not be FineReader's fault after all, for all these "ghost" styles...
However, there is a way to export to plain .txt without losing the bold, italics and any other formatting you'd like to keep, which would essentially eliminate all the styles. Check out this thread: https://www.mobileread.com/forums/sho...d.php?t=153764
Last edited by DSpider; 12-07-2011 at 08:17 AM.
#5 | |
350 Hoarder
Posts: 3,574
Karma: 8281267
Join Date: Dec 2010
Location: Midwest USA
Device: Sony PRS-350, Kobo Glo & Glo HD, PW2
|
Quote:
Thanks for all the info about where I went wrong; it probably was a setting in FineReader, as mentioned. I did try saving as filtered HTML, by the way, but the 1000 fonts were still there. It didn't strip that formatting for some reason.
#6 | |
Member
Posts: 10
Karma: 10
Join Date: Dec 2011
Device: none
|
Quote:
FineReader's page formatting is pretty shitty for the most part. I've yet to see a single document I scanned where it produced good results. I suffered through too much trouble before finding out that saving to a file format that doesn't have these obnoxious text boxes is really the only way to get a decent document. What I do most of the time is save as HTML, load the saved document in a text editor (Notepad++), and remove the stylesheet references by searching for all '<span class=' occurrences and replacing them until none are left. I trust this method much more than relying on word processing software to keep a file clean, and I can be certain to be rid of all the font changes FineReader inserts. After this process the only things left are italic, bold and similar effects.
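For anyone who would rather script it than click through Notepad++, the same idea can be sketched in Python. This is only an approximation of the manual search and replace, under the assumption that italics and bold are real `<i>`/`<b>` (or `<em>`/`<strong>`) tags and not span classes; the function name and sample markup are made up. It unwraps every `<span>`, innermost first, and keeps the text inside.

```python
import re

# Matches an innermost <span ...>...</span> pair, i.e. one with no other
# span tag between its opening and closing tags.
INNER_SPAN = re.compile(r"<span\b[^>]*>((?:(?!</?span\b).)*)</span>",
                        re.DOTALL | re.IGNORECASE)

def unwrap_spans(html: str) -> str:
    """Drop span tags but keep their contents, repeating until none are left."""
    previous = None
    while previous != html:
        previous = html
        html = INNER_SPAN.sub(r"\1", html)
    return html

# Hypothetical sample of nested font spans around an italic word.
sample = ('<p><span class="font3">It was <i>very</i> '
          '<span class="font7">late</span>.</span></p>')

print(unwrap_spans(sample))
```

If the export marks italics with a span class instead of `<i>` tags, this would throw them away along with the fonts, so check a page or two before running it on the whole book.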
#7 | |
space cadet
Posts: 333
Karma: 2999999
Join Date: Aug 2007
Location: Seattle area
Device: Rocket PRO, gen3, Pocketbook360
|
Quote:
When I worked at MS as a test contractor, we called the Wordpad version of RTF the "true" RTF, and the Word version "Woozle" (dunno why, that's just what everyone else called it). The Word version had extended RTF features; that was the MS way of trying to make it sort of proprietary and not truly compatible with other applications' implementations. Even now, if you can find a version of Wordpad on your computer (look in the Accessories start menu), you should be able to have Word save as RTF, open that RTF with Wordpad (which will ignore the features it doesn't understand), then save again from Wordpad to a different RTF filename. Voila, extra strange formatting gone, since Wordpad doesn't do styles.
#8 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2011
Device: none
|
Well, my HTML method is definitely faster:
Save, open in Notepad, do a handful of search and replace operations, then load the cleaned HTML in your word processor of choice. No more font garbage, just nice clean text. But I have to agree about RTF. It could be such a nice format if it hadn't been overloaded with all sorts of crap that modern word processors thoughtlessly add to the file.
#9 |
Writer
Posts: 101
Karma: 590630
Join Date: Mar 2011
Location: Munich, Germany
Device: none
|
To Ripplinger:
In a word processor like Word you can 'select all' and then choose the font you want. That way you get one single font for the whole document, and the bold and italic parts are still there.
George
#10 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
I still vote for working your way up from plain text (without losing the bolds and italics) in Word:
1. Using FineReader, export to DOCX as "Formatted Text". (FineReader 11 has added icons to these options, so no more confusion resulting in thousands of text boxes.)
2. Use the method I mentioned earlier to save as Plain Text, which will give you a squeaky clean document with bolds and italics, and do the layout. I always use my own customized 'Quick Styles' here. Much faster.
3. Export as HTML (maybe try Filtered HTML). Then go from there.
Last edited by DSpider; 12-12-2011 at 09:44 AM.