Old 12-11-2017, 06:46 PM   #12
Tex2002ans
Quote:
Originally Posted by Hitch
My business does this every single day, and there's no magic, fast way. The only way to get there from here is to a) use Abbyy Fine Reader, which at least will remove the running heads/footers, and b) do the rest by hand. I wish there were a faster/better way, but there isn't.
Yep, I agree completely. PDF is an awful input source, and there is no really good way around it besides a lot of human elbow grease.

Quote:
Originally Posted by Hitch
Honestly, I've never noticed any sturm und drang retaining italics with Abbyy. Using Abbyy is, as you rightly say, a PITA, overall, but in my opinion, compared to all the other methods to get a PDF to an editable form, it's the best way to go. Maybe I'm just lucky that way, that we get a lot of books that are bold/italics laden, but we do.
Or, as long as you have FineReader 10 or higher, you can use its EPUB output. The EPUB output has relatively clean code with only a handful of inline styles.

I have 12 regexes to help clean up afterward (5 to handle Italics, Bold, BoldItalics, Smallcaps, and BoldSmallcaps, plus 7 to clean up some <td> elements and other anomalies).

Example:

Search: <span style="font-style:italic;">
Replace: <span class="italics">
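For anyone who would rather script this than run it by hand in an editor, here is a rough Python sketch of the same idea. Only the italics rule comes from the example above; the bold rule, the `bold` class name, and the `clean_inline_styles` helper are assumptions made up for illustration, not the actual 12 regexes, and the inline style FineReader emits for bold may look different.

```python
import re

# Ordered list of (search pattern, replacement) pairs.
RULES = [
    # Rule quoted in the post: inline italic style -> class.
    (re.compile(r'<span style="font-style:italic;">'), '<span class="italics">'),
    # ASSUMPTION: hypothetical bold rule in the same spirit, not from the post.
    (re.compile(r'<span style="font-weight:bold;">'), '<span class="bold">'),
]


def clean_inline_styles(html: str) -> str:
    """Apply each search/replace rule in order and return the cleaned markup."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


sample = '<p>She read <span style="font-style:italic;">Moby-Dick</span> twice.</p>'
print(clean_inline_styles(sample))
# prints: <p>She read <span class="italics">Moby-Dick</span> twice.</p>
```

Keeping the rules in one ordered list makes it easy to add the other cleanups later and rerun the whole set on every HTML file in the EPUB.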

Quote:
Originally Posted by RobertDDL
Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. [...] The loss of italics can be quite unfortunate, if the writer relies on them - not all of them do, though.
Getting the OCR correct is just a portion of the work. Formatting is just as important (and is where a lot of other tools fail miserably).

I've written about this in-depth before:

https://www.mobileread.com/forums/sh...72#post2883972

Most of your time is going to be spent editing and correcting the text/formatting, so the better you can get the input, the easier and faster those later steps will be.

Last edited by Tex2002ans; 12-11-2017 at 06:50 PM.