Thread: OCR engine
Old 05-04-2014, 04:15 AM   #57
cadele

Quote:
Originally Posted by Tex2002ans View Post
You should try to keep track of hours; it is quite interesting to see how much faster/better you get at creating the ebooks.

. . .

What is your current process?

Are you just using Finereader to OCR and output to DOC, and then do your proofing there? If you use Microsoft Word, your best bet would probably be to use Toxaris's tools: https://www.mobileread.com/forums/sho...d.php?t=213372

Or are you fixing mistakes in Finereader beforehand (this is my method, since it is very easy to A/B compare), then doing your more thorough checking elsewhere? (I personally export from Finereader -> EPUB -> Sigil, and then do all the regex/fixing + final spellchecking there.)



The disadvantage of using the OCR software that comes with the device is that it will usually be an old/obsolete version.

For example, if you bought a scanner from Year ####, the scanner might come with Adobe Acrobat 7's OCR. (Since the scanner was made, versions 8+ have come out).

The same goes for the OCRed documents from Archive.org: they OCR the book at the time of submission (so let's say the book was scanned in 2007, it would be using whatever version of Finereader was around in 2007).

Newer versions of the OCR software most likely have more accurate hyphenation/layout/page/table algorithms, larger dictionaries, more accurate recognition of font/accents/italics/bold/superscript/subscript, etc. etc.

If you want more accuracy, your best bet is to rerun the documents through the newest version of the software. So for Archive.org, downloading the source document and re-OCRing it with Finereader 11 or 12 will give you a much better starting point.

I am going to start to keep some stats - you have inspired me!

My process has improved a bit. I now cut the spine off the book and run the pages through the ScanSnap (unless I want to preserve the book, in which case it is the dreaded flatbed scanner at work during my lunch break).

Then I open the file in Abbyy Finereader 12 and verify the text. This is slow but worth it. I then convert it to a Word document. Following that I set up my page size and layout. I usually try to match the book's general layout without being too OCD about it.

Then I start reading and correcting. I run a list of search-and-replace operations for the common OCR errors I have come up against. Once I finish that, I use Word's spellcheck to pick up whatever I have missed.
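
For anyone who would rather script that search-and-replace pass than click through it in Word, here is a rough Python sketch. The patterns in `OCR_FIXES` are hypothetical examples of typical OCR confusions, not my actual list, and `apply_fixes` is just a name made up for illustration. It assumes the text has been saved out as plain text first.

```python
import re

# Hypothetical examples only: build your own list from the errors
# you actually see in your scans.
OCR_FIXES = [
    (re.compile(r"\btbe\b"), "the"),      # "h" misread as "b"
    (re.compile(r"\bTbe\b"), "The"),
    (re.compile(r"\bwbich\b"), "which"),
    (re.compile(r" {2,}"), " "),          # runs of spaces collapse to one
    (re.compile(r" +([,.;:!?])"), r"\1"), # stray space before punctuation
]


def apply_fixes(text):
    """Apply every fix in order; return the new text and a count per pattern."""
    counts = {}
    for pattern, replacement in OCR_FIXES:
        text, n = pattern.subn(replacement, text)
        counts[pattern.pattern] = n
    return text, counts


if __name__ == "__main__":
    fixed, counts = apply_fixes("Tbe cat , wbich sat  on tbe mat .")
    print(fixed)
    for pattern, n in counts.items():
        print(n, "replacement(s) for", pattern)
```

The per-pattern counts are there so you can spot a rule that fires suspiciously often, which usually means it is too greedy and needs word boundaries.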

Then I add a TOC - the Stone Age way by inserting bookmarks then hyperlinks (I must learn how to do this in Calibre, it's getting ridiculous!).

Finally I add the book to Calibre, download the metadata and add the cover, then convert it to EPUB and MOBI (both types of MOBI).
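
The conversion step could probably be scripted with Calibre's `ebook-convert` command-line tool instead of the GUI. The sketch below only builds the commands. It assumes the Word file is saved as .docx, that `ebook-convert` is on the PATH, and that "both types of Mobi" corresponds to the `--mobi-file-type both` option; check that against your installed Calibre version before relying on it. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_commands(source, mobi_file_type="both"):
    """Return one ebook-convert command for EPUB and one for MOBI."""
    src = Path(source)
    epub = src.with_suffix(".epub")
    mobi = src.with_suffix(".mobi")
    return [
        ["ebook-convert", str(src), str(epub)],
        # Assumption: "both" asks Calibre for a combined old + new MOBI file.
        ["ebook-convert", str(src), str(mobi),
         "--mobi-file-type", mobi_file_type],
    ]


def convert(source):
    """Run each command in turn; raises if ebook-convert reports an error."""
    for cmd in build_convert_commands(source):
        subprocess.run(cmd, check=True)
```

Calling `convert("book.docx")` would then leave `book.epub` and `book.mobi` next to the source file. Metadata and cover would still need to be added in Calibre itself.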

Oh, and then I back it up. Thud.