Thread: OCR engine
04-29-2014, 09:06 PM   #52
Tex2002ans
Quote:
Originally Posted by cadele
Now that I have Abbyy to do the OCR it has cut down enormously on the proofing, but it still takes ages. I make a special point not to calculate how many hours this takes me.
You should try keeping track of your hours. It is quite interesting to see how much faster/better you get at creating the ebooks.

As I mentioned, it used to take me two weeks of work to go from PDF -> finished EPUB. Now I pump out the typical non-fiction economics book in ~8-15 hours.

Side Note: I have a bunch of stats I have been gathering, and when I get some more free time I may create a topic on MobileRead showing off the "research". I haven't touched the spreadsheets since March (and still have a ton more info to add to them).

Here is a preview of the hours to convert + word count of the books since I started keeping in-depth track of my hours (~October 2012):

[Image: HourstoConvert.png]
[Image: TotalWordsPerBook.png]

and here is the word count of all books I have converted to EPUB:

[Image: TotalWordsPerBook.(All.Encompassing).png]

Quote:
Originally Posted by cadele
What I really need (after a good duplex scanner) is a cheat sheet of regex to cut down the proofing. Unfortunately I struggle with that - my mind is Teflon when it comes to regex
What is your current process?

Are you just using Finereader to OCR and output to DOC, and then do your proofing there? If you use Microsoft Word, your best bet would probably be to use Toxaris's tools: https://www.mobileread.com/forums/sho...d.php?t=213372

Or are you fixing mistakes in Finereader beforehand (this is my method, since it is very easy to A/B compare) and then doing your more thorough checking elsewhere? (I personally export from Finereader -> EPUB -> Sigil, and then do all the regex/fixing + final spellchecking there.)
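
To give a rough idea of what a regex "cheat sheet" for the cleanup stage could look like, here is a minimal Python sketch. The patterns are generic OCR cleanups chosen for illustration (they are assumptions, not rules taken from this thread or from any particular tool), and the names OCR_FIXES and clean_ocr_text are hypothetical. Similar patterns can be adapted for the Find & Replace in an editor such as Sigil, though the replacement syntax may differ.

```python
import re

# Hypothetical cleanup rules for illustration only: (compiled pattern, replacement).
# Rules are applied in order, top to bottom.
OCR_FIXES = [
    # Rejoin a word split by a hyphen at a line break: "exam-\nple" -> "example"
    (re.compile(r"([a-z])-\n([a-z])"), r"\1\2"),
    # Collapse runs of spaces/tabs into a single space
    (re.compile(r"[ \t]{2,}"), " "),
    # Remove stray spaces before punctuation: "word ," -> "word,"
    (re.compile(r" +([,.;:!?])"), r"\1"),
    # Fix one example of a letter-shape misread: "tbe" -> "the"
    (re.compile(r"\btbe\b"), "the"),
]


def clean_ocr_text(text: str) -> str:
    """Apply every cleanup rule in order and return the corrected text."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    sample = "tbe exam-\nple  text , here ."
    print(clean_ocr_text(sample))
```

Note that the first rule will also join genuine compounds that happen to break at a line end ("well-\nknown" becomes "wellknown"), so rules like this are safer reviewed match by match than applied blindly with Replace All.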

Quote:
Originally Posted by AJ Starr
It says it scans to OCR (which my current all-in-one does not)
The disadvantage of using the OCR that comes with the device is that it will be an old/obsolete version of the software.

For example, if you bought a scanner from Year ####, the scanner might come with Adobe Acrobat 7's OCR. (Since the scanner was made, versions 8+ have come out).

The same goes for the OCRed documents on Archive.org. They OCR the book at the time of submission (so let's say the book was scanned in 2007, it would be using whatever version of Finereader was around in 2007).

Newer versions of the OCR software most likely have more accurate hyphenation/layout/page/table algorithms, larger dictionaries, more accurate recognition of font/accents/italics/bold/superscript/subscript, etc. etc.

If you want more accuracy, your best bet is to rerun the documents through whatever the newest version of the software is. So for Archive.org, downloading the source document and re-OCRing it using Finereader 11 or 12 will give you a much better starting point.

Last edited by Tex2002ans; 04-29-2014 at 09:18 PM.