Best practice to OCR and convert PDF to text or html or epub - Page 2

ProDigit · 12-14-2015, 08:00 PM

I'm entering the conversation quite late.
I run the scan through the OCR, then clean up the output in Notepad++.
Depending on how many scans you do, you can record macros in Notepad++ to start removing the recurring errors.

I have about five kinds of older books (black text on yellowed paper). I noticed the OCR makes repetitive mistakes, like changing "I" to "L", or "are" to "ame".

Notepad++ has a very advanced search-and-replace feature. Once I start reading the book from the top and find an error (say it wrote "plumtree" as "plumlree"), I search for "lree" and replace it with "tree". That way, it also fixes future 'plumlrees' as well as 'applelrees' or 'pearlrees'.
Doing a few similar books at a time, you can learn your OCR's errors and map them in a macro.
Write the macro, then apply it to the book before you even start correcting it by hand.
This method will not work very well, or at all, when your OCR input comes from different sources.
It mainly works when you scan the books yourself on one and the same scanner, usually at the same resolution.
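
For anyone who would rather script this than record a Notepad++ macro, here is a minimal Python sketch of the same idea: a table of known OCR mistakes applied to the whole text in one pass. The names `OCR_FIXES` and `apply_fixes` and the table entries are assumptions for illustration (taken from the examples above), not part of Notepad++ or any OCR tool. You would build your own table from the errors your scanner actually produces.

```python
import re

# Hypothetical substitution table: each key is a regex for a garbled
# fragment the OCR produces repeatedly, each value is the intended text.
OCR_FIXES = {
    r"lree": "tree",     # plumlree -> plumtree, applelrees -> appletrees
    r"\bame\b": "are",   # whole word only, so "name" or "game" is left alone
}

def apply_fixes(text, fixes=OCR_FIXES):
    """Apply every known substitution to the text, in table order."""
    for pattern, replacement in fixes.items():
        text = re.sub(pattern, replacement, text)
    return text

if __name__ == "__main__":
    sample = "The plumlree and the applelrees ame old."
    print(apply_fixes(sample))
```

As with the macro, a table like this only pays off when the books come from the same scanner and settings, since the error patterns are what it encodes. Broad fragments such as "lree" can also hit legitimate words, so it is worth skimming the changes before trusting them.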

For low-resolution scans like the one above, I would recommend trying to download a text copy of the book, loading it side by side with the page image, and manually applying corrections or formatting changes to the text. That is the only alternative to correcting a rather lousy OCR conversion, which will probably look bad no matter what software you use.

Last edited by ProDigit; 12-14-2015 at 08:03 PM.
