02-12-2013, 09:31 AM   #8
steven522
binomial: homo legentem
Posts: 1,061
Karma: 25222222
Join Date: Feb 2010
Location: Alabama, USA
Device: iriver Story HD; Archos 80 G9
Quote:
Originally Posted by mr ploppy
Also you missed out the most important step at the end, proof reading. OCR, even when done by someone with computer skills, introduces errors. Wrong words, random added characters, etc.
True. Even the best OCR software in the world on the most crisp and clean scan ever will still throw up oddities like /' instead of ," and other misreads.
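If you want a head start on finding those, here is a minimal sketch of a script that flags lines worth re-checking against the scan. The pattern list is purely my own assumption (the /' misread mentioned above, plus a couple of other common-looking oddities), so treat it as a starting point and adjust it to whatever your OCR program tends to get wrong. It only points at suspicious lines; it does not replace reading the text.

```python
import re

# Hypothetical patterns for likely OCR misreads; extend to suit your own scans.
SUSPECT_PATTERNS = [
    (re.compile(r"/'"), "slash-apostrophe, possibly a misread quote mark"),
    (re.compile(r"[A-Za-z]\d[A-Za-z]"), "digit inside a word"),
    (re.compile(r"[,.;:]{2,}"), "doubled punctuation"),
]


def flag_suspects(text):
    """Return (line_number, reason, line) tuples for lines worth re-checking."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern, reason in SUSPECT_PATTERNS:
            if pattern.search(line):
                hits.append((number, reason, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "He said /'Hello there.\nA clean line.\nThe w0rd is wrong."
    for number, reason, line in flag_suspects(sample):
        print(f"line {number}: {reason}: {line}")
```

Note that the doubled-punctuation pattern will also flag a legitimate ellipsis typed as three dots, so expect some false alarms.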


I have scanned a lot of old books and found that the most direct route is:

- Get a scanner that will let you scan the open, face-down book and get both pages at once. Go through the entire book this way from cover to cover and you only have half as many scans as pages.

- Use Scan Tailor to process the images. It will rotate, split, deskew, clean and output nice clear TIFF images. At this point, if you are happy with just a copy of the book as-is (no reflow or text adjusting), you can stop and assemble the TIFF files into a PDF worth storing and reading on an ereader.

- Process the cleaned images through an OCR program. There are many out there and none of them is 100% accurate on the conversion. Depending on the program and the output formats available, I would choose something that retains as much of the original formatting as possible while still being easy to edit. My personal preference is to output HTML.

- Proofread the output file. Use your favorite editor to read through the file and cross-check it against the original book or scan. I find it easiest to open the scanned images in one window and the OCR edit screen in another, side by side, and then just browse through it. Double-check any strange output against the scan, correcting as you go.

- Assemble your final ebook. I prefer epub and use Sigil to create the final ebook. I add the cover scan, break out chapters, do final formatting and run epub checks before outputting the finished ebook. I have also taken the final HTML file and processed it through the Amazon system (email the single HTML file to your Kindle) to get a mobi file. This was really just to see what happened and is not part of my routine.
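
As a rough illustration of the "break out chapters" part, here is a minimal sketch that splits one big HTML string into per-chapter chunks. It assumes the OCR output marks each chapter title with an <h1> heading, which may well not be true for your program, and split_chapters is just a name I made up for the example. It is not how Sigil does it, only a way to pre-split a file by hand if you wanted to.

```python
import re


def split_chapters(html):
    """Split HTML into chunks, each starting at an <h1> heading.

    Assumption: chapter titles are marked with <h1>. Anything before the
    first heading (title page, front matter) comes back as its own chunk.
    """
    # The zero-width lookahead keeps each <h1> with the chapter it opens.
    # Splitting on an empty match like this needs Python 3.7 or later.
    parts = re.split(r"(?=<h1\b)", html, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    book = "<p>Front matter</p><h1>One</h1><p>a</p><h1>Two</h1><p>b</p>"
    for index, chunk in enumerate(split_chapters(book)):
        print(index, chunk)
```

Each chunk would still need wrapping in a proper XHTML page before it could go into an epub, so this only covers the splitting itself.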