MobileRead Forums - View Single Post

Darqref · 12-29-2009, 11:28 PM

Quote:

Originally Posted by DMcCunney

I don't think the problem was Omnipage, though I agree there are better things out there.

One of the issues you face is that each page will be a scanned image, which will be separately OCRed, and the results saved to different files. Those files must be combined to be a complete book, and there the trouble starts. You're running into the divisions between files.

Which version of Word are you running, and how were you bringing the files into Word to edit? The last time I had to OCR stuff, I wasn't concerned with keeping formatting, and saved to text. I combined the text files from the command line (copy file1.txt+file2.txt+file3.txt... newfile.txt), then brought that into my preferred text editor for cleanup. (I generally use Notepad++, but have a dozen or so others installed. Among other things, I maintain a wiki devoted to text editors, and am always looking at new ones. See http://TextEditors.org)
______
Dennis
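For anyone not on a Windows command line, the merge step described in the quote can be sketched in a few lines of Python. This is only an illustration, not what Dennis ran: the function name `combine_text_files` and the file names are made up for the example, and it assumes the OCR output was saved as UTF-8 text.

```python
from pathlib import Path


def combine_text_files(parts, output):
    """Concatenate the files in `parts`, in order, into `output`.

    `parts` and `output` are hypothetical paths chosen by the caller.
    """
    with open(output, "w", encoding="utf-8") as out:
        for part in parts:
            text = Path(part).read_text(encoding="utf-8")
            out.write(text)
            # Keep one page's last line from running into the next page's first line.
            if text and not text.endswith("\n"):
                out.write("\n")
```

Calling `combine_text_files(["file1.txt", "file2.txt", "file3.txt"], "newfile.txt")` would then produce one file to open in a text editor for cleanup.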

I use an old but full version of Omnipage 16. Carefully check the options for the save formats; there is frequently more than one version of RTF. I use an RTF formatted for WordPad (the cheapo word processor installed on older Windows). I know from experience that Word does not use RTF - they use an internal EXTENSION of RTF that was known by testers as "woozle" for some gawd-awful reason. Saving to a version that is limited to be compatible with WordPad strips out the annoying text boxes, etc.

After that, it's easier to start the proofreading stage.

In my opinion, if you are intending to do any serious amount of OCR work, then purchasing a full version of the OCR engine of your choice will be best. The clipped versions that ship with the scanner are only intended to be good enough to entice you to buy the full version.

On my full version of Omnipage, I can load multiple files at once, which will then be recognized as one document for OCR processing. I'm still annoyed by how em dashes are sometimes recognized and sometimes not, and I still have a lot of trouble with capital I and lowercase l. And somebody, please make a spellcheck engine that will suggest a replacement because of a likely OCR error instead of a likely typing error!
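To show what I mean by that last wish, here is a rough Python sketch of a suggester that swaps look-alike characters instead of guessing at keyboard slips. Everything in it is an assumption for illustration: the `OCR_CONFUSIONS` table, the function name `ocr_suggestions`, and the tiny word list are made up, and it only handles single-character mix-ups like capital I versus lowercase l.

```python
from itertools import product

# Hypothetical confusion table: characters that OCR output tends to mix up.
# Each character maps to the set of characters it might really have been.
OCR_CONFUSIONS = {
    "I": "Il1",
    "l": "lI1",
    "1": "1lI",
    "O": "O0",
    "0": "0O",
}


def ocr_suggestions(word, dictionary, max_candidates=10000):
    """Return dictionary words reachable from `word` by swapping look-alike characters."""
    # For each position, list the characters that could have been misread.
    options = [OCR_CONFUSIONS.get(ch, ch) for ch in word]

    # Skip words with too many ambiguous characters to enumerate cheaply.
    total = 1
    for opt in options:
        total *= len(opt)
    if total > max_candidates:
        return []

    found = []
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate != word and candidate in dictionary and candidate not in found:
            found.append(candidate)
    return found


words = {"Illinois", "little", "India"}
print(ocr_suggestions("lllinois", words))  # ['Illinois']
print(ocr_suggestions("Iittle", words))    # ['little']
```

A real engine would need a full dictionary and multi-character confusions as well, but the idea is the same: rank fixes by what the scanner is likely to misread, not by what fingers are likely to mistype.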