MobileRead Forums - View Single Post

DMcCunney · 12-29-2009, 06:15 PM

Quote:

Originally Posted by kazbates

@Dennis ~ I scanned the book on a Canon DR-2510 using the OmniPage OCR software (very limited in scope) bundled with the scanner. It placed the scanned document into a .opd file but I could have saved it as a pdf or tiff (I think there was one other format) using just the scanner software. After the OCR editing process, it gave me the option of saving as a doc, rtf or txt. When saving as a doc or rtf and then opening in Word for further editing, it placed each scanned page in a textbox. Saving as txt completely strips any formatting. My thought was to edit in Word and then resave as an html file per HarryT's guide. When I tried to save the textboxed version of the doc file as an html, it stripped all the formatting just as if it was a txt file. I'm thinking that the problem lies with the limited version of the OmniPage app but it could very easily be "operator error".

I don't think the problem was OmniPage, though I agree there are better things out there.

One of the issues you face is that each page will be a scanned image, which will be separately OCRed, and the results saved to different files. Those files must be combined to make a complete book, and there the trouble starts. You're running into the divisions between files.

Which version of Word are you running, and how were you bringing the files into Word to edit? The last time I had to OCR stuff, I wasn't concerned with keeping formatting, and saved to text. I combined the text files from the command line (copy file1.txt+file2.txt+file3.txt... newfile.txt), then brought that into my preferred text editor for cleanup. (I generally use Notepad++, but have a dozen or so others installed. Among other things, I maintain a wiki devoted to text editors, and am always looking at new ones. See http://TextEditors.org)
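If you'd rather script the combining step than type out a long copy command, something along these lines should do roughly the same job. This is only a sketch in Python, not what I actually ran: the function name combine_text_files and the utf-8 encoding are assumptions for illustration, and your OCR software may write its text files in a different encoding.

```python
from pathlib import Path


def combine_text_files(parts, output_path, encoding="utf-8"):
    """Concatenate the given text files, in the order given, into output_path.

    The encoding default is an assumption; change it to match your OCR output.
    """
    with open(output_path, "w", encoding=encoding) as out:
        for part in parts:
            text = Path(part).read_text(encoding=encoding)
            out.write(text)
            # Keep the last line of one page from running into
            # the first line of the next page.
            if text and not text.endswith("\n"):
                out.write("\n")
    return output_path
```

For example, combine_text_files(["file1.txt", "file2.txt", "file3.txt"], "newfile.txt") would write the three pages, in that order, to newfile.txt. If you build the list by sorting file names, number the pages with leading zeros (page001.txt rather than page1.txt) so they sort in reading order.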
______
Dennis