MobileRead Forums - View Single Post

Greg Anos · 09-14-2009, 03:55 PM

Quote:

Originally Posted by ahi

I have this notion in my head...

What about taking a given document, OCR-ing it with at least 3 different OCR programs, and then parsing the results in parallel, character by character (perhaps now and then making an adjustment if one of the streams is out of line due to an erroneously detected additional character), and always putting into the output stream the character that most of the OCR-ed texts agree on?

Obviously this won't help with anything that the various OCR programs get wrong in the same way... but it might minimize the amount of clean-up to be done thereafter.

How realistic is such an approach? Anybody here tried it before?

- Ahi

The idea is excellent, but I don't know of anybody who has written flexible parsing software. As a matter of fact, the idea could be used for any OCR'ed texts...

Big problem will be with differences in the embedded control sequences...
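
For what it's worth, here is a rough sketch of how the voting idea from the quoted post might look in Python. It is only an illustration under some assumptions, not tested on real OCR output: the function names (align_to_pivot, vote_characters) are made up for this example, the first text is arbitrarily used as the reference that the others are lined up against, the alignment comes from the standard library's difflib.SequenceMatcher, and the inputs are assumed to be plain text with any control sequences already stripped.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_pivot(pivot, other):
    """Line `other` up against `pivot` (hypothetical helper).

    Returns (slots, gaps):
      slots[i] is the text in `other` aligned with pivot[i] ('' if missing),
      gaps[i]  is extra text `other` has just before pivot[i]
               (gaps[len(pivot)] holds trailing extras).
    """
    slots = [""] * len(pivot)
    gaps = [""] * (len(pivot) + 1)
    matcher = SequenceMatcher(None, pivot, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        segment = other[j1:j2]
        if tag == "equal":
            for k in range(i2 - i1):
                slots[i1 + k] = segment[k]
        elif tag == "replace":
            n = i2 - i1
            for k in range(n):
                slots[i1 + k] = segment[k] if k < len(segment) else ""
            if len(segment) > n:
                # Surplus characters are attached to the last slot of the span.
                slots[i2 - 1] += segment[n:]
        elif tag == "insert":
            gaps[i1] += segment
        # tag == "delete": the slots simply stay empty.
    return slots, gaps


def vote_characters(texts):
    """Majority vote, position by position, across several OCR outputs."""
    pivot, others = texts[0], texts[1:]
    aligned = [align_to_pivot(pivot, t) for t in others]
    out = []
    for i in range(len(pivot) + 1):
        # Vote on whether something belongs in the gap before position i.
        # The pivot votes for "nothing"; it is listed first, so it wins ties.
        gap_votes = Counter([""] + [gaps[i] for _, gaps in aligned])
        out.append(gap_votes.most_common(1)[0][0])
        if i < len(pivot):
            slot_votes = Counter([pivot[i]] + [slots[i] for slots, _ in aligned])
            out.append(slot_votes.most_common(1)[0][0])
    return "".join(out)


if __name__ == "__main__":
    samples = [
        "The qu1ck brown fox",   # made-up OCR output 1
        "The quick brovvn fox",  # made-up OCR output 2
        "The quick brown f0x",   # made-up OCR output 3
    ]
    print(vote_characters(samples))  # The quick brown fox
```

As the quoted post says, this does nothing for mistakes that every program makes in the same way. Aligning everything against a single reference text is also a simplification; it handles the "one stream out of line" case only as well as the pairwise alignment does.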