Quote:
Originally Posted by Ralph Sir Edward
Igorsky, how does the Google Epub OCR conversion stack up to Finereader's?
Would it be easier to clean up Google's version or Finereader's?
|
I'm quite sure Finereader will beat Google for almost any particular book.
1) ABBYY has been working in this area for a long time and pretty clearly has the best commercial engine on the market. I'm not sure what Google is using (probably Tesseract), but they have only been at it for a couple of years. I admit Google has its own pool of PhDs, so the situation might change in the future.
2) Google cannot afford to manually tune the OCR for each book; the volume is just too big. With Finereader you can check the results and adjust settings as needed, which can help quite a lot.
Quote:
Originally Posted by ahi
I have this notion in my head...
What about taking a given document, OCR-ing it with three or more different OCR programs, and then parsing them in parallel, character by character (perhaps now and then making an adjustment if one of the streams is out of line due to an erroneously detected additional character), always putting into the output stream the character that most of the OCR'd texts agree on.
Obviously this won't help with anything that the various OCR programs get wrong in the same way... but it might minimize the amount of clean-up to be done thereafter.
How realistic is such an approach? Anybody here tried it before?
|
Actually, yes. I've seen a couple of papers and it seems it does help, though setting it up is probably not trivial.
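For what it's worth, here is a minimal sketch of the voting idea in Python, using the standard-library difflib to align each engine's output against the first one and then taking a per-character majority vote. The vote_ocr function and the sample "engine" outputs are purely hypothetical, and the sketch only handles same-length substitutions; insertions and deletions (the "out of line" streams ahi mentions) would need a proper multiple alignment, which is where the non-trivial part comes in.

[CODE]
# Minimal sketch: character-level majority voting over several OCR outputs.
# Assumes the outputs differ only by same-length substitutions; real texts
# would need a proper multiple-sequence alignment to handle insertions and
# deletions (the "stream out of line" case mentioned above).
from collections import Counter
from difflib import SequenceMatcher

def vote_ocr(outputs: list[str]) -> str:
    """Pick, for each position, the character most OCR engines agree on."""
    reference = outputs[0]
    # One ballot box per character of the first engine's output.
    votes = [[c] for c in reference]
    for other in outputs[1:]:
        matcher = SequenceMatcher(None, reference, other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only same-length spans can be voted on position by position.
            if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for offset in range(i2 - i1):
                    votes[i1 + offset].append(other[j1 + offset])
            # Insertions/deletions are skipped in this sketch.
    return "".join(Counter(v).most_common(1)[0][0] for v in votes)

if __name__ == "__main__":
    scans = ["The quick brown fax.",   # hypothetical engine A
             "The qu1ck brown fox.",   # hypothetical engine B
             "The quick brovn fox."]   # hypothetical engine C
    print(vote_ocr(scans))  # -> "The quick brown fox."
[/CODE]

With three or more engines a simple majority like this already cancels out many isolated misreads, but, as ahi said, anything all the engines get wrong in the same way passes straight through.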