MobileRead Forums - View Single Post - OCRopus

Bob Russell · 10-24-2007, 12:01 PM

Google officially released the alpha version of their open-source OCR software yesterday. ArsTechnica has more of the technical details and a hands-on review.

"Google's involvement in the project is motivated by the company's interest in digitizing printed documents. Open-source OCR technology could be valuable in many other contexts as well. Government agencies that want to digitize paper records, for instance, could one day benefit from OCRopus. Although OCRopus is weak in many areas, it has some real potential."

In terms of current quality, "OCRopus was able to provide readable output in about half of our tests." You can see more details in the ArsTechnica article, but it sounds like they have some work to do. I'm not sure whether the beta expected in Q1 2008 will address accuracy, but this is probably just the beginning.

Some of the tech tidbits shared:
* Built on HP's open-source Tesseract OCR engine
* Released under Apache License 2.0
* OpenFST library is used for language modeling
* Designed to be modular - to allow future support for non-Latin languages
* Developed in Lua
