MobileRead Forums - View Single Post

Nergal · 05-26-2008, 06:39 AM

For the initial question: I recommend having a look at Tesseract OCR. It is an open-source command-line tool with an amazing recognition rate (95-99.9%, mostly 98-99% for me). It was developed by HP back in the mid-90s and is now hosted at Google.

http://code.google.com/p/tesseract-ocr/

It is a bit rough to use, but English, German and several other languages are supported so far. I do not know whether the result quality differs between languages.

It has NO layout recognition: give it a simple grayscale TIFF image (no compression) and it'll spit out UTF-8 encoded plain text with line breaks.
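For anyone who wants to script this, here is a rough sketch of calling the command-line tool from Python. It assumes the `tesseract` binary is on your PATH and uses the usual `tesseract <image> <output base> -l <lang>` form; the function names are made up for this example and have nothing to do with my tool.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, lang="eng"):
    """Run tesseract on one page image and return the recognized text.

    Assumes an uncompressed grayscale TIFF as input, as described above.
    """
    image_path = Path(image_path)
    # tesseract appends ".txt" to the output base name by itself
    output_base = image_path.with_suffix("")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Something like `ocr_page("page_001.tif", lang="deu")` should then give you the German text of that page as one string, with no layout information at all.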

At the moment I am programming a little Python/Qt tool to create eBooks from my paperbacks, which runs quite well with my Epson USB flatbed scanner.

Have a look at the post in my blog (it's in German, but simply click on the download link in the post if you cannot understand the text). I have not had time yet to write a manual, and some options are still missing in the GUI. Have a look in the bookscan.py file at line 106: by default it will scan 2 pages from a book, so set the maxpages value to half the number of book pages you want to have. I recommend Reclam or Penguin books for the first tests, since no rotation is implemented yet.

With a preview the app is horribly slow, so if you really want to scan a whole book, have the dimensions of the area to be scanned at hand in mm. So far the two pages are simply separated from each other by saving the left and the right half of the scanned image into separate files.
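To make the page splitting concrete, here is a minimal sketch of the idea, not the actual code from bookscan.py. It assumes the Pillow library for the image handling, and all function names and the dpi value are just picked for the example.

```python
def mm_to_px(mm, dpi):
    """Convert a length in millimetres to pixels at the given scan resolution."""
    return round(mm / 25.4 * dpi)


def half_boxes(width, height):
    """Return the (left, upper, right, lower) crop boxes for both halves."""
    middle = width // 2
    return (0, 0, middle, height), (middle, 0, width, height)


def split_spread(scan_path, left_path, right_path):
    """Save the left and the right half of a two-page scan as separate files."""
    from PIL import Image  # assumption: Pillow is installed

    with Image.open(scan_path) as scan:
        left_box, right_box = half_boxes(*scan.size)
        scan.crop(left_box).save(left_path)
        scan.crop(right_box).save(right_path)
```

For example, a scan area 210 mm wide at 300 dpi comes out as `mm_to_px(210, 300)` pixels, and `split_spread("scan_001.tif", "page_001.tif", "page_002.tif")` would cut it down the middle. This only works well if the book spine sits exactly in the centre of the scanned area.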

Huh ... send me an email (nergal[at]monasteriaobscura[dot]de) if anything is weird.

The version is, well, something below 0.1a.

Cheers,
Nergal
