MobileRead Forums - View Single Post

Nergal · 05-26-2008, 04:17 PM

Paragraph detection is tricky with tesseract but (!) not completely hopeless: if the paragraphs are separated by a blank line, the gap might be detected and would then be parsable as two linebreaks. Though if I understand you correctly, you do not want something to tinker with, but a solution that actually solves the task.
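To illustrate the "two linebreaks" idea, here is a minimal sketch in Python. It assumes the OCR output is already available as a plain text string and that paragraph gaps really did come through as blank lines, which will not always be the case. The function name `split_paragraphs` is made up for this example and is not part of tesseract.

```python
import re

def split_paragraphs(ocr_text: str) -> list[str]:
    """Split plain OCR text into paragraphs at blank lines.

    Assumption: a paragraph boundary shows up as two (or more)
    line breaks, possibly with stray whitespace in between.
    """
    # Normalise Windows/Mac line endings to "\n"
    text = ocr_text.replace("\r\n", "\n").replace("\r", "\n")
    # A line break, optional whitespace, another line break = boundary
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Re-join the hard-wrapped lines of one paragraph with spaces
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs

sample = "First line of para one\nsecond line.\n\nPara two starts here\nand ends here.\n"
for p in split_paragraphs(sample):
    print(p)
```

Words hyphenated at a line end are not rejoined here, so that would still need some tinkering.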

I just ran a test with a book page I scanned today at 300 dpi (Caesar - Civil War, German): 2124 characters, of which only 2, standing next to each other, were misread.
So that is a rate of 99.91% (better than my typing).
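For anyone who wants to repeat the calculation on their own scans, the rate is just the share of correctly read characters. A small sketch, where `recognition_rate` is a name chosen for this example and the character counts are assumed to have been made by hand:

```python
def recognition_rate(total_chars: int, misread_chars: int) -> float:
    """Return the percentage of characters that were read correctly."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    if not 0 <= misread_chars <= total_chars:
        raise ValueError("misread_chars must be between 0 and total_chars")
    return 100.0 * (total_chars - misread_chars) / total_chars

# Figures from the test page above: 2124 characters, 2 misread
rate = recognition_rate(2124, 2)
print(f"{rate:.2f}%")  # prints 99.91%
```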

Recently I had the chance to see ReadIris (bundled for free with an HP all-in-one device). Its layout detection was really horrible: very distinct columns were overlooked, and there were a lot of simple misreadings.

Some years ago I tried the ABBYY FineReader 8.0 trial and must admit I was a bit disappointed by the automagical layout detection; it needed quite a lot of manual editing. Hopefully this works better by now.

IIRC they offer an educational discount ... If I hadn't been a Linux addict at the time, I surely would have bought it because of its fantastic recognition rate, except for text set in italics.

A bit OT: their language support, on the other hand, was/is awesome. I scanned a Russian article, ran it through babelfish and got at least a vague idea of what the author had written, without knowing much more Russian than 'spassibo' myself.

Good Luck!

Nergal
