04-02-2010, 12:42 PM   #3
71117c
Junior Member
Posts: 5
Karma: 10
Join Date: Mar 2010
Device: PB 301
Quote:
Originally Posted by frabjous
Out of curiosity, what OCR program were you using?

I've been meaning to try out various OCR options for linux (Tesseract, Cuneiform, Ocropus, etc.), but just haven't found time for it. If I get around to it, I'd be happy to swap notes.

i've been using cuneiform, since tesseract only produces plain text and i couldn't get ocropus to work (i had problems compiling it on archlinux).

my workflow was as follows:
[.) convert the pdf to tiff with ghostscript]
.) run cuneiform on the tiffs -> html output, one file per tiff.
.) merge the html files into a single xhtml file.
.) use vim to remove page numbers (if any), remove hyphenations, ...
.) finally, paste the file into sigil to insert a cover, add chapters, ...
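
a rough python sketch of the steps above, in case someone wants to script them. the cleanup regexes are only a stand-in for the manual vim edits, not the commands actually used. assumptions: the `gs` and `cuneiform` flags are written from memory of their help output, so check `gs -h` and `cuneiform --help` before relying on them; the function names, the file naming, and the idea that a page number sits alone in its own `<p>` are made up for illustration.

```python
import re
import subprocess
from html import escape
from pathlib import Path


def pdf_to_tiffs(pdf, out_dir, dpi=300):
    """Optional step: render each pdf page to a grayscale tiff with ghostscript."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffgray", f"-r{dpi}",
         f"-sOutputFile={out_dir / 'page-%03d.tif'}", str(pdf)],
        check=True,
    )


def ocr_page(tiff):
    """Run cuneiform on one tiff and return the path of its html output."""
    tiff = Path(tiff)
    out = tiff.with_suffix(".html")
    # Assumed invocation: -f selects the output format, -o the output file.
    subprocess.run(["cuneiform", "-f", "html", "-o", str(out), str(tiff)],
                   check=True)
    return out


BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def body_of(page_html):
    """Return the contents of <body>, or the whole text if there is none."""
    match = BODY_RE.search(page_html)
    return (match.group(1) if match else page_html).strip()


def merge_pages(pages, title="book"):
    """Concatenate the bodies of the per-page html files into one xhtml file."""
    bodies = "\n".join(body_of(page) for page in pages)
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>\n{bodies}\n</body>\n</html>\n"
    )


# Assumption: a page number is a line holding only <p>digits</p>.
PAGE_NUMBER_RE = re.compile(r"^[ \t]*<p>[ \t]*\d+[ \t]*</p>[ \t]*$\n?",
                            re.MULTILINE)
# A word split by a hyphen at the end of a line.
HYPHEN_RE = re.compile(r"(\w)-\s*\n\s*(\w)")


def clean(text):
    """Drop stand-alone page numbers, then rejoin words hyphenated at line ends."""
    text = PAGE_NUMBER_RE.sub("", text)
    return HYPHEN_RE.sub(r"\1\2", text)


def build(tiff_dir, out_file):
    """OCR every tiff in tiff_dir and write one merged, cleaned xhtml file."""
    html_files = [ocr_page(t) for t in sorted(Path(tiff_dir).glob("*.tif"))]
    pages = [f.read_text(encoding="utf-8", errors="replace") for f in html_files]
    Path(out_file).write_text(clean(merge_pages(pages)), encoding="utf-8")
```

the sigil step stays manual. the hyphen regex will also join real compounds that happen to break at a line end, and the merged file is only as well-formed as the html cuneiform produced, so the result still needs a read-through.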

cuneiform's character recognition is pretty good. sometimes italic characters weren't detected as such (so no <i> tags around them in the html output), and there is also the problem mentioned in my first post with paragraphs that span two pages (but this is due to the way cuneiform operates ...)
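
one possible way to patch up the page-boundary problem after the fact, as a hedged python sketch. it assumes the symptom is that a paragraph running over a page break comes out as two separate paragraphs, because each tiff is processed on its own. the heuristic (previous page does not end in sentence punctuation, next page starts with a lowercase letter) and the name `join_across_pages` are made up for illustration and will misfire on some texts.

```python
import re

# A paragraph "looks finished" if it ends in . ! ? or : (plus closing quotes/brackets).
SENTENCE_END = re.compile(r"""[.!?:]["')\]]*$""")


def join_across_pages(pages):
    """pages: one list of paragraph strings per page, in reading order.

    Returns a flat list of paragraphs in which the first paragraph of a page
    is glued onto the last paragraph of the previous page when it looks like
    a continuation.
    """
    merged = []
    for paragraphs in pages:
        paragraphs = [p.strip() for p in paragraphs if p.strip()]
        for i, para in enumerate(paragraphs):
            continues = (
                i == 0
                and bool(merged)
                and not SENTENCE_END.search(merged[-1])
                and para[0].islower()
            )
            if not continues:
                merged.append(para)
            elif merged[-1].endswith("-"):
                # Word hyphenated across the page break: drop the hyphen.
                merged[-1] = merged[-1][:-1] + para
            else:
                merged[-1] += " " + para
    return merged
```

this works on plain paragraph text, so the `<p>` contents would have to be pulled out of the html first and wrapped again afterwards.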


Quote:
Originally Posted by frabjous
A lot of people around here rave about ABBYY finereader (Windows) for OCR, and I'd imagine it can handle both these desiderata, though as an open source enthusiast, I always like to see what other options are available before moving to commercial tools.
there is even a command-line tool from abbyy for linux (http://www.ocr4linux.com/). a trial version can be downloaded, but i haven't tried it yet. i also prefer open source ...