MobileRead Forums - View Single Post - PDF Image -> OCR -> text

elegant · 07-06-2009, 06:35 PM

PDF -> OCR -> Text.

This seems like the best option.

I have a lot of major problems with standard text extraction from PDF followed by BookDesigner. For example, italics are omitted, and there is no flexibility in how to format paragraphs or get rid of line break problems.
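To illustrate the line break problem, here is a minimal Python sketch of the kind of cleanup involved. It is not part of BookDesigner or any OCR package. The function name `unwrap_paragraphs` is made up for illustration, and the rules are assumptions: a blank line marks a paragraph break, and a trailing hyphen followed by a lowercase letter is a word split across lines.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into paragraphs (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        lines = [ln.strip() for ln in para.split("\n") if ln.strip()]
        joined = ""
        for ln in lines:
            if joined.endswith("-") and ln[:1].islower():
                # Assumption: treat this as a word hyphenated across lines.
                # Genuine compounds broken at the hyphen get merged too.
                joined = joined[:-1] + ln
            elif joined:
                joined += " " + ln
            else:
                joined = ln
        cleaned.append(joined)
    return "\n\n".join(cleaned)


sample = (
    "This is a line that was bro-\n"
    "ken by the PDF layout and\n"
    "continues here.\n"
    "\n"
    "Second paragraph\n"
    "starts here."
)
print(unwrap_paragraphs(sample))
```

Text from real PDFs often has no blank lines between paragraphs, so the paragraph rule would probably need adjusting per book, for example by looking at indentation or short lines instead.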

The question is how to find good OCR software, and a good intermediary program or format in which to reformat the text.

Ideally I think I'd find it easiest to use OOo or Word for formatting, then convert the DOC or HTML to LRF.

Unfortunately, the BookDesigner way is the only one with a reasonable amount of documentation, and I don't find it flexible or capable of producing particularly elegant results.

07-06-2009, 06:35 PM	#7
elegant Member Posts: 12 Karma: 10 Join Date: Apr 2009 Device: sony reader