Quote:
Originally Posted by pepak
Personally, I think that the human way (you read it and decide if a paragraph break should or shouldn't be there) is the easiest. Most of the time, anyway. You can't avoid proofreading the OCRed text, so you may as well do the paragraph thing at the same time.
|
I agree. You have to read the book anyway. But detecting paragraph breaks at page breaks is fairly quick: check the beginning of every page and decide whether each page that starts with an uppercase letter really begins a new paragraph. The OCR software will probably treat all of them the same way, so you only have to look for the ones it got wrong.
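If you want to speed up that check, here is a rough Python sketch of the idea. It assumes you already have the OCR output as one string per page; the function name `pages_to_review` and the 40-character context width are made up for illustration. It only lists the pages whose first letter is uppercase, together with the end of the previous page, so you still make the decision yourself.

```python
def pages_to_review(pages, context=40):
    """List page boundaries that may or may not be paragraph breaks.

    `pages` is assumed to be a list of strings, one per OCRed page.
    Returns (page_number, end_of_previous_page, start_of_page) tuples
    for every page whose first letter is uppercase.
    """
    candidates = []
    for index in range(1, len(pages)):
        head = pages[index].lstrip()
        tail = pages[index - 1].rstrip()
        # Skip leading quotes, dashes, digits etc. and look at the first letter.
        first_letter = next((ch for ch in head if ch.isalpha()), "")
        if first_letter.isupper():
            # Page numbers are 1-based, so list index 1 is page 2.
            candidates.append((index + 1, tail[-context:], head[:context]))
    return candidates


pages = [
    "It was late. The rain had",
    "stopped by the time he left.",
    "Morning came quickly.",
]

for number, tail, head in pages_to_review(pages):
    print(f"page {number}: ...{tail} | {head}...")
```

Pages that start with a lowercase letter are skipped as obvious continuations. That is a shortcut on my part and will miss the odd OCR error, so it doesn't replace the proofreading.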