04-02-2010, 02:17 PM   #4
pepak
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
Quote:
Originally Posted by 71117c
to get an idea how tedious this process would be, i converted books, that i have in pdf format on my pc, to tiff and then ran some linux ocr software on them.
i encountered two issues, which would slow down the conversion process tremendously, if i can't solve them:
And here you are at the core of the matter: scanning a book is easy enough, and even a non-destructive scan goes quite quickly. But getting usable text out of the scanned images is a lot of work, and not only because of the two issues you mention.

Quote:
1) detection of italic fonts.
That must be a problem with your OCR software. The FineReader family of OCR programs handles italics quite well (the error rate on italics is higher than on regular fonts, but the results are still good enough).

Quote:
2) detection of paragraphs: the ocr software i was using, detected paragraphs within a page fine. but since it operated on a single page, it couldn't recognise, if the last sentence on a page, that ended with a period there, was also the end of a paragraph or not.
I don't know of any software that handles the situation you describe well. I consider splitting and rejoining paragraphs a necessary part of the proofing process.

Quote:
i would need something like: last sentence on a page ends with a period. -> check if first sentence on the next page is indented. if so -> new paragraph.
Personally, I think the human way (you read it and decide whether a paragraph break should or shouldn't be there) is the easiest, most of the time at least. You can't avoid proofreading the OCRed text anyway, so you may as well fix the paragraphs at the same time.
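
That said, the rule from the quote is simple enough to sketch in a few lines of Python, if only to reduce the number of page breaks that need a manual decision. This is a rough sketch, not a tested tool: it assumes the OCR output has already been split into pages, that each page is a list of paragraph strings, and that an indented paragraph keeps its leading whitespace. The function names are made up for illustration.

```python
import re

# A sentence end: ., ! or ?, optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r'[.!?]["\')\]]*$')


def starts_new_paragraph(prev_text, next_text):
    """Guess whether a page break is also a paragraph break.

    Rule from the quote: the previous page ends with a finished sentence
    AND the first block on the next page is indented.
    """
    ends_sentence = bool(SENTENCE_END.search(prev_text.rstrip()))
    is_indented = next_text[:1] in (" ", "\t")
    return ends_sentence and is_indented


def join_pages(pages):
    """Merge per-page paragraph lists into one list of paragraphs.

    Assumption: `pages` is a list of pages, each page a list of paragraph
    strings with the original leading whitespace preserved.
    """
    paragraphs = []
    for page in pages:
        first_on_page = True
        for para in page:
            if not para.strip():
                continue  # ignore empty blocks
            if (first_on_page and paragraphs
                    and not starts_new_paragraph(paragraphs[-1], para)):
                # Continuation of the paragraph from the previous page.
                paragraphs[-1] += " " + para.strip()
            else:
                paragraphs.append(para.strip())
            first_on_page = False
    return paragraphs


pages = [
    ["  First paragraph.", "  Second starts here and"],
    ["continues on page two.", "  Third paragraph ends."],
    ["  Fourth is indented."],
]
for p in join_pages(pages):
    print(p)
```

The heuristic will still guess wrong in places, for example in books that do not indent the first paragraph after a heading, or where a word is hyphenated across the page break, so it does not replace the proofreading pass.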