02-12-2019, 02:34 AM   #18
Ghitulescu
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
Quote:
Originally Posted by DNSB
I have one commercially produced scan of a book from the 1870's*
...
The text plane is useful for searching but when I take a close look at it, it has a multitude of OCR errors which make it painful to read.
Yes, it's usually this way. Old books used heavily serifed, decorative fonts that are rather difficult for simple or cheap software to OCR. Yes, it's annoying to correct every "m" that was read as "r n" or "i n", every "h" that was read as "li", and so on. But that text exists.
Yet, I would really like to know why calibre does not see that text, or doesn't want to use it.
In my case it's a PhD thesis that was typewritten, and the text (a sort of Courier) is a piece of cake to OCR (and it was OCRed during the scanning).

Maybe this was not clear: my concern is not that the file is large (such PDFs have to be, because they also contain images, or consist of images only), but the PDF->EPUB conversion. I did not want to open a new thread for a problem that was "solved" in this manner: "DO NOT use PDFs!"
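
For the OCR confusions mentioned above ("m" read as "rn" or "in", "h" read as "li"), here is a minimal Python sketch of how such fixes could be semi-automated. It is only an illustration: the confusion table, the function names and the tiny vocabulary are my own assumptions, not anything calibre does, and a real run would need a proper word list and a manual check of the results.

```python
import re

# Hypothetical confusion table: (what the OCR produced, what was probably printed).
CONFUSIONS = [("rn", "m"), ("in", "m"), ("li", "h")]


def candidates(word):
    """Yield variants of `word` with one confusion undone at one position."""
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def fix_word(word, vocabulary):
    """Return a corrected word only if the correction is a known word."""
    if word.lower() in vocabulary:
        return word  # already a known word, leave it alone
    for cand in candidates(word):
        if cand.lower() in vocabulary:
            return cand
    return word  # no safe correction found, keep the OCR output


def fix_text(text, vocabulary):
    """Apply fix_word to every run of ASCII letters in `text`."""
    return re.sub(r"[A-Za-z]+", lambda m: fix_word(m.group(0), vocabulary), text)


# Toy vocabulary, for illustration only.
VOCAB = {"the", "modern", "chapter", "time", "line"}
print(fix_text("Tlie rnodern cliapter", VOCAB))
```

Because a word is only changed when the result is in the vocabulary, correct words that happen to contain "li" or "rn" (such as "line" here) are left untouched.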