Quote:
Originally Posted by DNSB
I have one commercially produced scan of a book from the 1870's*
...
The text plane is useful for searching but when I take a close look at it, it has a multitude of OCR errors which make it painful to read.
|
Yes, it's usually this way. Old books used very "serifed"/decorative fonts that are rather difficult for simple/cheap software to OCR. Yes, it's annoying to have to fix every "m" that came out as "r n" or "i n", every "h" that came out as "li", and so on. But that text exists.
Yet, I would really like to know why calibre does not see that text, or doesn't want to use it.
In my case it's a PhD thesis that was typewritten, and the text (sort of Courier) is a piece of cake to OCR (and it was OCRed during the scanning).
Maybe this was not clear: my problem is not that the PDF is large (such files have to be, because they also contain images, or only images), but the PDF->EPUB conversion. I did not want to open a new thread for a problem that was "solved" in this manner: "DO NOT use PDFs!"
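
As a side note on the OCR clean-up mentioned above: the "rn"/"in" for "m" and "li" for "h" mix-ups can be partly fixed by script once the text layer is extracted. This is only a rough sketch, not something calibre does. The names `CONFUSIONS`, `fix_word` and `fix_text` are made up for illustration, the word list is an assumption (you would load a real dictionary), and it treats the confusions as adjacent letters ("rn") rather than separated by a space.

```python
import re

# OCR output -> what was probably printed (pairs taken from the post above).
CONFUSIONS = [("rn", "m"), ("in", "m"), ("li", "h")]

def candidates(word):
    """Every variant of `word` with exactly one confusion reversed at one position."""
    out = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            out.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return out

def fix_word(word, known_words):
    """Leave known words alone; accept a correction only if it is unambiguous."""
    lower = word.lower()
    if lower in known_words:
        return word
    hits = sorted(c for c in candidates(lower) if c in known_words)
    if len(hits) != 1:
        return word  # nothing found, or more than one plausible fix: do not guess
    fixed = hits[0]
    # Keep a leading capital; any other capitalisation is lost in this sketch.
    return fixed.capitalize() if word[0].isupper() else fixed

def fix_text(text, known_words):
    """Apply fix_word to every run of ASCII letters, leaving everything else as is."""
    return re.sub(r"[A-Za-z]+", lambda m: fix_word(m.group(0), known_words), text)

# Tiny hypothetical word list, just for the demonstration.
KNOWN = {"the", "modern", "home", "in", "time", "summer"}
print(fix_text("Tlie rnodern horne in time", KNOWN))
```

Checking against a word list is what keeps real words like "in" from being mangled. A word with two errors at once (say "surnrner") is left untouched, because only one substitution per word is tried.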