Old 12-22-2010, 04:26 AM   #12
Sunlite
Addict
Posts: 206
Karma: 547516
Join Date: Mar 2008
Location: Berlin, Germany
Device: Kobo Clara, Kobo Aura, PRS-T1, PB602, CyBook Gen3
OCR (Optical Character Recognition) is a method to turn text on scanned images into actual text.

The OCR software tries to match the shape of each letter seen on the image to an actual letter. Depending on the quality of the scan and the font used in the original book, this can work well or quite badly. For example, the letters "h" and "b" are often mixed up, as are some other letter combinations.
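
To make the "h" versus "b" mix-up concrete, here is a toy sketch of shape matching. It is not how real OCR software works internally; the tiny bitmaps, the `TEMPLATES` table and the `recognize` function are all made up for illustration. It only shows that a small smudge can push a scanned shape closer to the wrong letter.

```python
# Toy 5x4 bitmaps; '#' is ink, '.' is blank paper.
# These glyph shapes are invented for illustration only.
TEMPLATES = {
    "h": ["#...",
          "#...",
          "###.",
          "#..#",
          "#..#"],
    "b": ["#...",
          "#...",
          "###.",
          "#..#",
          "###."],
    "o": ["....",
          "....",
          ".##.",
          "#..#",
          ".##."],
}


def distance(a, b):
    """Count the pixels in which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph):
    """Return the template letter with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda letter: distance(glyph, TEMPLATES[letter]))


# A scanned "h" whose bottom row is smudged into a solid line.
smudged_h = ["#...",
             "#...",
             "###.",
             "#..#",
             "####"]

print(recognize(TEMPLATES["h"]))  # a clean scan matches "h"
print(recognize(smudged_h))       # the smudged scan is closer to "b"
```

The smudged shape differs from the "h" template in two pixels but from the "b" template in only one, so this naive matcher picks "b".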

The process of character recognition is rather complicated. That is why good OCR software is often very pricey and why Calibre does not provide it.

As far as I understand the PDF conversion in Calibre, it first tries to decide whether the PDF is text-based or image-based. If it encounters an image-based PDF, it outputs the images. If it encounters a text-based PDF, it tries its best to convert the text into good text-based output. In the process, any images still in the text-based PDF get lost.

In your case, I think you have a mainly image-based PDF that contains some text, probably at the beginning. Calibre encounters that text, decides that the PDF is text-based, and produces an output of the available text. It cannot know that the images are the actual important content, nor could it convert them into text if it did.
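
The behaviour described above can be sketched roughly like this. This is only my guess at the logic expressed as code, not Calibre's actual implementation; the `Page` class and the `classify_pdf` and `convert` functions are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str         # text found in the page's text layer (may be empty)
    image_count: int  # number of embedded images on the page


def classify_pdf(pages):
    """Naive rule (an assumption): any extractable text makes the PDF 'text'."""
    if any(page.text.strip() for page in pages):
        return "text"
    return "image"


def convert(pages):
    """Text-based: keep only the text. Image-based: output the images."""
    if classify_pdf(pages) == "text":
        # Images on the pages are silently dropped here.
        return [page.text for page in pages if page.text.strip()]
    return [
        f"<image from page {number}>"
        for number, page in enumerate(pages, start=1)
        for _ in range(page.image_count)
    ]


# A scanned book: one typed title page followed by three scanned pages.
scanned_book = [Page("Title page", 0)] + [Page("", 1) for _ in range(3)]

print(classify_pdf(scanned_book))  # text
print(convert(scanned_book))       # ['Title page']
```

With a rule like this, a single page of real text at the front is enough for the whole file to be treated as text-based, and the three scanned pages never make it into the output.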

I hope this explanation is understandable, but if you or anyone else has further questions, I or someone else on this board will try to answer them. We just need to know what those questions are.