08-29-2012, 02:38 PM   #14
HarryT
eBook Enthusiast
Posts: 85,557
Karma: 93980341
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by geekmaster
Text searching in a PDF reader does not do OCR. It searches the embedded text in the PDF file. Of course that fails for "image only" PDF files (such as some zero-day pirate ebooks). PDF files can contain embedded fonts in which to render the embedded text.

The "digital representation" of the printed page is generally ASCII or one of its variants, such as Unicode. These representations of the printed page are called digital "text".

So a PDF is really just a "large format" ebook (especially when it REALLY IS the electronic version of a printed book distributed in PDF format), which can be viewed on a "large format" viewing device such as a host PC display monitor, or on a printed page (where it is no longer a PDF file).
We are talking at cross-purposes. A searchable PDF file does indeed, as you say, contain a hidden text "layer" which is used for searching. But that text layer is not what you "see" on the screen. The visual image does not contain text; it's a series of PostScript-style drawing instructions of the "draw this shape at these coordinates" variety. This has nothing to do with "image only" PDFs; it's true for all PDFs. A conversion program like Calibre will not use the searchable text layer; it will attempt to assemble the text from the PDF drawing instructions but, since these can be in a pretty random order, the results can be less than optimal. That's why I say that an OCR program is often the best way to create a convertible file from a PDF, especially if the layout of the PDF is complex.
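
To illustrate the ordering problem, here is a rough Python sketch. It is not how Calibre or any real converter works; the fragment format, the reading_order function and the line_tolerance value are all made up for the example. It assumes each drawing instruction has already been reduced to an (x, y, text) triple, with y increasing up the page as in PDF coordinates.

```python
from typing import List, Tuple

# Hypothetical fragment: (x, y, text). y grows upward, as in PDF user space.
Fragment = Tuple[float, float, str]


def reading_order(fragments: List[Fragment], line_tolerance: float = 2.0) -> str:
    """Group fragments into lines by y position, then sort each line by x."""
    lines: List[List[Fragment]] = []
    # Visit fragments from the top of the page down (largest y first).
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Same line if y is within tolerance of the line's first fragment.
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return "\n".join(
        " ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
        for line in lines
    )


# Fragments in the order an imaginary PDF happens to draw them.
drawn = [
    (200.0, 700.0, "format."),
    (72.0, 686.0, "It describes"),
    (72.0, 700.0, "PDF is a page"),
    (150.0, 686.0, "ink on paper."),
    (150.0, 700.0, "description"),
]

# Text taken in drawing order, which is scrambled.
naive = " ".join(text for _, _, text in drawn)

# Text after sorting by position on the page.
print(reading_order(drawn))
```

Sorting by position rescues a simple single-column page like this one, but the same heuristic would interleave the columns of a two-column layout, and it knows nothing about headers, footnotes or sidebars. That is the sort of case where the results become less than optimal.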

Certainly some PDFs will convert reasonably well, but the fact remains that PDF was not designed to be an eBook format and, generally speaking, is the worst possible choice of format if you want to convert to something reflowable.

To read PDFs well you ideally need a device which has a screen equal to (or close to) the size of the page that the PDF is formatted for. The iPad is a superb PDF reader.