MobileRead Forums - View Single Post

j.p.s · 09-08-2018, 05:51 PM

Quote:

Originally Posted by DuckieTigger

No, they are not either/or. Even a PDF that contains text has full-page images. You simply create them by printing the PDF to an individual image for each page. OCR on those images has a better chance of succeeding than relying on possibly horribly garbled text inside the file, which won't tell you where the header is, for example.

PDF is based on the PostScript page description language.

PDF documents with text that can be copied and pasted have an added text layer that is not used to render or print a page. The text on any given page in a PDF document might be rendered on the spot or might be part of a pixel-based image. The text layer may be generated from the source text or from OCR of a pixel-based image. Lots of strange errors that do not appear in the rendered page are evidence that the text layer is OCR-based. I don't know whether any application uses the location information in the text layer for anything other than to enable highlighting, copying, and pasting. It would be neat if a PDF-to-text application could use the location information as formatting hints and not just extract the raw text.
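To make the formatting-hints idea concrete, here is a rough sketch in Python. It assumes some other tool has already pulled words and their positions out of the text layer as (x, y, text) tuples in points, with y growing down the page. The function name layout_text, the tolerance, and the average character width are all my own assumptions for illustration, not anything a real PDF library provides.

```python
def layout_text(words, line_tol=3.0, char_width=6.0):
    """Rebuild a rough plain-text layout from positioned words.

    words: iterable of (x, y, text) tuples (hypothetical input format),
           x = left edge, y = baseline, both in points, y growing downward.
    line_tol: words whose y differs by at most this much share a line.
    char_width: assumed average glyph width, used to map x to a column.
    """
    words = list(words)
    if not words:
        return ""

    # Treat the leftmost word on the page as column 0.
    left_margin = min(x for x, _, _ in words)

    # Group words into lines by their vertical position.
    lines = []  # each entry: (y of first word, [(x, text), ...])
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(y - lines[-1][0]) <= line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))

    # Place each word at the column its x position suggests.
    out = []
    for _, items in lines:
        items.sort()
        row = ""
        for x, text in items:
            col = int(round((x - left_margin) / char_width))
            pad = col - len(row)
            # Keep at least one space between words on the same line.
            pad = max(1, pad) if row else max(0, pad)
            row += " " * pad + text
        out.append(row)
    return "\n".join(out)


sample = [
    (72, 100, "Chapter"), (120, 100.5, "One"),
    (90, 120, "Indented"), (144, 120, "line"),
    (72, 140, "Page"), (300, 140, "42"),
]
print(layout_text(sample))
```

With the sample data, the indented word keeps its indent and the page number stays pushed out to the right instead of being glued to the word before it. Real pages would need more care (proportional fonts, columns, rotated text), so treat this only as a starting point.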

There is no requirement that a PDF document have a text layer, any pixel images at all, or even a single text character, and it can have any mixture of them.

Pixel images in a PDF can usually be extracted and might be JPEG, JPEG 2000, PNG, TIFF, or additional image types. Some images in PDF documents are vector-based and can be rendered quite large with high quality and might require very little storage space.
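If you end up with a pile of extracted image files and want to know what each one is, the first few bytes usually tell you. The sketch below checks the well-known file signatures for the formats mentioned above. The function name sniff_image_type is my own, and it only looks at already extracted bytes; it does not read a PDF itself.

```python
def sniff_image_type(data: bytes) -> str:
    """Guess an image format from its leading magic bytes."""
    signatures = [
        (b"\xff\xd8\xff", "jpeg"),
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jpeg2000"),  # JP2 container
        (b"\xff\x4f\xff\x51", "jpeg2000"),                # raw codestream
        (b"II*\x00", "tiff"),                             # little-endian TIFF
        (b"MM\x00*", "tiff"),                             # big-endian TIFF
    ]
    for magic, name in signatures:
        if data.startswith(magic):
            return name
    return "unknown"


print(sniff_image_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
```

Anything that comes back as "unknown" is one of the additional image types, or raw pixel data that the extraction tool did not wrap in a standard file format.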