MobileRead Forums - View Single Post

DuckieTigger · 09-08-2018, 09:18 PM

Quote:

Originally Posted by j.p.s

PDF is based on the PostScript programming language.

PDF documents that have text that can be copied and pasted have an added text layer that is not used to render or print a page. The text on any given page in a PDF document might be rendered on the spot or might be part of a pixel-based image. The text layer may be generated from the source text or from OCR of a pixel-based image. Lots of strange errors that are not in the rendered page are evidence that the text layer is OCR-based. I don't know whether any application uses the location information in the text layer for anything other than to enable highlighting, copying, and pasting. It would be neat if a PDF-to-text application could use the location information as formatting hints and not just extract the raw text.

There is no requirement that a text layer be present, nor that a PDF document contain any pixel images at all or even a single text character, and it can have any mixture of them.

Pixel images in a PDF can usually be extracted and might be JPEG, JPEG2000, PNG, TIFF, or additional image types. Some images in PDF documents are vector-based and can be rendered quite large with high quality and might require very little storage space.
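
The "location information as formatting hints" idea from the quote could look roughly like the sketch below. It is only an illustration: the Word record, the layout_text function, and the line_tol and char_width parameters are made-up names, and it assumes some extractor has already handed you each word with an x position and a y position measured downward from the top of the page, in points.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical record for one word taken from a PDF text layer."""
    text: str
    x: float  # left edge in points
    y: float  # baseline in points, measured from the top of the page (assumption)


def layout_text(words, line_tol=2.0, char_width=6.0):
    """Rebuild plain text, using positions to recover lines and indentation.

    line_tol:   words whose baselines differ by at most this are on one line.
    char_width: assumed average glyph width, used to map x to a text column.
    """
    if not words:
        return ""
    left_margin = min(w.x for w in words)

    # Group words into lines by baseline, top to bottom.
    lines = []  # each entry is [baseline_y, [words on that line]]
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(w.y - lines[-1][0]) <= line_tol:
            lines[-1][1].append(w)
        else:
            lines.append([w.y, [w]])

    out = []
    for _, row in lines:
        row.sort(key=lambda w: w.x)
        text = ""
        for w in row:
            col = int(round((w.x - left_margin) / char_width))
            # Pad up to the target column, but keep at least one space
            # between neighbouring words on the same line.
            pad = max(col - len(text), 1 if text else 0)
            text += " " * pad + w.text
        out.append(text)
    return "\n".join(out)


if __name__ == "__main__":
    sample = [
        Word("text", 150, 114.5),
        Word("Chapter", 72, 100),
        Word("Indented", 96, 114),
        Word("1", 120, 100),
    ]
    print(layout_text(sample))
```

With a fixed char_width this only approximates indentation and column gaps; proportional fonts, rotated text, and multi-column pages would need more care.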

Aye, and I didn't say that there is an embedded image file for the full page. Just that they have it. What I meant is that all the information is inside. I even mentioned that to extract the full page images you can simply print the PDF file. Printing will render out a bitmap at the specified resolution which can be redirected and converted into the correct input format for the OCR software.
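
For what it's worth, "printing to a bitmap" can be done with Ghostscript's PNG output devices. The following is a rough sketch, not the only way to do it: the helper names ghostscript_render_command and render_pages are made up, it assumes the Ghostscript executable is called gs and is on the PATH (on Windows it is usually gswin64c), and 300 dpi greyscale is just a guess at what an OCR program would want.

```python
import shutil
import subprocess


def ghostscript_render_command(pdf_path, out_pattern="page-%03d.png",
                               dpi=300, gs="gs"):
    """Build a Ghostscript command that renders every page to a PNG file.

    out_pattern uses Ghostscript's %03d placeholder for the page number.
    """
    return [
        gs,
        "-dSAFER",            # restrict file access while interpreting
        "-dBATCH",            # exit after the last page
        "-dNOPAUSE",          # do not wait for a keypress between pages
        "-sDEVICE=pnggray",   # 8-bit greyscale PNG output device
        f"-r{dpi}",           # rendering resolution in dots per inch
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def render_pages(pdf_path, **kwargs):
    """Run Ghostscript; the resulting PNGs can then be fed to OCR software."""
    cmd = ghostscript_render_command(pdf_path, **kwargs)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("Ghostscript executable not found: " + cmd[0])
    subprocess.run(cmd, check=True)
```

Calling render_pages("book.pdf", dpi=400) would write page-001.png, page-002.png and so on into the current directory, whether the page content was text, vectors, or embedded images to begin with.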