Quote:
Originally Posted by Toxaris
That is why I usually end up with OCR and subsequent processing... Seen too many strange things with the text export...
Yep, can't trust any of these dang PDF creation programs.
Even with the same program, you don't know which settings people clicked. Did they generate this PDF in LibreOffice with "Tagged PDF" enabled? Did they export it from InDesign with the proper (accessibility) settings? What dang "PDF Printer" did they run it through in Word (and what were the settings)?
After generating the original PDF, did they run it through some crappy "PDF Editing" software to add a Cover/Title Page, or do something simple like ADD METADATA? (By the gods, those "Editing" programs absolutely mangle PDFs.)
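If you're curious which tools touched a given file, the PDF's Info dictionary sometimes gives it away through its /Creator and /Producer entries. Here's a rough, stdlib-only sketch of peeking at those. The function name `sniff_pdf_tools` and the sample bytes are made up for illustration, and it only works when the entries are stored uncompressed as plain literal strings; hex-encoded strings, nested unescaped parentheses, and compressed object streams are not handled, so a real PDF library is the safer bet for anything serious.

```python
import re

# Matches "/Producer (...)" or "/Creator (...)" written as a PDF literal string.
# Backslash-escaped characters such as "\(" and "\)" inside the string are tolerated.
_INFO_RE = re.compile(rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()])*)\)")


def sniff_pdf_tools(data: bytes) -> dict:
    """Return whatever /Creator and /Producer literal strings appear in the raw bytes.

    Hypothetical helper: this scans the bytes with a regex and does not parse the PDF.
    """
    found = {}
    for key, raw in _INFO_RE.findall(data):
        # Later matches overwrite earlier ones, so the value nearest the
        # end of the file is the one that is kept.
        found[key.decode("ascii")] = raw.decode("latin-1")
    return found


if __name__ == "__main__":
    # Made-up sample bytes standing in for the Info dictionary of a real file.
    sample = b"1 0 obj\n<< /Creator (ExampleWriter) /Producer (ExamplePDF 1.0) >>\nendobj\n"
    print(sniff_pdf_tools(sample))
```

On a real file you'd pass in `open(path, "rb").read()`. An empty result doesn't mean the file is clean, only that the entries weren't stored in a form this quick scan can see.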
Since this is a purely digital file, the text is quite crisp, so the OCR should be QUITE accurate, with few errors.
But enough pooh-poohing about how bad PDF is as an input format! Let's remain positive!