Quote:
Originally Posted by DuckieTigger
The mistake is the assumption that the PDF you want to convert comes with text inside, only because most do. The correct approach would be to always start with full-page images, run them through OCR, and afterwards extract the text from the PDF to improve the OCR results. That gives a universal script with the overall best results: as soon as a step fails, you are done with the best automatic result.
No assumptions are being made: PDFs are either one or the other. I don't disagree, though; you would have to do a two-stage run on the PDF to get the best automatic result. However, I've never seen a PDF that had both.
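
For what it's worth, here is a rough Python sketch of the staged run being discussed: render pages, OCR them, then try the embedded text layer. The function name `convert_pdf` and the three callables passed into it are hypothetical placeholders, not part of any real tool, and "prefer embedded text over OCR when a page has any" is just one simple merge policy I am assuming for illustration.

```python
from typing import Callable, List, Sequence


def convert_pdf(
    pdf_path: str,
    render_pages: Callable[[str], Sequence[object]],   # hypothetical: PDF -> page images
    ocr_page: Callable[[object], str],                 # hypothetical: image -> text
    extract_text: Callable[[str], Sequence[str]],      # hypothetical: PDF -> per-page text
) -> List[str]:
    """Return one string per page, keeping the best result of the stages that worked."""
    # Stage 1: render every page to a full-page image.
    try:
        images = list(render_pages(pdf_path))
    except Exception:
        images = []  # nothing to OCR; later stages may still succeed

    # Stage 2: OCR each image. A failed page simply stays empty.
    ocr_pages: List[str] = []
    for image in images:
        try:
            ocr_pages.append(ocr_page(image))
        except Exception:
            ocr_pages.append("")

    # Stage 3: try the embedded text layer. Scanned PDFs may have none.
    try:
        embedded = list(extract_text(pdf_path))
    except Exception:
        embedded = []

    # Merge (assumed policy): use embedded text where a page has some,
    # otherwise fall back to the OCR result for that page.
    merged: List[str] = []
    for i in range(max(len(ocr_pages), len(embedded))):
        ocr = ocr_pages[i] if i < len(ocr_pages) else ""
        emb = embedded[i] if i < len(embedded) else ""
        merged.append(emb if emb.strip() else ocr)
    return merged
```

In practice the three callables could be thin wrappers around whatever tools you already use (for example `pdf2image.convert_from_path`, `pytesseract.image_to_string`, and pypdf's `extract_text`), but that wiring is left out here. Passing them in as arguments keeps each stage independent, so a failure in one does not throw away what the others produced.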