09-07-2018, 07:00 PM   #73
DuckieTigger
Quote:
Originally Posted by sealbeater
Do you have much experience extracting text from PDFs? I have found OCR to be not as good as extracting the text directly. As I already stated, most PDFs come in two flavors: images of text and the actual text itself. The actual text is only as good as the PDF source. Going further, extracting to XML has so far yielded the best results when it comes to preserving layout, but I haven't played much with converting to PostScript... yet.
The mistake is assuming that the PDF you want to convert comes with text inside just because most do. The correct approach is to always start with full-page images, run OCR on them, and only afterwards extract the text from the PDF to improve the OCR results. That gives you a universal script with the best overall results: as soon as a step fails, you are left with the best automatic result so far.
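To make the order of steps concrete, here is a rough Python sketch of such a script. It assumes the pdftoppm and pdftotext tools from poppler-utils and the tesseract CLI are installed. The names run_pipeline, render_pages, ocr_pages and refine_with_embedded_text are made up for illustration, and the "improve" step is simplified to "prefer the embedded text when there is any", which is my assumption rather than a worked-out merge of the two results.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional

State = Dict[str, object]
Step = Callable[[str, State], State]


def run_pipeline(pdf_path: str, steps: List[Step]) -> Optional[str]:
    """Run steps in order; on the first failure, return the best text so far."""
    state: State = {}
    for step in steps:
        try:
            # Each step gets a copy, so a failing step cannot corrupt the result.
            state = step(pdf_path, dict(state))
        except Exception:
            break  # a step failed: what we have is the best automatic result
    text = state.get("text")
    return text if isinstance(text, str) else None


def render_pages(pdf_path: str, state: State) -> State:
    """Step 1: render every page to a 300 dpi PNG with pdftoppm."""
    # The temporary directory is left on disk; clean it up in real use.
    out_dir = Path(tempfile.mkdtemp(prefix="pdfpages_"))
    subprocess.run(
        ["pdftoppm", "-r", "300", "-png", pdf_path, str(out_dir / "page")],
        check=True,
    )
    images = sorted(out_dir.glob("page*.png"))
    if not images:
        raise RuntimeError("no page images were rendered")
    state["images"] = images
    return state


def ocr_pages(pdf_path: str, state: State) -> State:
    """Step 2: OCR each page image with tesseract, writing text to stdout."""
    pages = []
    for image in state["images"]:
        result = subprocess.run(
            ["tesseract", str(image), "stdout"],
            check=True, capture_output=True, text=True,
        )
        pages.append(result.stdout)
    state["text"] = "\n".join(pages)
    return state


def refine_with_embedded_text(pdf_path: str, state: State) -> State:
    """Step 3 (simplified assumption): prefer embedded text if the PDF has any."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    if not result.stdout.strip():
        raise ValueError("no embedded text layer")  # keep the OCR result
    state["text"] = result.stdout
    return state
```

You would call it as run_pipeline("book.pdf", [render_pages, ocr_pages, refine_with_embedded_text]), where book.pdf is a placeholder. For an image-only PDF the last step fails and you still get the OCR text back.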