MobileRead Forums - View Single Post

DuckieTigger · 09-07-2018, 07:00 PM

Quote:

Originally Posted by sealbeater

Have you much experience extracting text from PDFs? I have found OCR to be not as good as extracting text. As I already stated, most PDFs come in two flavors: images of text and the actual text itself. The actual text is only as good as the PDF source. Going further, extracting to XML has so far yielded the best results when it comes to preserving layout, but I haven't played much with converting to PostScript... yet.

The mistake is the assumption that the PDF you want to convert comes with text inside, just because most do. The correct approach would be to always start with full-page images, then run them through OCR, and only afterwards extract the text from the PDF to improve the OCR results. That gives you a universal script with the best overall results: as soon as a step fails, you stop, and what you have at that point is the best automatic result.
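
To make the order of steps concrete, here is a rough Python sketch of that kind of universal script. It is only an illustration under assumptions: `render_pages`, `ocr_page` and `extract_text` are hypothetical callables you would back with whatever renderer, OCR engine and text extractor you actually use, and the merge rule (prefer the embedded text of a page when it is non-empty, otherwise keep the OCR text) is just one simple way to "improve the OCR results", not the only one.

```python
from typing import Callable, List, Optional


def best_automatic_text(
    pdf_path: str,
    render_pages: Callable[[str], List[bytes]],   # hypothetical: PDF -> one image per page
    ocr_page: Callable[[bytes], str],             # hypothetical: page image -> OCR text
    extract_text: Callable[[str], List[str]],     # hypothetical: PDF -> embedded text per page
) -> Optional[List[str]]:
    """Run the steps in order and stop at the first one that fails."""
    # Step 1: always start from full-page images.
    try:
        images = render_pages(pdf_path)
    except Exception:
        return None  # nothing to work with

    # Step 2: OCR every page image.
    try:
        ocr_pages = [ocr_page(image) for image in images]
    except Exception:
        return None  # images exist, but no text could be produced

    # Step 3: try the embedded text layer to improve on the OCR output.
    try:
        embedded_pages = extract_text(pdf_path)
    except Exception:
        return ocr_pages  # no usable text layer: OCR is the best result

    # Assumption: if the page counts disagree, the text layer is not trusted.
    if len(embedded_pages) != len(ocr_pages):
        return ocr_pages

    # Assumed merge rule: prefer non-empty embedded text, else keep OCR text.
    return [
        embedded if embedded.strip() else ocr
        for embedded, ocr in zip(embedded_pages, ocr_pages)
    ]
```

With an image-only PDF, step 3 raises or returns blank pages and the function simply hands back the OCR text. With a PDF that has a proper text layer, the embedded text wins page by page.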