Quote:
Originally Posted by HarryT
Generally a lot better than those which attempt to extract text from the PDF itself. Of course no OCR is perfect, and a proofing/editing run through the converted file is essential.
Have you much experience extracting text from PDFs? I have found OCR to not be as good as extracting the text. As I already stated, most PDFs come in two flavors: images of text, and the actual text itself. Extracted text is as good as the PDF source is. Going further, extracting to XML has so far yielded the best results when it comes to preserving layout, but I haven't played much with converting to PostScript... yet.
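For anyone wanting to tell the two flavors apart before choosing between OCR and extraction, here is a rough sketch. It assumes the text was already pulled out by a tool such as pdftotext, which ends each page with a form feed character. The function name classify_pages and the min_chars threshold of 20 are made up for illustration, not part of any tool, and the cutoff would need tuning for real files.

```python
def classify_pages(extracted: str, min_chars: int = 20) -> list[str]:
    """Label each page 'text' or 'image' by how much text was extracted.

    Assumption: pages are separated by form feeds, as pdftotext does by default.
    """
    pages = extracted.split("\f")
    # A form feed after the last page leaves an empty trailing piece; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()

    labels = []
    for page in pages:
        # Count only non-whitespace characters.
        visible = "".join(page.split())
        # Hypothetical cutoff: very little text suggests a scanned image page.
        labels.append("text" if len(visible) >= min_chars else "image")
    return labels


if __name__ == "__main__":
    sample = (
        "This page has a real text layer in it.\f"
        "\f"
        "Another page with plenty of extractable text.\f"
    )
    print(classify_pages(sample))
```

Pages labeled 'image' would be the ones to send through OCR. Pages labeled 'text' can be extracted directly, with the result only as good as the PDF source.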