MobileRead Forums - View Single Post - Need Text extraction engine from editable PDF

Tex2002ans · 05-16-2014, 03:26 PM

Quote:

Originally Posted by mrmikel

Beware! The words do not always match the text. They are only as good as the person running the OCR program and editing the result afterwards.

I believe the original post stated "the good quality(editable) PDF"... so I suspect this is just a digitally generated PDF (for example, exported directly from LaTeX/InDesign/Word/LibreOffice/etc.) rather than an OCRed scan.

You should be able to use pdf2txt.py to extract the text directly: http://www.unixuser.org/~euske/python/pdfminer/

Hopefully, the person who originally created the PDF created it as a "tagged PDF". You should then be able to use the "-t tag" option to pull the text out relatively cleanly (I am not sure whether tagged PDFs also carry the formatting in the tags).
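If you want to script that call, here is a minimal sketch in Python. It assumes pdf2txt.py is installed and on your PATH; the helper names (build_pdf2txt_command, extract) and the file names are made up for illustration. The -t and -o flags are the ones listed in pdf2txt.py's own usage text.

```python
import subprocess

def build_pdf2txt_command(pdf_path, out_path, output_type="tag"):
    """Build the argument list for pdfminer's pdf2txt.py.

    -t picks the output type (text, html, xml or tag);
    -o names the file the output is written to.
    """
    allowed = ("text", "html", "xml", "tag")
    if output_type not in allowed:
        raise ValueError("output_type must be one of: " + ", ".join(allowed))
    return ["pdf2txt.py", "-t", output_type, "-o", out_path, pdf_path]

def extract(pdf_path, out_path, output_type="tag"):
    """Run pdf2txt.py (assumed to be on PATH); raises if the tool fails."""
    subprocess.run(build_pdf2txt_command(pdf_path, out_path, output_type), check=True)
```

Something like extract("book.pdf", "book-tagged.txt") would then write the tagged output to a file. On Windows you may need to put the Python interpreter in front of the script name instead of calling pdf2txt.py directly.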

There is also xpdf: http://www.foolabs.com/xpdf/download.html

and Poppler (I believe this started as a fork of xpdf): http://poppler.freedesktop.org/

You could also try your hand at feeding it into Calibre and seeing what happens (I believe it uses Poppler on the backend?).

Quote:

Originally Posted by Toxaris

Select text, copy, paste. Within Adobe Reader (and a lot of other PDF-readers) you can also do a save as text.

Saving as Plain Text:

Won't save any formatting information.
Likely to introduce hard line breaks.
Likely to lose things like ligatures, Unicode characters, and dropcaps.
May introduce odd spacing issues.
Loses all slightly more complex objects (tables, formulas, etc.).
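Some of that damage can be patched up afterwards. Below is a rough sketch of a cleanup pass in Python (standard library only) that expands ligature characters, rejoins words hyphenated across hard line breaks, and collapses stray spacing. The function name clean_extracted_text is hypothetical, and the rules are assumptions about what typical extracted text looks like, not a general fix. It cannot bring back formatting, dropcaps, tables, or formulas.

```python
import re
import unicodedata

def clean_extracted_text(raw):
    """Tidy plain text that was saved or pasted out of a PDF."""
    # NFKC folds compatibility characters, so the "fi" ligature (U+FB01)
    # becomes the two letters "fi". It also folds other characters
    # (superscripts, non-breaking spaces), so skip it if you need those.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words split across a hard line break: "extrac-\ntion" -> "extraction".
    # Caveat: this also merges real compounds ("well-\nknown" -> "wellknown").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Assume blank lines are paragraph breaks and single newlines are hard wraps.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Collapse hard wraps and runs of spaces into single spaces.
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)
```

You would still need to proofread the result, since the hyphen rule in particular guesses wrong on genuinely hyphenated words that happen to fall at a line end.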

Also, I was just taking a gander at Adobe Acrobat's site, and they have this as a feature in their Pro version:

https://www.adobe.com/products/acrob...converter.html

I doubt it works anywhere close to how they make it seem... and probably only works for documents created with Adobe's own tools. Feed it a file made from something else, and these PDF -> XYZ programs usually explode.

Quote:

Originally Posted by mrmikel

If it must be exactly the same, then you need to proofread it all word for word... very time consuming.

Indeed indeed. PDF = horrendous input format, avoid it whenever possible.

Saving as plain text or copying/pasting out of the PDF is going to cause a bunch more headaches.