Quote:
Originally Posted by mrmikel
Beware! The words do not always match the text. They are only as good as the person running the OCR program and editing the result afterwards.
|
I believe the original post said "the good quality (editable) PDF"... so I suspect this is a digitally generated PDF (for example, exported directly from LaTeX/InDesign/Word/LibreOffice/etc.) rather than a scan.
You should be able to use pdf2txt.py to extract the text directly:
http://www.unixuser.org/~euske/python/pdfminer/
Hopefully, the person who originally created the PDF created it as a "tagged PDF". You should then be able to use the "-t tag" option to pull the text out relatively cleanly (I am not sure whether tagged PDFs also carry the formatting in the tags).
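Here is a rough sketch of how the pdf2txt.py call could be scripted. The helper names (build_pdf2txt_command, run_pdf2txt) and the file names are made up for illustration. It assumes pdf2txt.py from PDFMiner is on your PATH and accepts "-t" for the output type and "-o" for the output file.

```python
import shutil
import subprocess

# Output types pdf2txt.py is assumed to accept for "-t".
OUTPUT_TYPES = {"text", "html", "xml", "tag"}


def build_pdf2txt_command(pdf_path, out_path, output_type="tag"):
    """Build the argument list for a pdf2txt.py run (hypothetical helper)."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output type: {output_type!r}")
    return ["pdf2txt.py", "-t", output_type, "-o", out_path, pdf_path]


def run_pdf2txt(pdf_path, out_path, output_type="tag"):
    """Run pdf2txt.py, failing early if the tool is not installed."""
    if shutil.which("pdf2txt.py") is None:
        raise FileNotFoundError("pdf2txt.py not found on PATH")
    # check=True raises CalledProcessError if the extraction fails.
    subprocess.run(build_pdf2txt_command(pdf_path, out_path, output_type), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_pdf2txt() to actually extract.
    print(" ".join(build_pdf2txt_command("book.pdf", "book_tagged.txt")))
```

If the PDF turns out not to be tagged, swapping output_type to "text" or "html" is probably the next thing to try.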
There is also xpdf:
http://www.foolabs.com/xpdf/download.html
and Poppler (I believe this was built to expand upon xpdf):
http://poppler.freedesktop.org/
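Both xpdf and Poppler ship a pdftotext command-line tool, so the same idea can be sketched for that route. This assumes pdftotext is installed and on your PATH; the helper name and file names are placeholders of mine, not anything from those projects.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext run (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to keep the physical page layout (columns, indents).
        cmd.append("-layout")
    return cmd + [pdf_path, txt_path]


def run_pdftotext(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext, failing early if the tool is not installed."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, keep_layout), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_pdftotext() to actually extract.
    print(" ".join(build_pdftotext_command("book.pdf", "book.txt")))
```

Whether "-layout" helps or hurts depends on the book: it tends to suit tables, but for plain running text the default mode may give you less whitespace to clean up.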
You could also try your hand at feeding it into Calibre and seeing what happens (I believe it uses Poppler on the backend?).
Quote:
Originally Posted by Toxaris
Select text, copy, paste. Within Adobe Reader (and a lot of other PDF readers) you can also do a save as text.
|
Saving as Plain Text:
- Won't save any formatting information.
- Will likely introduce hard line breaks.
- Will likely lose or mangle things like ligatures, Unicode characters, and drop caps.
- May introduce odd spacing issues.
- Loses all slightly more complex objects (tables, formulas, etc.).
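If you do end up with a plain-text dump anyway, some of the damage listed above can be patched up with a small script. This is only a minimal sketch (the function name is mine) covering ligatures, hard line breaks, and spacing. It cannot bring back formatting, tables, or formulas, and the de-hyphenation step will wrongly join genuinely hyphenated words that happened to break across a line.

```python
import re
import unicodedata


def clean_extracted_text(raw):
    """Patch up common plain-text extraction damage (hypothetical helper)."""
    # NFKC folds compatibility characters such as the "fi" ligature (U+FB01)
    # into plain letters. It also rewrites other compatibility characters
    # (superscripts, full-width forms), so check the result.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single newline inside a paragraph becomes a space;
    # blank lines (paragraph breaks) are left alone.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "The \ufb01rst exam-\nple  has hard\nline breaks.\n\nNext paragraph."
    print(clean_extracted_text(sample))
```

You would still need to proofread afterwards; this only takes care of the most mechanical part of the cleanup.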
Also, I was just taking a gander at Adobe Acrobat's site, and they have this as a feature in their Pro version:
https://www.adobe.com/products/acrob...converter.html
I doubt it works anywhere close to how they make it seem... and probably only works for documents created with Adobe's own tools. Feed it a file made from something else, and these PDF -> XYZ programs usually explode.
Quote:
Originally Posted by mrmikel
If it must be exactly the same, then you need to proofread it all word for word... very time consuming.
|
Indeed indeed. PDF = horrendous input format, avoid it whenever possible.
Saving as plain text or copying/pasting out of the PDF is going to cause a bunch more headaches.