Old 05-16-2014, 03:26 PM   #6
Tex2002ans
Quote:
Originally Posted by mrmikel
Beware! The words do not always match the text. They are only as good as the person running the OCR program and editing the result afterwards.

I believe the original post said "the good quality(editable) PDF", so I suspect this is a digitally generated PDF (for example, exported directly from LaTeX/InDesign/Word/LibreOffice/etc.) rather than a scan that went through OCR.

You should be able to use pdf2txt.py to extract the text directly: http://www.unixuser.org/~euske/python/pdfminer/

Hopefully, whoever created the PDF exported it as a "tagged PDF". If so, you should be able to use the "-t tag" option to pull the text out relatively cleanly (I am not sure whether tagged PDFs also carry the formatting in the tags).
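
As a rough sketch of what that call could look like from a script: the helper names, `book.pdf`, and `book.xml` below are made up for illustration, and it assumes pdf2txt.py is installed and on your PATH. `-t` picks the output type and `-o` names the output file.

```python
import subprocess
from pathlib import Path

# Output types accepted by pdf2txt.py's -t option.
OUTPUT_TYPES = {"text", "html", "xml", "tag"}


def build_pdf2txt_command(pdf_path, out_path, output_type="tag"):
    """Return the argument list for one pdf2txt.py run (hypothetical helper)."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output type: {output_type!r}")
    return ["pdf2txt.py", "-t", output_type, "-o", str(out_path), str(pdf_path)]


def extract(pdf_path, out_path, output_type="tag"):
    """Run pdf2txt.py and return the output path.

    Assumption: pdf2txt.py is directly executable from PATH. On some
    systems you may need to call it through the Python interpreter instead.
    """
    subprocess.run(build_pdf2txt_command(pdf_path, out_path, output_type), check=True)
    return Path(out_path)


if __name__ == "__main__":
    # Only prints the command; call extract() to actually run it.
    print(" ".join(build_pdf2txt_command("book.pdf", "book.xml")))
```

If the PDF is not tagged, swapping in "text" or "html" for the output type is the obvious fallback, and the result will need more cleanup.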

There is also xpdf: http://www.foolabs.com/xpdf/download.html

and Poppler (a fork of xpdf): http://poppler.freedesktop.org/

You could also try feeding it into Calibre and seeing what happens (I believe it uses Poppler on the backend?).

Quote:
Originally Posted by Toxaris
Select text, copy, paste. Within Adobe Reader (and a lot of other PDF readers) you can also do a save as text.

Saving as Plain Text:
  • Won't save any formatting information.
  • Likely to leave you with hard line breaks.
  • Likely to lose things like ligatures, Unicode characters, and drop caps.
  • May introduce odd spacing issues.
  • Loses all slightly more complex objects (tables, formulas, etc.).
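
To give an idea of the cleanup this means afterwards, here is a minimal sketch that patches up two of those problems (hard line breaks and ligature characters). All the function names are made up for illustration, and the rules are crude heuristics: it assumes paragraphs are separated by blank lines, and it can only expand ligatures that survived as single Unicode characters, not ones that were dropped entirely.

```python
import re

# Latin ligatures from Unicode's Alphabetic Presentation Forms block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace single-character ligatures with their plain letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def unwrap_hard_breaks(text):
    """Rejoin lines inside each paragraph (paragraphs = blank-line separated)."""
    cleaned = []
    for para in re.split(r"\n\s*\n", text):
        # Rejoin words hyphenated across a line break. Caveat: this also
        # glues real compounds, so "well-\nknown" becomes "wellknown".
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Any remaining line break inside the paragraph becomes a space.
        para = re.sub(r"\s*\n\s*", " ", para)
        # Collapse runs of spaces or tabs left over from the layout.
        para = re.sub(r"[ \t]{2,}", " ", para)
        para = para.strip()
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)


def clean_extracted_text(text):
    """Apply both fixes to text copied or saved out of a PDF."""
    return unwrap_hard_breaks(expand_ligatures(text))
```

Tables, formulas, and formatting are not recoverable this way, so it still needs proofreading afterwards.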

Also, I was just taking a gander at Adobe Acrobat's site, and they have this as a feature in their Pro version:

https://www.adobe.com/products/acrob...converter.html

I doubt it works anywhere close to how they make it seem... and it probably only works well for documents created with Adobe's own tools. Feed it a file made from something else, and these PDF -> XYZ programs usually explode.

Quote:
Originally Posted by mrmikel
If it must be exactly the same, then you need to proofread it all word for word... very time consuming.

Indeed indeed. PDF = horrendous input format, avoid it whenever possible.

Saving as plain text or copying/pasting out of the PDF is going to cause a bunch more headaches.

Last edited by Tex2002ans; 05-16-2014 at 03:29 PM.