02-17-2013, 08:49 PM   #52
kevinp
Fanatic
Posts: 579
Karma: 3549018
Join Date: Jul 2004
Location: Michigan
Device: Kindle Scribe, Kindle PW (10th & 11th gen); Fire HD 10
Quote:
Originally Posted by Turtle91
I was under the impression that Acrobat - even Pro - doesn't keep the formatting when you save to text. In which case you will not have any of the italics, bold, superscript, etc.

Assuming the PDF they give you is a perfect OCR of the original - you would still need to go back and manually format the entire book to make it like the original.

I did an experiment by creating a test page in Word with different formatting of sections of text. I then saved that document as a PDF. This provides a "perfect OCR of the original image". When I opened that PDF in Acrobat Pro, everything looked as it should and I could perform a find on any of the words in there. I then saved the PDF as text. Acrobat gives 2 options, Plain text and Accessible text - I did both. In both cases the text was correct but without ANY formatting.

If there is a different way of saving a PDF to text, I would be very interested to know how.
The (so-called) OCR in Acrobat is mainly there so you can search for text in the PDF. It isn't made to do what you're thinking of.

That's why you need to use an actual OCR program. I use Abbyy. It will open a PDF and extract the pages as TIF files, then do its thing. It works fairly well on stuff like paperbacks, and it will capture bold, italics, etc.
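
For anyone who wants to see the same "pages out as TIF, then OCR each page" flow without Abbyy, here is a rough sketch in Python. It is not how Abbyy works internally. It assumes the open-source command-line tools pdftoppm (from Poppler) and tesseract are installed, and the function names and the "page" file prefix are made up for illustration. Don't expect Tesseract's hOCR output to capture bold and italics the way Abbyy does.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def rasterize_cmd(pdf: Path, out_prefix: Path, dpi: int = 300) -> list[str]:
    """Build a pdftoppm command that writes one TIFF image per PDF page."""
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf), str(out_prefix)]


def ocr_cmd(image: Path) -> list[str]:
    """Build a tesseract command that writes hOCR output next to the image."""
    # tesseract appends the output extension itself, so drop the image suffix.
    return ["tesseract", str(image), str(image.with_suffix("")), "hocr"]


def ocr_pdf(pdf: Path, workdir: Path) -> list[Path]:
    """Rasterize every page, OCR each image, and return the expected hOCR paths."""
    workdir.mkdir(parents=True, exist_ok=True)
    prefix = workdir / "page"  # hypothetical prefix; pdftoppm adds "-<page number>"
    subprocess.run(rasterize_cmd(pdf, prefix), check=True)
    results = []
    # Match both .tif and .tiff in case the installed version names them differently.
    for image in sorted(workdir.glob("page-*.tif*")):
        subprocess.run(ocr_cmd(image), check=True)
        results.append(image.with_suffix(".hocr"))
    return results
```

Calling ocr_pdf(Path("book.pdf"), Path("ocr_out")) would run both tools for real. The two *_cmd helpers only build the command lines, so you can print them and check what would run first.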

If you want to OCR stuff like textbooks that contain lots of illustrations and such, I don't know of anything that works 100%.