Thread: eBooks We Need
Old 07-23-2008, 10:19 PM   #67
DMcCunney
New York Editor
DMcCunney ought to be getting tired of karma fortunes by now.
Posts: 6,384
Karma: 16540415
Join Date: Aug 2007
Device: PalmTX, Pocket eDGe, Alcatel Fierce 4, RCA Viking Pro 10, Nexus 7
Quote:
Originally Posted by Elsi
On the Google Books page, there's an option to view plain text. The "viewer" is some strange thing, but you can -- if you're careful -- copy/paste into a text document. I did this with a 200-page book. It was tedious and the viewer threw up some repeated text, but if you're patient, it may be easier than trying to OCR the PDF file. (Of course, I've never OCRed a PDF file, so it may be easier than I am thinking it would be.)
Depends on the PDF file and the OCR software.

If the PDF contains text, OCR may not be necessary: unlocked PDFs will have an option to save the text to a file. You lose images, fonts, formatting and the like, but you get the text.

Other PDFs are simply collections of images. Those would need to be run through OCR, if possible. You would also need to do substantial editing and cleanup. No OCR software guesses right all the time, and image quality is a factor. Ligatures are a particular problem.
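
As a rough illustration of the ligature cleanup, here is a minimal Python sketch. It assumes the OCR output contains the Unicode ligature characters (U+FB00 through U+FB06, such as the single-character "fi" and "fl" forms) and expands them to plain letters. The function name expand_ligatures is made up for this example. It will not help when the OCR engine misread a ligature as some other character entirely; those still need fixing by hand.

```python
import unicodedata

# Range of Unicode Latin ligature presentation forms (ff, fi, fl, ffi, ffl, ...).
LIGATURE_FIRST = "\ufb00"
LIGATURE_LAST = "\ufb06"

def expand_ligatures(text: str) -> str:
    """Expand single-character Latin ligatures into their plain letters.

    Only characters in U+FB00..U+FB06 are touched, so accented letters
    and other non-ASCII text pass through unchanged.
    """
    return "".join(
        unicodedata.normalize("NFKD", ch)
        if LIGATURE_FIRST <= ch <= LIGATURE_LAST
        else ch
        for ch in text
    )

if __name__ == "__main__":
    sample = "The \ufb01nal of\ufb01ce \ufb02oor plan"
    print(expand_ligatures(sample))
```

Restricting the normalization to the ligature range is a deliberate choice: running compatibility normalization over the whole text would also rewrite things like superscripts and fractions, which you may want to keep.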

The PDF in question is a collection of images of page scans, and the View as Text is the result of OCR.
______
Dennis