11-04-2010, 07:52 PM   #5
thrawn_aj
quantum mechanic
Quote:
Originally Posted by jcleaver
Thanks for the replies. It seems it may be faster for me to type it from scratch. I did play with an OCR solution, and it wasn't good. It got most words correct, but it took forever to find all the little mistakes. And that was just 1 page. I literally could have typed the page faster than proofreading the converted page.
That's strange. I haven't done any scanning or OCR myself but I have seen files (usually PDF) that other people have OCR'd. Unless the source paper book was really crappy, most of the words come through alright and should require only minor editing/proofing subsequently.

Anyway, I have one thing that may help you. I noticed (based on a suggestion by another MR member - I forget who) that Mobipocket Creator is MUCH more intelligent at processing PDF files into HTML (when it creates a publication). It removes headers and footers and even hardcoded page numbers that are scanned in and appear as floating numbers. You can then take its raw HTML file (which, again, is extraordinarily well-formatted considering it's generated by a program), edit it directly in a plain text editor with regular expressions and the like, and AFTER that use it as the input for Calibre. I cleaned up several old PDFs I had into remarkably clean ePUBs this way.

Of course, the input PDF to Mobipocket Creator should be an OCR'd PDF (one with actual text in it, not just page images).
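To give an idea of the kind of regular-expression cleanup I mean on the raw HTML, here is a rough Python sketch. It is only an illustration: the function name clean_html is made up, and both patterns are guesses about what leftovers might look like (a stray page number sitting alone in its own paragraph, and a word split by a hyphen at a line break). Look at your own HTML file first and adjust the patterns to what is actually in it.

```python
import re

# Assumption: a leftover page number appears as a paragraph containing
# nothing but 1-4 digits, e.g. <p>42</p>. Real markup may differ.
PAGE_NUMBER = re.compile(r"<p\b[^>]*>\s*\d{1,4}\s*</p>\s*", re.IGNORECASE)

# Assumption: a word broken at the end of a printed line shows up as
# "exam-" + newline + "ple". Only lowercase letters on both sides are
# joined, but this will still wrongly join genuinely hyphenated words
# (e.g. "well-" / "known"), so proofread the result.
HYPHEN_BREAK = re.compile(r"([a-z])-\s*\n\s*([a-z])")


def clean_html(html: str) -> str:
    """Return html with digit-only paragraphs removed and split words rejoined."""
    html = PAGE_NUMBER.sub("", html)
    html = HYPHEN_BREAK.sub(r"\1\2", html)
    return html
```

You would read the HTML file into a string, pass it through clean_html, and write the result to a new file (keep the original around) before handing it to Calibre. Most plain text editors with regex search-and-replace can do the same thing interactively if you would rather check each match.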