MobileRead Forums - View Single Post

thrawn_aj · 11-04-2010, 08:52 PM

Quote:

Originally Posted by jcleaver

Thanks for the replies. It seems it may be faster for me to type it from scratch. I did play with an OCR solution, and it wasn't good. It got most words correct, but it took forever to find all the little mistakes. And that was just 1 page. I literally could have typed the page faster than proofreading the converted page.

That's strange. I haven't done any scanning or OCR myself but I have seen files (usually PDF) that other people have OCR'd. Unless the source paper book was really crappy, most of the words come through alright and should require only minor editing/proofing subsequently.

Anyway, I have one thing that may help you. I noticed (based on a suggestion by another MR member - I forget who) that Mobipocket Creator is MUCH more intelligent at processing PDF files into html (when it creates a publication). It removes headers and footers and even hardcoded page numbers that are scanned in and appear as floating numbers. You can then take its raw html file (which, again, is extraordinarily well-formatted considering it's generated by a program), edit it in a plain text editor (using regular expressions and the like on it directly), and use the result as the input for Calibre. I cleaned up several old PDFs I had this way into remarkably clean ePUBs.

Of course, the input PDF to Mobipocket Creator should be an OCR'd PDF (one with actual text in it, not just page images).
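To give a rough idea of the kind of regex cleanup I mean, here is a minimal Python sketch. It is only an illustration: the function name clean_html is made up, and I'm assuming the exported html uses plain lowercase <p> tags for paragraphs, which may not match what your file actually looks like. Check the real markup first and adjust the patterns.

```python
import re

def clean_html(html: str) -> str:
    """Hypothetical cleanup pass; assumes lowercase <p>...</p> paragraphs."""
    # Drop paragraphs that hold nothing but a short number
    # (stray page numbers that survived the conversion).
    html = re.sub(r"<p\b[^>]*>\s*\d{1,4}\s*</p>\s*", "", html)
    # Rejoin a word hyphenated across a paragraph break:
    # "jum-</p><p>ped" becomes "jumped".
    html = re.sub(r"(\w)-\s*</p>\s*<p\b[^>]*>\s*(?=[a-z])", r"\1", html)
    # Merge a paragraph that ends without sentence punctuation into the
    # next one when the next one starts with a lowercase letter.
    html = re.sub(r"(?<![.!?:\s])\s*</p>\s*<p\b[^>]*>\s*(?=[a-z])", " ", html)
    return html

sample = (
    "<p>The quick brown fox jum-</p>\n"
    "<p>ped over the lazy</p>\n"
    "<p>42</p>\n"
    "<p>dog. It ran.</p>\n"
    "<p>New paragraph.</p>"
)
print(clean_html(sample))
```

The order matters here: page numbers go first so they don't sit between two halves of a broken paragraph. The hyphen rule will also glue together words that are genuinely hyphenated at a line end, so it's worth skimming the result before feeding it to Calibre.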