04-03-2012, 05:22 PM   #18
TechSarge
Join Date: Feb 2012
Location: Florida USA
Device: Kindle 4 SO (Died), Kindle Fire HD 7"
As the OP, I'd like to give an update:

Finished my first book a couple of weeks ago. It's a paperback with no e-copy available. (BTW folks, in this instance, scanning a book you already own isn't piracy; it is fair use and legal, same as making a backup copy of a music CD you own, or ripping said CD to MP3.)

I scanned all pages to TIFs, using an ancient Lexmark X1100 series AIO scanner I have here (I was very careful with the book, as I don't like flattening it out on that flatbed scanner). Pages were run through ScanTailor to straighten out any misaligned scans and to cut the double pages apart. Pages were then run through Adobe Acrobat 9 Enhanced's OCR function, with ClearScan enabled. The OCR output was saved as HTML, as I didn't know how to save as XHTML then (do now). Files were then opened in Sigil for editing, proofreading, etc.

I have to say that for this particular book, Acrobat's OCR engine sucks. It took me probably 36 hours of proofing to fix everything, as I had to read and re-read the book to catch all of the errors - everything from a single wrong letter in a word, to entire sentences missing from the text. Forget about italics, they were always wrong or nonexistent.
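For anyone facing the same proofing slog, here is a rough Python sketch of a pre-pass that flags lines likely to contain OCR slips, so the eye goes to them first. The three patterns are assumptions about typical OCR mistakes (a digit inside a word, a capital in mid-word, a letter tripled), not anything Acrobat documents, and the function name is made up for illustration. It will not catch missing sentences or lost italics; those still need a read against the paper copy.

```python
import re

# Heuristic patterns for common OCR slip-ups (assumed, not exhaustive).
SUSPICIOUS = [
    re.compile(r"\b[A-Za-z]+\d+[A-Za-z]*\b"),  # digit inside a word, e.g. "c1osed"
    re.compile(r"\b[a-z]+[A-Z][a-z]*\b"),      # capital in mid-word, e.g. "heLlo"
    re.compile(r"([A-Za-z])\1\1"),             # same letter three times in a row
]

def flag_suspicious(text):
    """Return (line_number, line) pairs for lines matching any heuristic."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPICIOUS):
            hits.append((number, line.strip()))
    return hits
```

Run it over the plain text of each chapter and check the flagged lines against the scan before doing the full read-through.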

A few things I'd like to change:

Sigil did a good job formatting the things I thought it would choke on, such as the map at the beginning of the book. It did choke on line drawings at the beginning of each chapter, though, so I had to cut n' paste one from one of the original scans as a bitmap and use that for each chapter. Ugly, but worked.

The gobs of extra lines in the text have to go. Thankfully, I found out how to deal with this in Calibre, along with paragraph indentation. Sigil has no capacity for this, and it's a serious oversight, as it's touted as a friggin' editor! In this day and age, one shouldn't need to go into the code to do such obvious tweaking.
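If you do end up going into the code, the cleanup can be scripted. The sketch below is not what Calibre does internally; it assumes the extra lines come from empty paragraphs or stacked line-break tags in the OCR'd markup, which may not match every file. The function name and the CSS rule are hypothetical examples.

```python
import re

# Assumption: the "extra lines" are empty <p> elements or runs of <br/> tags.
EMPTY_P = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|&#160;|<br\s*/?>)*</p>\s*", re.IGNORECASE
)
BR_RUN = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)

# Hypothetical stylesheet rule: indent the first line, no gap between paragraphs.
INDENT_CSS = "p { margin: 0; text-indent: 1.5em; }"

def tidy_paragraphs(xhtml):
    """Drop empty paragraphs and collapse runs of line breaks to a single one."""
    xhtml = EMPTY_P.sub("", xhtml)
    return BR_RUN.sub("<br/>", xhtml)
```

The rule in INDENT_CSS would go into the book's stylesheet; the function only touches the markup. Work on a copy of the file, since a regex pass like this knows nothing about the rest of the document.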

Sigil changes things in the book once you save it. I saved changes to Chapter Two FOUR TIMES (simply centering the word "TWO" at the beginning of the file). Each time I opened the book on my device, "TWO" was left-justified instead of centered. As it is the last noticeable error in the book, I said "screw it" and am leaving it as-is. I'm not going to mess with it anymore.
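The cause of the lost centering isn't clear from the above. One way to look for it: an EPUB is a zip archive, so the saved file can be opened directly and searched for alignment rules. This is a minimal sketch using Python's standard zipfile module; the function name is hypothetical, and it only lists every line mentioning text-align so that conflicting rules (say, a stylesheet rule fighting an inline style) can be spotted by eye.

```python
import zipfile

def find_alignment_rules(epub):
    """Return (file_name, line) pairs for every text-align mention in the book.

    `epub` is a path to the .epub file (or an open binary file object).
    """
    hits = []
    with zipfile.ZipFile(epub) as book:
        for name in book.namelist():
            # Only look at markup and stylesheets, not images or fonts.
            if not name.lower().endswith((".xhtml", ".html", ".htm", ".css")):
                continue
            text = book.read(name).decode("utf-8", errors="replace")
            for line in text.splitlines():
                if "text-align" in line:
                    hits.append((name, line.strip()))
    return hits
```

If the chapter file shows a centered heading but the device still renders it left-aligned, the reader software may be overriding the style, which this check cannot detect.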

If I didn't have the paper copy of the book here to proof the OCR against, I couldn't have finished this sane (and this was only a 250-page paperback!). If I were working solely from original scans, on only a laptop and not a multiple-monitor setup, the constant flipping back and forth would have driven me nuts. The next few books are going to be much more challenging, with triple-column text on each page, and/or lots of inserted line art or photos. The fonts are a lot older as well, which will (I'm sure) give Acrobat's OCR even more fits. I gotta either figure out how to improve Acrobat's accuracy, or get a different OCR engine.

I am very, very proud of the job I did on this e-book, though. It is as attractive to look at and to read as any commercially published e-book I've read.

Suggestions as to better software or changes to workflow are quite welcome. I'm starting on my second project very soon.