Old 11-18-2011, 06:41 AM   #27
Snorkledorf
Blue. Not sad...just blue
Posts: 218
Karma: 1267018
Join Date: Oct 2009
Location: Japan
Device: Ridibooks Paper Pro
Thanks for the tip about Prince; I'll have to look into that! Anything to smack down those Topaz files!

My ebook workflow from PDF is to convert the OCRed file to Markdown-formatted text for editing, and then convert that to mobi for the Kindle.

I tried using Word for a while, but the Styles palette drove me nuts. Markdown is like an easy-to-edit, stripped-down version of HTML that calibre understands. It's very human-readable, so I've found it comfortable to work in.

In more detail:
  1. Run the scanned PDF through ABBYY FineReader 11 (running in a virtual Windows 7 machine on my Mac). Spellcheck it here.
  2. Save as HTML.
  3. Import that into calibre, where it becomes a ZIP archive.
  4. Convert from ZIP to TXTZ (if it has images) or TXT (if not). Set calibre's conversion output settings to Format: Markdown; Do not remove links: on; Do not remove images: on.
  5. Rename the .txtz file to .zip, and unzip it.
  6. Use BBEdit to clean up the resulting Markdown-formatted text file using regular expressions (BBEdit also understands Markdown and will preview it for me).
  7. Import it back into calibre and convert to mobi for the Kindle.
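
For anyone who would rather script steps 5 and 6 than do them by hand, here is a rough Python sketch. It is not what I actually run (I use BBEdit), and the names unpack_txtz, clean_markdown and CLEANUP_RULES are made up for illustration. The three regex rules are only guesses at typical OCR cleanup, so treat them as placeholders for whatever your own scans need.

```python
import re
import zipfile
from pathlib import Path


def unpack_txtz(txtz_path, out_dir):
    """Extract a TXTZ file and return the paths of the text files inside.

    A TXTZ is an ordinary ZIP archive with a different extension, so
    zipfile can open it directly without renaming it to .zip first.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(txtz_path) as archive:
        archive.extractall(out)
    return sorted(out.rglob("*.txt"))


# Hypothetical cleanup rules, applied in order. Assumptions to check
# against your own files:
#  - stripping trailing spaces also removes Markdown hard line breaks
#    (two spaces at the end of a line);
#  - rejoining "exam-\nple" will also join genuine compounds that happen
#    to be split across lines, such as "well-\nknown".
CLEANUP_RULES = [
    (re.compile(r"[ \t]+$", re.MULTILINE), ""),  # strip trailing whitespace
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),       # rejoin words split by a hyphen
    (re.compile(r"\n{3,}"), "\n\n"),             # collapse runs of blank lines
]


def clean_markdown(text):
    """Apply every cleanup rule to the Markdown text and return the result."""
    for pattern, replacement in CLEANUP_RULES:
        text = pattern.sub(replacement, text)
    return text
```

To use it, call unpack_txtz on the .txtz file that calibre produced, run clean_markdown over the contents of each text file it returns, and write the result back before importing into calibre again. You will still want to eyeball the output, since no set of regexes catches everything OCR gets wrong.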

FineReader 11 seems to be quite a bit better than the v9 I was using before, and I'm very happy with it. Since I upgraded, it's finally worthwhile to get my scanned-but-not-OCRed-yet library out of limbo.