Thanks for the tip about Prince; I'll have to look into that! Anything to smack down those Topaz files!
My ebook workflow from PDF is to convert the OCRed file to Markdown-formatted text for editing, and then convert that to mobi for the Kindle.
I tried using Word for a while but the Styles palette drove me nuts.
Markdown is like an easy-to-edit, stripped-down version of HTML that calibre understands. It's very human-readable, so I've found it comfortable to work in.
In more detail:
- Run the scanned PDF through ABBYY FineReader 11 (running in a virtual Windows 7 machine on my Mac). Spellcheck it in FineReader at this stage.
- Save as HTML.
- Import that into calibre, where it becomes a ZIP archive.
- Convert from ZIP to TXTZ (if it has images) or TXT (if not). Set calibre's conversion output settings to Format: Markdown; Do not remove links: on; Do not remove images: on.
- Rename the .txtz file to .zip, and unzip it.
- Use BBEdit to clean up the resulting Markdown-formatted text file using Regular Expressions (BBEdit also understands Markdown and will preview it for me).
- Import it back into calibre and convert to mobi for the Kindle.
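For anyone who'd rather script the unzip and cleanup steps than do them by hand, here's a rough Python sketch. It relies on the .txtz file being an ordinary ZIP archive (which is why the rename trick works), so `zipfile` can open it without renaming. The function names and the three regex patterns are my own illustrative assumptions, not the exact searches I run in BBEdit; adjust them to whatever junk your OCR leaves behind.

```python
import re
import zipfile
from pathlib import Path


def extract_txtz(txtz_path, out_dir):
    """Unpack a TXTZ (a ZIP archive) and return the path of its first .txt file."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(txtz_path) as archive:
        archive.extractall(out_dir)
        names = archive.namelist()
    text_files = [out_dir / name for name in names if name.lower().endswith(".txt")]
    if not text_files:
        raise FileNotFoundError(f"no .txt file inside {txtz_path}")
    return text_files[0]


def clean_markdown(text):
    """Apply a few example cleanup patterns (hypothetical; tune per book)."""
    # Rejoin words split across a line break: "exam-\nple" -> "example".
    # Caveat: this also joins genuinely hyphenated words split at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain nothing but a 1-4 digit page number.
    text = re.sub(r"(?m)^[ \t]*\d{1,4}[ \t]*\n", "", text)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text


def clean_txtz(txtz_path, out_dir):
    """Extract the archive, clean the text file in place, and return its path."""
    text_path = extract_txtz(txtz_path, out_dir)
    cleaned = clean_markdown(text_path.read_text(encoding="utf-8"))
    text_path.write_text(cleaned, encoding="utf-8")
    return text_path
```

Calling something like `clean_txtz("book.txtz", "book_unpacked")` (both names are placeholders) leaves the images next to the cleaned text file, ready for a final look in an editor before going back into calibre.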
FineReader 11 seems to be quite a bit better than v9, which I was using before. Very happy with it. Since I upgraded, it's finally really worthwhile to get my scanned-but-not-OCRed-yet library out of limbo.