MobileRead Forums - View Single Post

Snorkledorf · 11-18-2011, 07:41 AM

Thanks for the tip about Prince; I'll have to look into that! Anything to smack down those Topaz files!

My ebook workflow from PDF is to convert the OCRed file to Markdown-formatted text for editing, and then convert that to mobi for the Kindle.

I tried using Word for a while but the Styles palette drove me nuts. Markdown is like an easy-to-edit, stripped-down version of HTML that calibre understands. It's very human-readable, so I've found it comfortable.

In more detail:

1. Run the scanned PDF through ABBYY FineReader 11 (running in a virtual Windows 7 machine on my Mac). Spellcheck it in FineReader.
2. Save as HTML.
3. Import that into calibre, where it becomes a ZIP archive.
4. Convert from ZIP to TXTZ (if it has images) or TXT (if not). Set calibre's conversion output settings to Format: Markdown; Do not remove links: on; Do not remove images: on.
5. Rename the .txtz file to .zip, and unzip it.
6. Use BBEdit to clean up the resulting Markdown-formatted text file using regular expressions (BBEdit also understands Markdown and will preview it for me).
7. Import it back into calibre and convert to mobi for the Kindle.
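
For anyone who would rather script the conversion, unzip, and cleanup steps above than click through them, here is a rough Python sketch. It is only an illustration under some assumptions: the function names are made up, the `ebook-convert` flags are what I understand calibre's command-line TXT output options to be (check `ebook-convert --help` on your version), and the regular expressions are generic examples of OCR cleanup, not the actual patterns used in BBEdit.

```python
import re
import zipfile


def txt_convert_command(src, dst):
    """Build (but do not run) an ebook-convert command for the Markdown export.

    The flags are assumed to match the GUI settings named in the post.
    """
    return [
        "ebook-convert", str(src), str(dst),
        "--txt-output-formatting=markdown",  # Format: Markdown
        "--keep-links",                      # Do not remove links: on
        "--keep-image-references",           # Do not remove images: on
    ]


def unpack_txtz(txtz_path, out_dir):
    """Extract a TXTZ file and return the names of the files inside it.

    A TXTZ is an ordinary ZIP archive, so zipfile can open it directly
    without renaming it to .zip first.
    """
    with zipfile.ZipFile(txtz_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


def clean_markdown(text):
    """Apply a few example OCR cleanups to Markdown text (hypothetical rules)."""
    # Rejoin words split by a hyphen at a line break: "exam-\nple" -> "example".
    # Caution: this also joins real compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces between words into one space. Trailing spaces
    # are left alone because two of them mean a hard line break in Markdown.
    text = re.sub(r"(?<=\S) {2,}(?=\S)", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

The command list from `txt_convert_command("book.zip", "book.txtz")` could be passed to `subprocess.run` if calibre's command-line tools are on the PATH. The cleaned text would then go back into calibre for the final conversion to mobi.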

FineReader 11 seems to be quite a bit better than the v9 I was using before. Very happy with it. Since I upgraded, it's finally really worthwhile to get my scanned-but-not-OCRed-yet library out of limbo.
