MobileRead Forums - View Single Post

Blossom · 11-11-2011, 05:58 PM

Quote:

Originally Posted by DiapDealer

Out of curiosity... what's your process after OCR'ing (I'm assuming ABBYY) to Word? I struggle with that step. Not that I can't get a working ebook from it, but I'm usually quite disgusted with the HTML produced by ABBYY And/Or the HTML produced by saving a Word doc as Unfiltered HTML. I spend ridiculous amounts of time trying to clean either up.

I am stuck with Word 2007 and ABBYY FineReader 9.0. Are the newer versions of each miraculously better at producing HTML that doesn't make me want to yak?

I use Word 2003 and ABBYY FineReader Pro 11. I find v.11 does a fantastic job compared to v.9, which I also had, and it is well worth the upgrade!

I OCR the PDF in ABBYY, have it save as a Word doc with editable content, and then open it in Word and clean it up a bit. This one took little to no work: mostly checking for OCR spelling errors (only a few words) and using search and replace to fix formatting, like words in bold that shouldn't be, etc. Then I apply my macros, save as filtered HTML, and done!

It took about 45 minutes, but that's only because I scrolled through it twice to make sure I didn't miss anything. This is my first Topaz to PDF to HTML conversion, so I wanted to make sure it was well done.

Some tips I have found: make sure ABBYY isn't set to save images from the PDF to Word. Word will make a mess of it. You can manually add them later if you want.

Do not use Calibre to make your PDF. It will choke ABBYY, not to mention the file will be 3 times the size it should be.

Edit the Word doc in Normal view to see how it will look on your eReader. Turn on the Paragraph button and learn what each character means so you can catch things out of place by eye.

The one I just did didn't need this extra step, but if the formatting is too messed up, Book Designer 5 will fix it by converting the styles to HTML tags. This works great on fiction books, because plain text is just that and it uses the basic B and I tags, etc., so it's easier to edit.

There is a trick in BD5 that will fix most broken sentences too. Import into BD5 with "Keep Original Format" checked, then save as HTML. Change the options to "Reformat completely" with "Keep styles" checked, then import the HTML file you just saved. You'll find almost all broken sentences are fixed, except the ones where a capitalized word or a ' comes right after the break.
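
I don't know what BD5 actually does internally, but the behaviour looks like a simple joining rule, and that would explain why breaks followed by a capital or a ' get left alone. Here is a rough Python sketch of that kind of rule. The function name and the exact punctuation list are my own assumptions, not anything taken from BD5:

```python
def join_broken_paragraphs(paragraphs):
    """Merge a paragraph into the previous one when it looks like a continuation.

    Guessed heuristic (an assumption, not BD5's real logic): the previous
    paragraph does not end in closing punctuation AND the next one starts
    with a lowercase letter. A capital letter or a quote mark at the start
    is ambiguous, so those breaks are left for manual checking.
    """
    merged = []
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        prev = merged[-1] if merged else ""
        prev_is_open = bool(prev) and prev[-1] not in ".!?:\"'"
        starts_lower = text[0].islower()
        if prev_is_open and starts_lower:
            merged[-1] = prev + " " + text
        else:
            merged.append(text)
    return merged


sample = [
    "He walked to the",
    "door and stopped.",
    "She looked at",
    "Paris in the rain.",
]
print(join_broken_paragraphs(sample))
```

In the sample, the first break is repaired but "She looked at" / "Paris in the rain." stays split, which is the same kind of leftover you have to fix by hand.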

You can then save as HTML and open that in Word to edit. I like to open it in Notepad2 before Word and do some quick search and replaces to change the DIV tags to P tags and get rid of the "    " it adds to each paragraph as an indent.
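
If you would rather script that Notepad2 step, something like the Python sketch below should do roughly the same search and replace. It is only an illustration: the function name is made up, and I am assuming the indent is a run of spaces, tabs or non-breaking spaces (typed or as &nbsp;) right after the opening tag, so check what your own file actually contains first.

```python
import re


def divs_to_paragraphs(html: str) -> str:
    """Swap DIV tags for P tags and drop the padding used as an indent."""
    # Opening tags: keep any attributes, change only the tag name.
    html = re.sub(r"<div\b([^>]*)>", r"<p\1>", html, flags=re.IGNORECASE)
    # Closing tags.
    html = re.sub(r"</div\s*>", "</p>", html, flags=re.IGNORECASE)
    # Assumed indent: spaces, tabs, non-breaking spaces or &nbsp; entities
    # sitting directly after the opening <p ...> tag.
    html = re.sub(
        r"(<p\b[^>]*>)(?:&nbsp;|[ \t\u00a0])+",
        r"\1",
        html,
        flags=re.IGNORECASE,
    )
    return html


print(divs_to_paragraphs('<DIV class="x">&nbsp;&nbsp; Hello</DIV>'))
```

This is a blunt text replace, just like doing it by hand in an editor. If the file has DIVs nested inside other DIVs, you will end up with nested P tags, so look the result over before opening it in Word.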

I do my final editing in Word then import into Calibre for a good readable copy that works well on my Kindle.