01-25-2010, 03:48 PM   #356
Coconut
Device: Kindle DX
Quote:
Originally Posted by clarknova
Yeah, I ended up doing something similar: using Inkscape to render each page to a huge PNG (1200 dpi!), importing the PNGs into Acrobat, using Acrobat to OCR the pages, and then optimizing the result. The result isn't as pretty as the SVG (because the glyphs have been rasterized), but it's totally usable as a searchable PDF that retains the original book formatting at an acceptable file size.

Unfortunately, I have no use for PDFs (since PDF isn't an ebook format). But for people who do, this is certainly an option, provided they have Acrobat Pro.

(Acrobat's OCR is neat. There are a few more OCR errors than in the original Topaz file, but it attempts to preserve style -- though not very well...)

Kevin: All of the open source OCR stuff is pretty obsolete and useless. It tends to produce far more errors than the Topaz file or even Adobe's OCR.

Personally, I find the genhtml output to be the most usable. I just have to convert the Monogram and Table SVGs that get generated into PNG/JPEG (using ImageMagick, Illustrator, Inkscape, or whatever) so I can create an ePub out of the data.

The "optimize for OCR" step (not OCR itself, just image adjustment) is a function in Acrobat. It neatly shades the rasterized images and is used to reduce file size. I used a much smaller size than you did, since there is really no point in that kind of resolution unless you plan on publishing the thing. I go for a size that's easily readable on my Kindle and other screens; again, the interest is in proper pagination. I'm really very happy with how it came out.
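
To put rough numbers on the resolution point, here is a small Python sketch. It is only an illustration under stated assumptions: the 6 x 9 inch page size and the 150 dpi screen-oriented setting are hypothetical, and the helper names are made up. The command builder assumes Inkscape is the rasterizer and uses the Inkscape 1.x export options; it only builds the command line and does not run it.

```python
import math


def raster_size(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    return math.ceil(width_in * dpi), math.ceil(height_in * dpi)


def inkscape_png_command(svg_path, png_path, dpi):
    """Build (but do not run) an Inkscape 1.x command line for one page.

    Inkscape releases before 1.0 used --export-png=FILE instead of
    --export-type / --export-filename.
    """
    return [
        "inkscape",
        "--export-type=png",
        f"--export-filename={png_path}",
        f"--export-dpi={dpi}",
        svg_path,
    ]


# Hypothetical 6 x 9 inch page; the real page size depends on the book.
for dpi in (1200, 150):
    w, h = raster_size(6, 9, dpi)
    print(f"{dpi} dpi -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")

# Hypothetical file names, shown only to illustrate the command shape.
print(" ".join(inkscape_png_command("page0001.svg", "page0001.png", 150)))
```

Under those assumptions a 1200 dpi page is 7200 x 10800 pixels, while 150 dpi gives 900 x 1350, which is already about what an e-reader screen can show.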

For OCR I actually used FineReader, which does a great job. The PDF I end up with is essentially error-free. FineReader can also export to a variety of formats (paged and non-paged). I would not be surprised if the HTML it outputs surpasses what we've been able to produce, since it retains formatting. I'll try that later. Do we have a standard text to use for converting and comparing the different methods? It's really the only way to determine what works best.
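
If we do settle on a standard text, one possible way to compare methods is a character error rate: the edit distance between each method's output and a hand-checked reference, divided by the length of the reference. The sketch below is just one way to do that in Python. The function names are hypothetical, and the sample strings are made up for illustration, not real output from Topaz, Acrobat, or FineReader.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, candidate):
    """Edit distance from candidate to reference, per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, candidate) / len(reference)


# Made-up strings for illustration only, not real output from any tool.
reference = "It was the best of times"
outputs = {
    "method_a": "It was the best of times",
    "method_b": "lt was the hest of tirnes",
}

# Rank the methods from fewest to most errors against the reference.
ranked = sorted(outputs, key=lambda name: char_error_rate(reference, outputs[name]))
for name in ranked:
    print(name, round(char_error_rate(reference, outputs[name]), 3))
```

This only measures text accuracy. It says nothing about how well formatting or pagination is preserved, so those would still need a separate check.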

Last edited by Coconut; 01-25-2010 at 03:51 PM.