01-25-2010, 03:37 PM   #355
clarknova
Addict
Posts: 241
Karma: 2617
Join Date: Mar 2009
Location: Greenwood, SC
Device: Kindle 2
Quote:
Originally Posted by Coconut
Using Ubuntu, I installed the librsvg2-bin package, which I used for the conversion. The command line I used -- in the svg directory -- was "for i in page*.svg; do rsvg-convert -a -f pdf $i -o `echo $i | sed -e ' s/svg$/pdf/'`; done"

This created individual PDFs for each page: 305 pages in total, at 197 megabytes. I combined those using Acrobat, and then ran 'optimize for OCR'. The resulting file is beautiful, with all images, and smooth, and weighs in at 3407K. Awesome.

Yeah, I ended up doing something similar: I used Inkscape to render each page to a huge PNG (1200dpi!), imported them into Acrobat, used Acrobat to OCR the pages, and then optimized the result. It isn't as pretty as the SVG (because the glyphs have been rasterized), but it's totally usable as a searchable PDF that retains the original book formatting at an acceptable file size.
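
For anyone who wants to script the Inkscape step, something along these lines might work. It's only a rough sketch: the function names (png_name, render_pages) and the DRY_RUN switch are made up for illustration, and the flags assume a pre-1.0 Inkscape. Newer Inkscape releases use --export-type=png and --export-filename=FILE instead of --export-png, so check your version first.

```shell
#!/bin/sh
# Sketch: render every page*.svg in a directory to a PNG at a given DPI.
# Assumes a pre-1.0 Inkscape command line (--export-png, --export-dpi).

# png_name is a hypothetical helper: page0001.svg -> page0001.png
png_name() {
    printf '%s\n' "${1%.svg}.png"
}

# render_pages DIR [DPI]
# Set DRY_RUN=1 to print the commands instead of running them.
render_pages() {
    dir=$1
    dpi=${2:-1200}
    for svg in "$dir"/page*.svg; do
        [ -e "$svg" ] || continue   # the glob matched nothing
        png=$(png_name "$svg")
        if [ -n "${DRY_RUN:-}" ]; then
            echo inkscape --export-png="$png" --export-dpi="$dpi" "$svg"
        else
            inkscape --export-png="$png" --export-dpi="$dpi" "$svg"
        fi
    done
}
```

Running DRY_RUN=1 and then render_pages ./svg 1200 should just list the commands, which is a cheap way to check the file names before rendering 300-odd pages at that resolution.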

Unfortunately, I have no use for PDFs (since PDF isn't an ebook format). But for people who do, this is certainly an option, provided they have Acrobat Pro.

(Acrobat's OCR is neat. There are a few more OCR errors than in the original Topaz file, but it attempts to preserve style -- though not very well...)

Kevin: All of the open source OCR stuff is pretty obsolete and useless. It tends to produce way more errors than the Topaz file or even Adobe's OCR.

Personally, I find the genhtml to be the most usable. I just have to convert (using imagemagick or illustrator or inkscape or whatever) the Monogram and Table SVGs that get generated into PNG/JPEG so I can create an ePub out of the data.
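
If you'd rather not open each SVG by hand, the conversion could be batched roughly like this. Treat it as a sketch: I'm using rsvg-convert (from the librsvg2-bin package mentioned above) rather than any of the tools I listed, the names svgs_to_png, run and DRY_RUN are invented for the example, and you pass in whatever SVG files your run actually produced.

```shell
#!/bin/sh
# Sketch: convert the SVG files given as arguments to PNGs in an output directory.

# run is a hypothetical helper: with DRY_RUN set it prints the command,
# otherwise it executes it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# svgs_to_png OUT_DIR FILE.svg...
svgs_to_png() {
    out_dir=$1
    shift
    mkdir -p "$out_dir" || return 1
    for svg in "$@"; do
        base=$(basename "$svg" .svg)   # strip directory and .svg suffix
        run rsvg-convert -f png -o "$out_dir/$base.png" "$svg" || return 1
    done
}
```

The PNGs land in one folder, so they can be dropped straight into the ePub's image directory and the HTML pointed at them.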