03-02-2013, 01:59 PM   #341
willus
Quote:
Originally Posted by dgvirtual
Is it possible to use k2pdfopt just to get OCR text embedded into a normal (not optimized for e-reader) scanned PDF?

I know there is a free Linux command-line program for that, pdfsandwich, but somehow it does not work on Lithuanian text (the embedded text is missing all Lithuanian characters afterwards), while k2pdfopt seems to do Lithuanian character recognition fine.

It's really the Tesseract engine that does the OCR (I assume you're using Tesseract if you're successfully OCR'ing Lithuanian text), so I can't take much credit for it. But yes, you can do OCR without optimizing for an e-reader-sized screen:

k2pdfopt -mode copy -ocr t mydoc.pdf

You can tweak the display resolution with the -dr option, e.g. -dr 2 will double the display resolution. You can see what "-mode copy" does on my command-line usage help page.
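
If you have several scans to process, something like the following might help. It is only a rough sketch: the helper name k2_ocr_cmd is made up for illustration, it uses nothing beyond the -mode copy, -ocr t and -dr options mentioned above, and it only prints the commands (a dry run) so you can check them before running anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of k2pdfopt): print the k2pdfopt command
# line for one PDF, using only the options discussed in this post.
#   $1 = input PDF
#   $2 = optional display-resolution factor passed to -dr
k2_ocr_cmd() {
    if [ -n "$2" ]; then
        printf 'k2pdfopt -mode copy -ocr t -dr %s %s\n' "$2" "$1"
    else
        printf 'k2pdfopt -mode copy -ocr t %s\n' "$1"
    fi
}

# Dry run: print one command per PDF in the current directory.
for f in *.pdf; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    k2_ocr_cmd "$f" 2
done
```

Once the printed commands look right, you could replace the printf lines with the actual k2pdfopt calls (quoting "$1" so file names with spaces survive).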

Edit: Actually, this doesn't work for complicated (multi-column) layouts. I'll have to think about a way to do that...

Last edited by willus; 03-02-2013 at 03:21 PM.