Originally Posted by dgvirtual
Is it possible to use k2pdfopt just to get OCR text embedded into a normal (not optimized for e-reader) scanned PDF?
I know there is a free Linux command line program for that, pdfsandwich, but somehow it does not work on Lithuanian text (the embedded text misses all Lithuanian characters afterwards), while k2pdfopt seems to do Lithuanian character recognition fine.
It's really the Tesseract engine that does the OCR (I assume you're using Tesseract if you're successfully OCR'ing Lithuanian text), so I can't take much credit for it. But yes, you can do OCR without optimizing for an e-reader-sized screen:
k2pdfopt -mode copy -ocr t mydoc.pdf
You can tweak the display resolution with the -dr option, e.g. -dr 2 will double the display resolution. You can see what "-mode copy" does on my command-line usage help page.
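If you want to try a few -dr values without retyping everything, here is a rough sketch of a bash helper that only assembles the command line from the options mentioned above (-mode copy, -ocr t, -dr). The function name ocr_copy_cmd is made up for illustration and is not part of k2pdfopt. It just prints the command rather than running it, and it does not quote file names that contain spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of k2pdfopt): print the OCR-only command line.
# Uses only the options mentioned in this thread: -mode copy, -ocr t, -dr.
ocr_copy_cmd() {
    local input="$1"   # path to the scanned PDF
    local dr="$2"      # optional display-resolution multiplier, e.g. 2
    local cmd=(k2pdfopt -mode copy -ocr t)
    if [ -n "$dr" ]; then
        cmd+=(-dr "$dr")   # only add -dr when a value was given
    fi
    cmd+=("$input")
    # Print the command; to actually run it, use "${cmd[@]}" instead of echo.
    echo "${cmd[*]}"
}

ocr_copy_cmd mydoc.pdf 2
```

Leaving off the second argument gives the plain command from above with no -dr option.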
Edit: Actually, this doesn't work for complicated (multi-column) layouts. I'll have to think about a way to do that...