Quote:
Originally Posted by dgvirtual
Is it possible to use k2pdfopt just to get ocr text embedded into a normal (not optimized for ereader) scanned pdf?
I know there is a free Linux command line program for that - pdfsandwich - but somehow it does not work on Lithuanian text (the embedded text is missing all Lithuanian characters afterwards), while k2pdfopt seems to do Lithuanian character recognition fine.
It's really the Tesseract engine that does the OCR (I assume you're using Tesseract if you're successfully OCR'ing Lithuanian text), so I can't take much credit for it. But yes, you can do OCR without optimizing for an e-reader-sized screen:
k2pdfopt -mode copy -ocr t mydoc.pdf
You can tweak the display resolution with the -dr option, e.g. -dr 2 will double the display resolution. You can see what "-mode copy" does on my command-line usage help page.
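If you have several scanned PDFs to process, the two options above can be combined in a small shell wrapper. This is only a rough sketch: ocr_copy and DRY_RUN are hypothetical names made up for the example (they are not part of k2pdfopt), and the only k2pdfopt options used are the ones mentioned in this post.

```shell
#!/bin/sh
# ocr_copy: hypothetical helper, not part of k2pdfopt.
# Usage: ocr_copy <display-resolution-factor> file1.pdf [file2.pdf ...]
# Runs k2pdfopt in copy mode with Tesseract OCR on each file.
# Set DRY_RUN=1 (an assumption of this sketch) to print the commands
# instead of running them.
ocr_copy() {
    dr="$1"
    shift
    for pdf in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "k2pdfopt -mode copy -ocr t -dr $dr $pdf"
        else
            k2pdfopt -mode copy -ocr t -dr "$dr" "$pdf"
        fi
    done
}
```

For example, DRY_RUN=1 ocr_copy 2 mydoc.pdf prints the command that would be run for mydoc.pdf at double the display resolution, which is a cheap way to check the arguments before starting a long OCR job.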
Edit: Actually, this doesn't work for complicated (multi-column) layouts. I'll have to think about a way to do that...