MobileRead Forums - View Single Post - k2pdfopt: optimizes PDFs for viewing on e-readers

willus · 08-26-2013, 12:37 AM

Quote:

Originally Posted by state

Hi there,

I am very new to e-readers in general, and I am also very new to k2pdfopt. To make matters worse, I am not so savvy with computing. However, I did attempt to set up Tesseract and the environment variable, but I still get the error shown in the screenshot. Any ideas? Do I have to set another environment variable for k2pdfopt itself?

Also, is there a k2pdfopt guide for dummies? I appreciate the help sections on the site, but they are still a bit too fast for me. I will be using the programme exclusively for creating PDFs from linguistics PDFs (typically two-column, with diagrams and charts, classic science articles). Thank you!

If you go to your C: drive and open the tesseract-ocr folder, there should be a "tessdata" folder inside it, and that folder should contain the English training files. Those files have to be extracted from the tar.gz file that you download from the Tesseract web site. It's a bit involved. Have you considered using Wallauer's Windows GUI from my third-party contributions page? I believe it will install the Tesseract files for you.
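If you have Python installed, a rough way to check the folder layout described above is sketched below. It is only an illustration under some assumptions: the main English file is assumed to be named eng.traineddata, the environment variable is assumed to be TESSDATA_PREFIX, and C:\tesseract-ocr is just the example location from this post. The helper name find_english_data is made up for this sketch. Because setups differ on whether the variable points at the tesseract-ocr folder or at tessdata itself, the sketch looks in both places.

```python
import os
from pathlib import Path


def find_english_data(root_dir):
    """Return the path to eng.traineddata if it can be found, else None.

    Looks in <root_dir>/tessdata first, then in <root_dir> itself,
    since it is an assumption which of the two the variable points at.
    """
    root = Path(root_dir)
    candidates = [
        root / "tessdata" / "eng.traineddata",
        root / "eng.traineddata",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Assumption: TESSDATA_PREFIX is the variable you set; the fallback
    # path is only the example folder mentioned in this post.
    root = os.environ.get("TESSDATA_PREFIX", r"C:\tesseract-ocr")
    found = find_english_data(root)
    if found is None:
        print("No eng.traineddata found under", root)
    else:
        print("Found English training data:", found)
```

If this reports that nothing was found, the training files most likely have not been extracted into the tessdata folder yet, or the variable points somewhere else.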

Are your linguistics PDFs mostly scanned or not? If they aren't scanned (if they are generated directly from a source file with the original text), you should be able to use "-mode 2col" and skip OCR altogether, e.g.

k2pdfopt -mode 2col myfile.pdf

Otherwise, OCR is probably the way to go. Sorry, there's no "for dummies" guide at the moment. All I've got is my help pages, but again, the Windows GUI may make things easier for you. You might also want to watch the video on the Native PDF page.

Edit: I've attached a screenshot of my Tesseract data folder (on my D drive). To OCR English text, you need the files shown, which have to be extracted from the downloaded training file (ends in .tar.gz).
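If you don't have a tool that opens .tar.gz files, Python's standard tarfile module can do the extraction. The sketch below is just one way to do it, not part of k2pdfopt or Tesseract: the function name extract_training_archive is made up, and the archive and destination paths in the usage note are placeholders you would replace with your own.

```python
import tarfile
from pathlib import Path


def extract_training_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and list the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    # Return every regular file now under dest, so you can see where
    # the training files ended up.
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

For example, extract_training_archive(r"C:\Downloads\english-data.tar.gz", r"C:\temp\tess") would unpack a downloaded archive (hypothetical file name) into a scratch folder. The archive may contain nested folders, so check the returned list and copy the training files into your tessdata folder if they did not land there directly.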