08-26-2013, 12:37 AM   #507
willus
Fuzzball, the purple cat
Quote:
Originally Posted by state
Hi there,

I am very new to ereaders in general, and I am also very new to k2pdfopt. To make matters worse, I am not very savvy with computers. However, I did attempt to set up Tesseract and the environment variable, but I still get the error shown in the screenshot. Any ideas? Do I have to set another environment variable for k2pdfopt itself?

Also, is there a k2pdfopt guide for dummies? I appreciate the help sections on the site, but they still move a bit too fast for me. I will be using the programme exclusively for converting linguistics PDFs (typically two-column, with diagrams and charts, classic science articles). Thank you!
If you go to your C: drive and open the tesseract-ocr folder, there should be a "tessdata" folder inside it. That folder should contain the English training files, which have to be extracted from the tar.gz file that you download from the Tesseract web site. It's a bit involved. Have you considered using Wallauer's Windows GUI from my third-party contributions page? I believe it will install the Tesseract files for you.
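If you happen to have Python installed, the folder check described above can be sketched roughly as follows. This is only an illustration, not part of k2pdfopt or Tesseract: the function name `missing_tessdata_files` is made up for this example, the path is assumed from the description above, and it only looks for `eng.traineddata`, the main English data file (your download may include other files as well).

```python
from pathlib import Path


def missing_tessdata_files(tessdata_dir, required=("eng.traineddata",)):
    """Return the names in `required` that are not found in tessdata_dir.

    Hypothetical helper for illustration; the default file list is an
    assumption and may not cover every file your setup needs.
    """
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        # The folder itself is missing, so everything is missing.
        return list(required)
    # Compare names case-insensitively, since Windows file names are.
    present = {p.name.lower() for p in folder.iterdir() if p.is_file()}
    return [name for name in required if name.lower() not in present]


if __name__ == "__main__":
    # Assumed location, following the description above; adjust to your install.
    missing = missing_tessdata_files(r"C:\tesseract-ocr\tessdata")
    print("Missing files:", missing if missing else "none")
```

If the script reports that nothing is missing and the error persists, the problem is more likely the environment variable than the training files.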

Are your linguistics PDFs mostly scanned or not? If they aren't scanned (that is, if they were generated directly from a source file and still contain the original text), you should be able to use "-mode 2col" and skip OCR altogether, e.g.

k2pdfopt -mode 2col myfile.pdf

Otherwise, OCR is probably the way to go. Sorry, there's no "for dummies" guide at the moment. All I've got is my help pages, but again, the Windows GUI may make things easier for you. You might also want to watch the video on the Native PDF page.

Edit: I've attached a screenshot of my Tesseract data folder (on my D drive). To OCR English text, you need the files shown, which have to be extracted from the downloaded training file (ends in .tar.gz).
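If you don't have a tool that opens .tar.gz files, the extraction step can also be done with Python's standard `tarfile` module. The sketch below is just one way to do it, under some assumptions: `extract_training_archive` is a name invented for this example, both paths are whatever you pass in, and the archive may carry its own folder structure, so check the returned names to see where the files actually landed. Only extract archives you downloaded from a source you trust.

```python
import tarfile
from pathlib import Path


def extract_training_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir.

    Returns the paths (relative to dest_dir) of the regular files in the
    archive. Hypothetical helper for illustration only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Record regular files only, skipping directory entries.
        names = [m.name for m in archive.getmembers() if m.isfile()]
        archive.extractall(dest)
    return names
```

Printing the returned list shows whether the files ended up directly in the destination or in a subfolder, which tells you whether anything needs to be moved into "tessdata".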
Attached Thumbnails
Name: tessfiles_english.png (71.0 KB)

Last edited by willus; 08-28-2013 at 08:31 AM. Reason: Typo corrected