MobileRead Forums - View Single Post - k2pdfopt: optimizes PDFs for viewing on e-readers

willus · 04-04-2013, 11:08 PM

Quote:

Originally Posted by MaxStirner

Is there any way to OCR pages in multiple languages, for example a dictionary page, which is (usually) bilingual? I don't know whether Tesseract allows this, so it might be impossible to achieve.

This is a better question for the Tesseract folks, but you can always try the English-language OCR in Tesseract and see what you get. For fun, I tried OCR-ing the attached document (multilingual.pdf), which I created using Google Translate.

With the English Tesseract training pack (result in multi_eng.pdf), the first three pages (English, French, and German) OCR mostly correctly. Some of the special French characters come through, but others are lost or come out wrong, the German umlauts don't come through, and the Russian (Cyrillic) page doesn't come out correctly at all. With the Russian training pack (result in multi_rus.pdf), the Russian page is (mostly) correct, but none of the others are.

So it depends partly on how different the languages are. Unfortunately, I don't see any generic "Romance language" training packs for Tesseract. English is the largest training data package (other than the Asian languages), so I'd guess it's your best bet for English/French/Spanish and other languages that use the Latin alphabet, though I can't say for certain. Again, a Tesseract expert would have to weigh in.

Note that to see the Russian characters correctly, you need to copy and paste the text from the Russian PDF page into a Unicode-aware application (like the Google Translate box in a modern browser), because k2pdfopt does not use the correct Cyrillic font. The commands I used were:

k2pdfopt -mode copy -ocr t -ocrvis t multilingual.pdf -ocrlang eng -o multi_eng.pdf

k2pdfopt -mode copy -ocr t -ocrvis t multilingual.pdf -ocrlang rus -o multi_rus.pdf
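
To try more than two training packs against the same file, the two commands above could be wrapped in a small loop. This is only a sketch: the flags are exactly the ones used above, but the function names (ocr_cmd, ocr_each_lang) and the DRY_RUN switch are made up for this example, and each language pack you name is assumed to be installed for Tesseract already.

```shell
#!/bin/sh
# Sketch: run the same k2pdfopt OCR pass once per Tesseract language pack.
# The k2pdfopt flags come from the commands above; ocr_cmd, ocr_each_lang,
# and DRY_RUN are hypothetical names invented for this example.

# Print the k2pdfopt command line for one source file and one language code.
ocr_cmd() {
    printf 'k2pdfopt -mode copy -ocr t -ocrvis t %s -ocrlang %s -o multi_%s.pdf\n' \
        "$1" "$2" "$2"
}

# For each language code, run k2pdfopt, or only print the command
# when DRY_RUN=1 is set in the environment.
ocr_each_lang() {
    src=$1
    shift
    for lang in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            ocr_cmd "$src" "$lang"
        else
            k2pdfopt -mode copy -ocr t -ocrvis t "$src" \
                -ocrlang "$lang" -o "multi_${lang}.pdf"
        fi
    done
}
```

Calling ocr_each_lang multilingual.pdf eng rus would produce multi_eng.pdf and multi_rus.pdf as before; with DRY_RUN=1 set it only prints the two command lines, which is handy for checking the output names before committing to a long OCR run. Each pass still uses a single language pack, so this only makes it easier to compare the per-language results side by side.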