04-04-2013, 11:08 PM   #382
willus
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by MaxStirner
Is there any way to OCR multi-language pages, for example a dictionary page, which is (usually) bilingual? I don't have any idea if Tesseract allows this, so it might be impossible to achieve.
This is a better question for the Tesseract folks, but you can always just try the English-language OCR in Tesseract and see what you get. For fun, I tried OCR-ing the attached document (multilingual.pdf), which I created using Google Translate. With the English Tesseract training pack (result in multi_eng.pdf), the first three pages (English, French, and German) OCR mostly correctly. Some of the special French characters come through, but others are lost or wrong; the German umlauts don't come through; and the Russian (Cyrillic) page isn't recognized correctly at all. With the Russian training pack (result in multi_rus.pdf), the Russian page is (mostly) correct, but none of the others are.

So it depends partly on how different the languages are. Unfortunately, I don't see any generic "Romance language" training packs for Tesseract. English is the largest training data package (other than the Asian languages), so I'd guess it's your best bet for English/French/Spanish and other Latin-alphabet languages, though I can't say for certain. Again, a Tesseract expert would have to weigh in.

Note that to see the Russian characters correctly, you need to copy the text from the Russian PDF page and paste it into a Unicode-aware application (like the Google Translate box in a modern browser). K2pdfopt does not use the correct Cyrillic font. The commands I used were:

k2pdfopt -mode copy -ocr t -ocrvis t multilingual.pdf -ocrlang eng -o multi_eng.pdf

k2pdfopt -mode copy -ocr t -ocrvis t multilingual.pdf -ocrlang rus -o multi_rus.pdf
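
If you want to compare more training packs, the two commands above can be generated with a small shell loop. This is only a sketch: the function name k2_ocr_cmd is made up for illustration, the flags are copied unchanged from the commands above, and the loop only prints the command lines (a dry run) rather than running k2pdfopt.

```shell
#!/bin/sh
# Hypothetical helper (not part of k2pdfopt): print the k2pdfopt command
# line for one Tesseract language code. The flags are copied from the
# two commands above; only the language and output name vary.
k2_ocr_cmd() {
    lang="$1"
    echo "k2pdfopt -mode copy -ocr t -ocrvis t multilingual.pdf -ocrlang ${lang} -o multi_${lang}.pdf"
}

# Dry run: list one command per training pack to compare.
for lang in eng rus; do
    k2_ocr_cmd "$lang"
done
```

Once the printed commands look right, piping the loop's output to sh should run them, assuming k2pdfopt is on your PATH and the matching Tesseract training packs are installed. Add more language codes to the list to test other packs.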
Attached Files
File Type: pdf multilingual.pdf (161.2 KB, 349 views)
File Type: pdf multi_eng.pdf (16.8 KB, 344 views)
File Type: pdf multi_rus.pdf (22.0 KB, 628 views)

Last edited by willus; 04-04-2013 at 11:14 PM.