08-17-2013, 08:25 PM   #503
willus
Dual language OCR example with k2pdfopt and Tesseract

Quote:
Originally Posted by MaxStirner
Sorry to bother you again, willus, but maybe you remember my question about multi-language support. Yesterday I was browsing the Tesseract Google group for no specific reason and stumbled across this post:
https://groups.google.com/forum/#!ms...I/QMMHDV_GWRIJ
Don't know if this is of any help to you, but just in case...

Tesseract's dual-language OCR does seem to work in k2pdfopt v1.66, though it did poorly in my test case, where I mixed English and Chinese. I used this command:

k2pdfopt -ocr dual_english_chinese.pdf -mode copy -ocrlang language

where I substituted different values for language: eng, chi_tra, chi_tra+eng, and eng+chi_tra. See the attached files. The best results, by far, came from using chi_tra alone, which sort of defeats the purpose of dual-language OCR(!). Each result was different, though, so I am assuming that the mechanism of passing lang1+lang2 to Tesseract is working and that this was just a particularly poor case for Tesseract. Maybe mixed European languages will work better?
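
If you want to repeat the comparison, here is a rough sketch of a loop over the four language values. It uses only the flags from the command above; the function names, the LANGS list, and the DRY_RUN switch are my own illustrative additions, not anything built into k2pdfopt. By default it only prints the commands. Note that k2pdfopt may write to the same output name on every run, so you would probably need to rename the result between runs.

```shell
#!/bin/sh
# Sketch only: repeat the k2pdfopt command from the post once per -ocrlang value.
# SRC, LANGS, k2_cmd, run_all and DRY_RUN are hypothetical names for illustration.

SRC="dual_english_chinese.pdf"
LANGS="eng chi_tra chi_tra+eng eng+chi_tra"

# Print the command line for a single language value.
k2_cmd() {
    printf 'k2pdfopt -ocr %s -mode copy -ocrlang %s\n' "$SRC" "$1"
}

# With DRY_RUN=1 (the default) just print each command;
# with DRY_RUN=0 actually invoke k2pdfopt.
run_all() {
    for lang in $LANGS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            k2_cmd "$lang"
        else
            k2pdfopt -ocr "$SRC" -mode copy -ocrlang "$lang"
        fi
    done
}

run_all
```

Run it as is to see the four command lines, or set DRY_RUN=0 in the environment to execute them (this assumes k2pdfopt and the chi_tra and eng Tesseract data are installed).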
Attached Thumbnails: dualocr_english_chinese_results.png (143.5 KB)
Attached Files: dual_english_chinese.pdf (45.5 KB)