Old 01-17-2018, 11:19 PM   #1510
willus
Fuzzball, the purple cat
Posts: 1,299
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by MarjaE
Here, volume 1 of Soviet Russia gives spectacularly bad ocr results in English, and the 1st issue of volume 3 completely failed to ocr:

https://www.marxists.org/history/usa/pubs/srp/index.htm

Here, volumes 1 and 2 completely failed to ocr in Russian:

http://militera.lib.ru/h/antonov-ovs...a01/index.html

For the first case (the 2-column Soviet Russia scans), you need to specify this option:

-ocrcol 2

For example:

k2pdfopt -mode copy -ocrcol 2 -ocr t myfile.pdf

That gets k2pdfopt to correctly OCR a 2-column document when all you want is OCR applied to the existing pages. I've attached an OCR of page 100.

Not sure what's wrong with the Antonov documents. I tried OCR-ing in Russian and it did work, though the result had a number of mistakes:

k2pdfopt -mode copy -ocrlang rus -ocr t myfile.pdf

You can add the -p option to quickly test the conversion on just one page, e.g.

k2pdfopt -mode copy -ocrlang rus -ocr t -p 95 myfile.pdf
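If you want to spot-check a few pages this way, a small shell helper can generate the test commands for you. This is only a sketch: the function name k2_test_cmds is made up for illustration, it uses nothing but the options shown in this post, and it prints the commands rather than running them, so you can inspect them and run each one yourself.

```shell
#!/bin/sh
# Hypothetical helper (not part of k2pdfopt): print one single-page OCR test
# command per page number, using only the options shown in the post above.
# Commands are printed, not executed. File names containing spaces would
# need quoting before you paste the output into a shell.
k2_test_cmds() {
    lang="$1"    # value passed to -ocrlang, e.g. rus
    file="$2"    # source PDF
    shift 2      # remaining arguments are page numbers
    for page in "$@"; do
        printf 'k2pdfopt -mode copy -ocrlang %s -ocr t -p %s %s\n' \
            "$lang" "$page" "$file"
    done
}

# Prints: k2pdfopt -mode copy -ocrlang rus -ocr t -p 95 myfile.pdf
k2_test_cmds rus myfile.pdf 95
```

Passing more page numbers after the file name prints one command per page.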

I've attached this conversion as well (physical page 95). If you do this kind of thing a lot, you might seriously consider getting Office 365. I loaded one of the Russian volumes into MS Word and it did a remarkably good job of converting it to Russian text. I've attached a screenshot of page 90 loaded into MS Word, with some text selected, along with a graphic of the same page taken directly from the original PDF file.
Attached Thumbnails
screenshot.png (230.5 KB)
screenshot_pdf.png (693.2 KB)
Attached Files
File Type: pdf soviet_v1_p100_with_ocr.pdf (252.1 KB, 178 views)
File Type: pdf antonov_v1_p95_with_ocr.pdf (1.03 MB, 237 views)

Last edited by willus; 01-17-2018 at 11:24 PM.