01-16-2018, 12:46 PM | #1501 |
Guru
Posts: 841
Karma: 2525050
Join Date: Jun 2010
Device: K3W, PW4
|
I just encountered a case where my current k2pdfopt command line is failing to work. I have been using "k2pdfopt -bpc 1 -m .11in,.04in,.14in,.4in -x -as -er 1 -ch 0.5" with pdf images of magazine articles, but this fails for a few of them. I have attached the one out of eighteen articles that for some reason is not being properly processed. I have tried a few minor tweaks, but I still get garbled output from k2pdfopt for this article.
Please help me get this one sorted out. Dave |
01-16-2018, 03:26 PM | #1502 |
Guru
Posts: 927
Karma: 53902736
Join Date: Jun 2015
Device: multiple
|
Thanks.
I finally got Tesseract working in English, but still can't get it working in the other languages I've downloaded. -ocr lan[guage] ignores the lan[guage] and does English. -ocrlang lan[guage] skips ocr. |
01-16-2018, 09:16 PM | #1503 | |
Fuzzball, the purple cat
Posts: 1,274
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
k2pdfopt -ocrlang chi_tra -ocr t mydoc.pdf
And it seems to work, telling me it selected Chinese. Are you sure you have the other language training files in place? Here is how my Tesseract Data folder looks:
Code:
DATE      TIME    SIZE        FILE
08/23/11  02:13p  139         rus.cube.fold
08/23/11  02:13p  317         rus.cube.params
08/23/11  02:13p  912,800     rus.cube.nn
08/23/11  02:13p  278         rus.cube.lm
08/23/11  02:13p  7,064,074   rus.cube.word-freq
08/23/11  02:13p  15,241,687  rus.cube.size
10/08/12  03:42p  15,636,141  rus.traineddata
10/16/12  01:00p  39,973,777  chi_sim.traineddata
10/16/12  01:00p  54,349,418  chi_tra.traineddata
10/17/12  07:55a  254         eng.cube.params
10/17/12  07:55a  857,304     eng.cube.nn
10/17/12  07:55a  171,918     eng.cube.bigrams
10/17/12  07:55a  181         eng.cube.lm
10/17/12  07:55a  996         eng.tesseract_cube.nn
10/17/12  07:55a  2,444,187   eng.cube.word-freq
10/17/12  07:55a  13,020,078  eng.cube.size
10/17/12  07:55a  38          eng.cube.fold
09/01/13  12:25p  21,876,572  eng.traineddata
Last edited by willus; 01-16-2018 at 09:19 PM. |
|
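For anyone who wants to check the "training files in place" question without reading a folder listing by eye, here is a minimal Python sketch. It only looks for a `<language>.traineddata` file for each language code, which is the naming pattern visible in the listing above. The function name `missing_traineddata` is made up for illustration, and the folder you pass in is an assumption: use whatever folder your own install reads its Tesseract data from (on many setups that is the folder the TESSDATA_PREFIX environment variable points to, but check yours).

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the language codes with no <code>.traineddata file in tessdata_dir.

    Hypothetical helper: it checks file names only and says nothing about
    whether a file is intact or matches the Tesseract version in use.
    """
    folder = Path(tessdata_dir)
    return [
        lang
        for lang in languages
        if not (folder / f"{lang}.traineddata").is_file()
    ]
```

Calling `missing_traineddata("/path/to/tessdata", ["eng", "rus", "chi_tra"])` returns the codes whose file is absent, so an empty list means every requested language has a training file in that folder. It cannot tell you whether k2pdfopt is actually looking in that folder.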
01-16-2018, 10:42 PM | #1504 | |
Fuzzball, the purple cat
Posts: 1,274
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
01-16-2018, 11:53 PM | #1505 |
Guru
Posts: 927
Karma: 53902736
Join Date: Jun 2015
Device: multiple
|
P.S. After more testing, it works on some files, but not others. I know there are a couple file-by-file bugs that can derail tesseract; for example, it won't work with unstated resolution.
P.P.S. I've tried running them through k2pdfopt twice, the first time to set a resolution, and the second to ocr. No luck.
Last edited by MarjaE; 01-17-2018 at 01:03 AM. |
01-17-2018, 09:26 AM | #1506 | |
Fuzzball, the purple cat
Posts: 1,274
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
Resolution should not be an issue with k2pdfopt, so long as it is loading and processing your file without complaining, and the size of the source document looks reasonable (see attached). K2pdfopt sends the words to Tesseract one at a time to be converted by OCR, and it does not include a resolution when it does this. |
|
01-17-2018, 11:18 AM | #1507 | |
Guru
Posts: 841
Karma: 2525050
Join Date: Jun 2010
Device: K3W, PW4
|
Quote:
Dave |
|
01-17-2018, 08:43 PM | #1508 |
Guru
Posts: 927
Karma: 53902736
Join Date: Jun 2015
Device: multiple
|
Here, volume 1 of Soviet Russia gives spectacularly bad ocr results in English, and the 1st issue of volume 3 completely failed to ocr:
https://www.marxists.org/history/usa/pubs/srp/index.htm
Here, volumes 1 and 2 completely failed to ocr in Russian:
http://militera.lib.ru/h/antonov-ovs...a01/index.html |
01-17-2018, 11:08 PM | #1509 | |
Fuzzball, the purple cat
Posts: 1,274
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
01-17-2018, 11:19 PM | #1510 | |
Fuzzball, the purple cat
Posts: 1,274
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
-ocrcol 2
E.g.
k2pdfopt -mode copy -ocrcol 2 -ocr t myfile.pdf
That will get k2pdfopt to correctly OCR a 2-column document where you only want OCR applied. I've attached an OCR of page 100.
Not sure what's wrong with the antonov documents. I tried OCR-ing in Russian and it did work, though it had a number of mistakes.
k2pdfopt -mode copy -ocrlang rus -ocr t myfile.pdf
You can add the -p option to quickly test just one page of conversion, e.g.
k2pdfopt -mode copy -ocrlang rus -ocr t -p 95 myfile.pdf
I've attached this conversion as well (physical page 95).
You might seriously consider getting Office 365 if you do this kind of thing a lot. I loaded one of the Russian volumes into MS Word and it did a remarkably good job converting it to Russian text. I've attached a screen shot of page 90 loaded into MS Word, with some text selected, along with a graphic of the same page directly from the original PDF file.
Last edited by willus; 01-17-2018 at 11:24 PM. |
|
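If you end up repeating these one-page tests across several files and languages, a small script can assemble the command line for you. The sketch below is only an illustration: `ocr_test_command` is a made-up helper name, it uses nothing beyond the options quoted in the post above (-mode copy, -ocrcol, -ocrlang, -ocr t, -p), and it assumes k2pdfopt is on your PATH if you go on to run the result.

```python
import shlex


def ocr_test_command(pdf, lang=None, page=None, columns=None):
    """Build a k2pdfopt argument list from the options quoted in this thread.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    cmd = ["k2pdfopt", "-mode", "copy"]
    if columns is not None:
        cmd += ["-ocrcol", str(columns)]  # e.g. 2 for a two-column source
    if lang is not None:
        cmd += ["-ocrlang", lang]  # e.g. "rus"
    cmd += ["-ocr", "t"]
    if page is not None:
        cmd += ["-p", str(page)]  # convert just this page as a quick test
    cmd.append(pdf)
    return cmd


# Rebuilds the single-page Russian test from the post above.
print(shlex.join(ocr_test_command("myfile.pdf", lang="rus", page=95)))
# prints: k2pdfopt -mode copy -ocrlang rus -ocr t -p 95 myfile.pdf
```

The printed line is also handy to paste into a reply when reporting a failure, since it shows exactly which options were used. To run it from the script instead, the returned list can be handed to `subprocess.run`.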
01-18-2018, 12:41 AM | #1511 |
Guru
Posts: 927
Karma: 53902736
Join Date: Jun 2015
Device: multiple
|
I don't know what's going on, because when I try -ocrcol 2, I get a complete failure.
|
01-18-2018, 08:39 PM | #1512 |
Fuzzball, the purple cat
Posts: 1,274
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
|
01-24-2018, 01:12 PM | #1513 |
Guru
Posts: 927
Karma: 53902736
Join Date: Jun 2015
Device: multiple
|
I don't know what a screen shot would show.
I still get a lot of errors where either (a) tesseract skips entire pages, so if I try to select anything I get a green block with the image of the page, or (b) tesseract skips most of each page, so the ocred text is a small portion of each page, and if I select anything else I get a green block with the image of the page. |
01-24-2018, 02:39 PM | #1514 |
Guru
Posts: 927
Karma: 53902736
Join Date: Jun 2015
Device: multiple
|
Does K2 perform ocr before reformatting each page or after? I am trying to figure out why Tesseract in k2 isn't working for me, while Tesseract in Elucidate works. Unfortunately Elucidate re-encodes in Quartz, which doesn't work well with k2.
I also installed Tesseract on its own, but can't get it working on its own.
P.S. If K2 performs ocr after reformatting each page, maybe I could do an ocr run without reducing resolution: -mode copy -ocr (and as needed -ocrlang)
Last edited by MarjaE; 01-24-2018 at 05:57 PM. |
01-25-2018, 08:46 AM | #1515 |
Fuzzball, the purple cat
Posts: 1,274
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
You can either choose to help me out, or you can keep telling me things fail without a whole lot of detail, and the situation will likely never resolve. I try your files, give you a suggestion, and you say it's "a complete failure." I don't know what I can possibly do to diagnose a description like that, particularly when I just tried it on my system and it worked fine. I need files and screen shots to have a fighting chance. A screenshot shows me a lot of information about how you are running k2pdfopt and how it is failing.
|