k2pdfopt: optimizes PDFs for viewing on e-readers - Page 101

dhdurgee · 01-16-2018, 01:46 PM

I just encountered a case where my current k2pdfopt command line is failing to work. I have been using "k2pdfopt -bpc 1 -m .11in,.04in,.14in,.4in -x -as -er 1 -ch 0.5" with pdf images of magazine articles, but this fails for a few of them. I have attached the one out of eighteen articles that for some reason is not being properly processed. I have tried a few minor tweaks, but I still get garbled output from k2pdfopt for this article.

Please help me get this one sorted out.

Dave

MarjaE · 01-16-2018, 04:26 PM

Thanks.

I finally got Tesseract working in English, but still can't get it working in the other languages I've downloaded. -ocr lan[guage] ignores the lan[guage] and does English. -ocrlang lan[guage] skips ocr.

willus · 01-16-2018, 10:16 PM

Quote:

Originally Posted by MarjaE

Thanks.

I finally got Tesseract working in English, but still can't get it working in the other languages I've downloaded. -ocr lan[guage] ignores the lan[guage] and does English. -ocrlang lan[guage] skips ocr.

I just tried this (in Windows 10):

k2pdfopt -ocrlang chi_tra -ocr t mydoc.pdf

And it seems to work, telling me it selected Chinese. Are you sure you have the other language training files in place? Here is how my Tesseract Data folder looks:

Code:

DATE      TIME                    SIZE FILE
08/23/11  02:13p                   139 rus.cube.fold
08/23/11  02:13p                   317 rus.cube.params
08/23/11  02:13p               912,800 rus.cube.nn
08/23/11  02:13p                   278 rus.cube.lm
08/23/11  02:13p             7,064,074 rus.cube.word-freq
08/23/11  02:13p            15,241,687 rus.cube.size
10/08/12  03:42p            15,636,141 rus.traineddata
10/16/12  01:00p            39,973,777 chi_sim.traineddata
10/16/12  01:00p            54,349,418 chi_tra.traineddata
10/17/12  07:55a                   254 eng.cube.params
10/17/12  07:55a               857,304 eng.cube.nn
10/17/12  07:55a               171,918 eng.cube.bigrams
10/17/12  07:55a                   181 eng.cube.lm
10/17/12  07:55a                   996 eng.tesseract_cube.nn
10/17/12  07:55a             2,444,187 eng.cube.word-freq
10/17/12  07:55a            13,020,078 eng.cube.size
10/17/12  07:55a                    38 eng.cube.fold
09/01/13  12:25p            21,876,572 eng.traineddata
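
One quick way to answer the "are the training files in place" question is to look for a <lang>.traineddata file for each language, matching the naming in the listing above. This is a minimal shell sketch, not part of k2pdfopt: the function names are made up for illustration, and the data folder is passed as an argument because where k2pdfopt reads its Tesseract data from depends on your setup.

```shell
# has_traineddata DIR LANG: succeed if DIR/LANG.traineddata exists and is non-empty.
has_traineddata() {
    [ -s "$1/$2.traineddata" ]
}

# check_langs DIR LANG...: report each language code as found or MISSING.
check_langs() {
    dir=$1
    shift
    for lang in "$@"; do
        if has_traineddata "$dir" "$lang"; then
            echo "$lang: found"
        else
            echo "$lang: MISSING"
        fi
    done
}
```

For example, check_langs /path/to/tessdata eng rus chi_tra prints one line per language, where /path/to/tessdata stands in for your own data folder.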

willus · 01-16-2018, 11:42 PM

Quote:

Originally Posted by dhdurgee

I just encountered a case where my current k2pdfopt command line is failing to work. I have been using "k2pdfopt -bpc 1 -m .11in,.04in,.14in,.4in -x -as -er 1 -ch 0.5" with pdf images of magazine articles, but this fails for a few of them. I have attached the one out of eighteen articles that for some reason is not being properly processed. I have tried a few minor tweaks, but I still get garbled output from k2pdfopt for this article.

Please help me get this one sorted out.

Dave

It's good when debugging to run the -sm (show markings) option to see how k2pdfopt is cutting up your PDF. In this case, the -er 1 is doing you in--it is exacerbating a little mark between the columns on page 3, thereby preventing k2pdfopt from correctly finding the two columns. If you remove -er 1, it parses your document nicely.
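
Spelled out, that suggestion amounts to two runs: one with -sm added to inspect how the page is cut up, and one without -er 1 for the real conversion. The sketch below only prints the two command lines so they can be reviewed before running; article.pdf is a placeholder file name, and the options are the ones from the original command minus -er 1.

```shell
# Options from the original command line with "-er 1" removed.
opts="-bpc 1 -m .11in,.04in,.14in,.4in -x -as -ch 0.5"
pdf="article.pdf"   # placeholder file name, not from the thread

# Print the commands instead of running them.
echo "k2pdfopt $opts -sm $pdf"   # debug pass: -sm shows the markings
echo "k2pdfopt $opts $pdf"       # normal pass without the erosion filter
```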

MarjaE · 01-17-2018, 12:53 AM

P.S. After more testing, it works on some files, but not others. I know there are a couple file-by-file bugs that can derail tesseract; for example, it won't work with unstated resolution.

P.P.S. I've tried running them through k2pdfopt twice, the first time to set a resolution, and the second to ocr. No luck.

willus · 01-17-2018, 10:26 AM

Quote:

Originally Posted by MarjaE

P.S. After more testing, it works on some files, but not others. I know there are a couple file-by-file bugs that can derail tesseract; for example, it won't work with unstated resolution.

P.P.S. I've tried running them through k2pdfopt twice, the first time to set a resolution, and the second to ocr. No luck.

Can you post or PM me some examples of what you are talking about, and a screen shot of k2pdfopt converting the file? Does it claim that it correctly loaded the language, at least?

Resolution should not be an issue with k2pdfopt, so long as it is loading and processing your file without complaining, and the size of the source document looks reasonable (see attached). K2pdfopt sends the words to Tesseract one at a time to be converted by OCR, and it does not include a resolution when it does this.

dhdurgee · 01-17-2018, 12:18 PM

Quote:

Originally Posted by willus

It's good when debugging to run the -sm (show markings) option to see how k2pdfopt is cutting up your PDF. In this case, the -er 1 is doing you in--it is exacerbating a little mark between the columns on page 3, thereby preventing k2pdfopt from correctly finding the two columns. If you remove -er 1, it parses your document nicely.

Interesting, I would never have suspected -er 1 as causing the problem. I include that as sometimes the scans in the PDFs are faint and hard to read without it. Is there also an option that could counteract this by filtering out such scanning noise prior to parsing the document?

Dave

MarjaE · 01-17-2018, 09:43 PM

Here, volume 1 of Soviet Russia gives spectacularly bad ocr results in English, and the 1st issue of volume 3 completely failed to ocr:

https://www.marxists.org/history/usa/pubs/srp/index.htm

Here, volumes 1 and 2 completely failed to ocr in Russian:

http://militera.lib.ru/h/antonov-ovs...a01/index.html

willus · 01-18-2018, 12:08 AM

Quote:

Originally Posted by dhdurgee

Interesting, I would never have suspected -er 1 as causing the problem. I include that as sometimes the scans in the PDFs are faint and hard to read without it. Is there also an option that could counteract this by filtering out such scanning noise prior to parsing the document?

Dave

The -de option is supposed to do this, but it is not used by the algorithm that looks for gaps between two columns.

willus · 01-18-2018, 12:19 AM

Quote:

Originally Posted by MarjaE

Here, volume 1 of Soviet Russia gives spectacularly bad ocr results in English, and the 1st issue of volume 3 completely failed to ocr:

https://www.marxists.org/history/usa/pubs/srp/index.htm

Here, volumes 1 and 2 completely failed to ocr in Russian:

http://militera.lib.ru/h/antonov-ovs...a01/index.html

For the first case, you need to specify this option:

-ocrcol 2

E.g. k2pdfopt -mode copy -ocrcol 2 -ocr t myfile.pdf

That will get k2pdfopt to correctly OCR a 2-column document where you only want OCR applied. I've attached an OCR of page 100.

Not sure what's wrong with the antonov documents--I tried OCR-ing in Russian and it did work, though it had a number of mistakes.

k2pdfopt -mode copy -ocrlang rus -ocr t myfile.pdf

You can add the -p option to quickly test just one page of conversion, e.g.

k2pdfopt -mode copy -ocrlang rus -ocr t -p 95 myfile.pdf

I've attached this conversion as well (physical page 95). You might seriously consider getting Office 365 if you do this kind of thing a lot. I loaded one of the Russian volumes into MS Word and it did a remarkably good job converting it to Russian text. I've attached a screen shot of page 90 loaded into MS Word, with some text selected, along with a graphic of the same page directly from the original PDF file.
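
To repeat the single-page test across several files, the command above can be generated per file. This is a rough sketch only: print_ocr_tests is a made-up helper name, and it prints the command lines rather than running k2pdfopt, using just the options already shown in this post.

```shell
# print_ocr_tests LANG PAGE FILE...: print one single-page OCR test command per file.
print_ocr_tests() {
    lang=$1
    page=$2
    shift 2
    for f in "$@"; do
        echo "k2pdfopt -mode copy -ocrlang $lang -ocr t -p $page \"$f\""
    done
}
```

For example, print_ocr_tests rus 95 vol1.pdf vol2.pdf prints two commands (the file names are placeholders). For a two-column scan you would add -ocrcol 2 by hand.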

MarjaE · 01-18-2018, 01:41 AM

I don't know what's going on, because when I try -ocrcol 2, I get a complete failure.

willus · 01-18-2018, 09:39 PM

Quote:

Originally Posted by MarjaE

I don't know what's going on, because when I try -ocrcol 2, I get a complete failure.

A screen shot of the conversion attempt would be very helpful.

MarjaE · 01-24-2018, 02:12 PM

I don't know what a screen shot would show.

I still get a lot of errors where either (a) Tesseract skips entire pages, so if I try to select anything I get a green block with the image of the page, or (b) Tesseract skips most of each page, so the OCRed text is a small portion of each page, and if I select anything else I get a green block with the image of the page.

MarjaE · 01-24-2018, 03:39 PM

Does K2 perform ocr before reformatting each page or after? I am trying to figure out why Tesseract in k2 isn't working for me, while Tesseract in Elucidate works. Unfortunately Elucidate re-encodes in Quartz, which doesn't work well with k2.

I also installed Tesseract on its own, but can't get it working on its own.

P.S. If K2 performs ocr after reformatting each page, maybe I could do an ocr run without reducing resolution: -mode copy -ocr (and as needed -ocrlang)

willus · 01-25-2018, 09:46 AM

Quote:

Originally Posted by MarjaE

I don't know what a screen shot would show.

You can either choose to help me out, or you can keep telling me things fail without a whole lot of detail, and the situation will likely never resolve. I try your files, give you a suggestion, and you say it's "a complete failure." I don't know what I can possibly do to diagnose a description like that, particularly when I just tried it on my system and it worked fine. I need files and screen shots to have a fighting chance. A screenshot shows me a lot of information about how you are running k2pdfopt and how it is failing.
