MobileRead Forums > E-Book Formats > PDF

01-16-2018, 12:46 PM   #1501
dhdurgee
Guru
Posts: 829
Karma: 2525050
Join Date: Jun 2010
Device: K3W, PW4
I just encountered a case where my current k2pdfopt command line is failing to work. I have been using "k2pdfopt -bpc 1 -m .11in,.04in,.14in,.4in -x -as -er 1 -ch 0.5" with pdf images of magazine articles, but this fails for a few of them. I have attached the one out of eighteen articles that for some reason is not being properly processed. I have tried a few minor tweaks, but I still get garbled output from k2pdfopt for this article.

Please help me get this one sorted out.

Dave
Attached Files
File Type: pdf Analog_2018-01-16-The_Dissonant_Note.pdf (792.2 KB, 147 views)

01-16-2018, 03:26 PM   #1502
MarjaE
Guru
Posts: 924
Karma: 53902736
Join Date: Jun 2015
Device: multiple
Thanks.

I finally got Tesseract working in English, but still can't get it working in the other languages I've downloaded. -ocr lan[guage] ignores the lan[guage] and does English. -ocrlang lan[guage] skips ocr.

01-16-2018, 09:16 PM   #1503
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by MarjaE
Thanks.

I finally got Tesseract working in English, but still can't get it working in the other languages I've downloaded. -ocr lan[guage] ignores the lan[guage] and does English. -ocrlang lan[guage] skips ocr.
I just tried this (in Windows 10):

k2pdfopt -ocrlang chi_tra -ocr t mydoc.pdf

And it seems to work, telling me it selected Chinese. Are you sure you have the other language training files in place? Here is how my Tesseract Data folder looks:
Code:
DATE      TIME                    SIZE FILE
08/23/11  02:13p                   139 rus.cube.fold
08/23/11  02:13p                   317 rus.cube.params
08/23/11  02:13p               912,800 rus.cube.nn
08/23/11  02:13p                   278 rus.cube.lm
08/23/11  02:13p             7,064,074 rus.cube.word-freq
08/23/11  02:13p            15,241,687 rus.cube.size
10/08/12  03:42p            15,636,141 rus.traineddata
10/16/12  01:00p            39,973,777 chi_sim.traineddata
10/16/12  01:00p            54,349,418 chi_tra.traineddata
10/17/12  07:55a                   254 eng.cube.params
10/17/12  07:55a               857,304 eng.cube.nn
10/17/12  07:55a               171,918 eng.cube.bigrams
10/17/12  07:55a                   181 eng.cube.lm
10/17/12  07:55a                   996 eng.tesseract_cube.nn
10/17/12  07:55a             2,444,187 eng.cube.word-freq
10/17/12  07:55a            13,020,078 eng.cube.size
10/17/12  07:55a                    38 eng.cube.fold
09/01/13  12:25p            21,876,572 eng.traineddata
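
A quick way to run the same check on another machine is to list which languages actually have a `.traineddata` file in the Tesseract data folder. This is only a minimal sketch: the folder path is passed in by hand (how k2pdfopt locates that folder is not covered in this thread), and the function names are made up for illustration.

```python
from pathlib import Path
import sys


def installed_languages(tessdata_dir):
    """Return sorted language codes that have a <code>.traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    # "chi_tra.traineddata" -> "chi_tra"; files such as "rus.cube.fold" are ignored.
    return sorted(p.stem for p in folder.glob("*.traineddata"))


def missing_languages(tessdata_dir, wanted):
    """Return the language codes from `wanted` that have no .traineddata file."""
    have = set(installed_languages(tessdata_dir))
    return [lang for lang in wanted if lang not in have]


if __name__ == "__main__":
    # Hypothetical usage: python check_tessdata.py C:\path\to\tessdata
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    print("installed:", installed_languages(folder))
    print("missing:  ", missing_languages(folder, ["eng", "rus", "chi_tra"]))
```

If a language passed to -ocrlang shows up under "missing", the training file is not where this script is looking, which is worth ruling out before anything else.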

Last edited by willus; 01-16-2018 at 09:19 PM.

01-16-2018, 10:42 PM   #1504
willus
Quote:
Originally Posted by dhdurgee
I just encountered a case where my current k2pdfopt command line is failing to work. I have been using "k2pdfopt -bpc 1 -m .11in,.04in,.14in,.4in -x -as -er 1 -ch 0.5" with pdf images of magazine articles, but this fails for a few of them. I have attached the one out of eighteen articles that for some reason is not being properly processed. I have tried a few minor tweaks, but I still get garbled output from k2pdfopt for this article.

Please help me get this one sorted out.

Dave
It's good when debugging to run the -sm (show markings) option to see how k2pdfopt is cutting up your PDF. In this case, the -er 1 is doing you in--it is exacerbating a little mark between the columns on page 3, thereby preventing k2pdfopt from correctly finding the two columns. If you remove -er 1, it parses your document nicely.
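
For anyone who scripts their conversions, the two changes suggested here (drop -er 1, add -sm while debugging) can be sketched in Python as edits to an argument list. The helper name `drop_option` and the file name `article.pdf` are placeholders, not anything from k2pdfopt itself; the options are the ones quoted above.

```python
def drop_option(args, flag, takes_value=True):
    """Return a copy of args without `flag` (and the value after it, if takes_value)."""
    out = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This is the value belonging to the flag we just dropped.
            skip_next = False
            continue
        if arg == flag:
            skip_next = takes_value
            continue
        out.append(arg)
    return out


# The command line from the original post, split into arguments.
original = ["k2pdfopt", "-bpc", "1", "-m", ".11in,.04in,.14in,.4in",
            "-x", "-as", "-er", "1", "-ch", "0.5"]

# Same options without "-er 1".
without_er = drop_option(original, "-er")

# Debug run: add -sm (show markings) and a placeholder input file.
debug_run = without_er + ["-sm", "article.pdf"]

print(" ".join(debug_run))
```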

01-16-2018, 11:53 PM   #1505
MarjaE
P.S. After more testing, it works on some files, but not others. I know there are a couple file-by-file bugs that can derail tesseract; for example, it won't work with unstated resolution.

P.P.S. I've tried running them through k2pdfopt twice, the first time to set a resolution, and the second to ocr. No luck.

Last edited by MarjaE; 01-17-2018 at 01:03 AM.

01-17-2018, 09:26 AM   #1506
willus
Quote:
Originally Posted by MarjaE
P.S. After more testing, it works on some files, but not others. I know there are a couple file-by-file bugs that can derail tesseract; for example, it won't work with unstated resolution.

P.P.S. I've tried running them through k2pdfopt twice, the first time to set a resolution, and the second to ocr. No luck.
Can you post or PM me some examples of what you are talking about, and a screen shot of k2pdfopt converting the file? Does it claim that it correctly loaded the language, at least?

Resolution should not be an issue with k2pdfopt, so long as it is loading and processing your file without complaining, and the size of the source document looks reasonable (see attached). K2pdfopt sends the words to Tesseract one at a time to be converted by OCR, and it does not include a resolution when it does this.
Attached Thumbnails
screenshot.png (7.6 KB, 225 views)

01-17-2018, 11:18 AM   #1507
dhdurgee
Quote:
Originally Posted by willus
It's good when debugging to run the -sm (show markings) option to see how k2pdfopt is cutting up your PDF. In this case, the -er 1 is doing you in--it is exacerbating a little mark between the columns on page 3, thereby preventing k2pdfopt from correctly finding the two columns. If you remove -er 1, it parses your document nicely.
Interesting, I would never have suspected -er 1 as causing the problem. I include that as sometimes the scans in the PDFs are faint and hard to read without it. Is there also an option that could counteract this by filtering out such scanning noise prior to parsing the document?

Dave

01-17-2018, 08:43 PM   #1508
MarjaE
Here, volume 1 of Soviet Russia gives spectacularly bad ocr results in English, and the 1st issue of volume 3 completely failed to ocr:

https://www.marxists.org/history/usa/pubs/srp/index.htm

Here, volumes 1 and 2 completely failed to ocr in Russian:

http://militera.lib.ru/h/antonov-ovs...a01/index.html

01-17-2018, 11:08 PM   #1509
willus
Quote:
Originally Posted by dhdurgee
Interesting, I would never have suspected -er 1 as causing the problem. I include that as sometimes the scans in the PDFs are faint and hard to read without it. Is there also an option that could counteract this by filtering out such scanning noise prior to parsing the document?

Dave
The -de option is supposed to do this, but it is not used by the algorithm that looks for gaps between two columns.

01-17-2018, 11:19 PM   #1510
willus
Quote:
Originally Posted by MarjaE
Here, volume 1 of Soviet Russia gives spectacularly bad ocr results in English, and the 1st issue of volume 3 completely failed to ocr:

https://www.marxists.org/history/usa/pubs/srp/index.htm

Here, volumes 1 and 2 completely failed to ocr in Russian:

http://militera.lib.ru/h/antonov-ovs...a01/index.html
For the first case, you need to specify this option:

-ocrcol 2

E.g. k2pdfopt -mode copy -ocrcol 2 -ocr t myfile.pdf

That will get k2pdfopt to correctly OCR a 2-column document where you only want OCR applied. I've attached an OCR of page 100.

Not sure what's wrong with the antonov documents--I tried OCR-ing in Russian and it did work, though it had a number of mistakes.

k2pdfopt -mode copy -ocrlang rus -ocr t myfile.pdf

You can add the -p option to quickly test just one page of conversion, e.g.

k2pdfopt -mode copy -ocrlang rus -ocr t -p 95 myfile.pdf

I've attached this conversion as well (physical page 95). You might seriously consider getting Office 365 if you do this kind of thing a lot. I loaded one of the Russian volumes into MS Word and it did a remarkably good job converting it to Russian text. I've attached a screen shot of page 90 loaded into MS Word, with some text selected, along with a graphic of the same page directly from the original PDF file.
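
The OCR-only command lines in this post differ only in language, column count, and test page, so they can be generated from one small helper. This is a sketch in Python under that assumption; `ocr_only_command` is a made-up name, `myfile.pdf` is the placeholder used above, and only options that appear in this post are emitted.

```python
def ocr_only_command(pdf, lang=None, columns=None, page=None):
    """Build the argument list for an OCR-only k2pdfopt run (-mode copy -ocr t)."""
    cmd = ["k2pdfopt", "-mode", "copy"]
    if lang is not None:
        cmd += ["-ocrlang", lang]          # e.g. "rus"
    if columns is not None:
        cmd += ["-ocrcol", str(columns)]   # e.g. 2 for a two-column scan
    cmd += ["-ocr", "t"]
    if page is not None:
        cmd += ["-p", str(page)]           # convert a single page as a quick test
    cmd.append(pdf)
    return cmd


# Reproduces the two test commands given in this post.
print(" ".join(ocr_only_command("myfile.pdf", columns=2)))
print(" ".join(ocr_only_command("myfile.pdf", lang="rus", page=95)))
```

The list form can be handed to `subprocess.run` unchanged, which avoids quoting problems with file names that contain spaces.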
Attached Thumbnails
screenshot.png (230.5 KB, 277 views)
screenshot_pdf.png (693.2 KB, 255 views)
Attached Files
File Type: pdf soviet_v1_p100_with_ocr.pdf (252.1 KB, 146 views)
File Type: pdf antonov_v1_p95_with_ocr.pdf (1.03 MB, 196 views)

Last edited by willus; 01-17-2018 at 11:24 PM.

01-18-2018, 12:41 AM   #1511
MarjaE
I don't know what's going on, because when I try -ocrcol 2, I get a complete failure.

01-18-2018, 08:39 PM   #1512
willus
Quote:
Originally Posted by MarjaE
I don't know what's going on, because when I try -ocrcol 2, I get a complete failure.
A screen shot of the conversion attempt would be very helpful.

01-24-2018, 01:12 PM   #1513
MarjaE
I don't know what a screen shot would show.

I still get a lot of errors where either (a) Tesseract skips entire pages, so if I try to select anything I get a green block with the image of the page, or (b) Tesseract skips most of each page, so the OCRed text is a small portion of each page, and if I select anything else I get a green block with the image of the page.

01-24-2018, 02:39 PM   #1514
MarjaE
Does K2 perform ocr before reformatting each page or after? I am trying to figure out why Tesseract in k2 isn't working for me, while Tesseract in Elucidate works. Unfortunately Elucidate re-encodes in Quartz, which doesn't work well with k2.

I also installed Tesseract on its own, but can't get it working on its own.

P.S. If K2 performs ocr after reformatting each page, maybe I could do an ocr run without reducing resolution: -mode copy -ocr (and as needed -ocrlang)

Last edited by MarjaE; 01-24-2018 at 05:57 PM.

01-25-2018, 08:46 AM   #1515
willus
Quote:
Originally Posted by MarjaE
I don't know what a screen shot would show.
You can either choose to help me out, or you can keep telling me things fail without a whole lot of detail, and the situation will likely never resolve. I try your files, give you a suggestion, and you say it's "a complete failure." I don't know what I can possibly do to diagnose a description like that, particularly when I just tried it on my system and it worked fine. I need files and screen shots to have a fighting chance. A screenshot shows me a lot of information about how you are running k2pdfopt and how it is failing.
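
Where a screen shot is awkward to produce, one possible substitute (not something requested in this thread) is a text log of the exact command and everything it printed. The Python sketch below assumes the conversion runs to completion without waiting for keyboard input; `run_and_log` and the log file name are made up for illustration.

```python
import subprocess
import sys


def run_and_log(cmd, log_path):
    """Run cmd, save the command line and all of its console output to log_path.

    Returns the program's exit code.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge error messages into the same stream
        text=True,
        errors="replace",          # do not crash on unexpected console encodings
    )
    with open(log_path, "w", encoding="utf-8") as log:
        log.write("command: " + " ".join(cmd) + "\n")
        log.write("exit code: " + str(result.returncode) + "\n\n")
        log.write(result.stdout)
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   python run_and_log.py k2pdfopt -mode copy -ocrcol 2 -ocr t myfile.pdf
    sys.exit(run_and_log(sys.argv[1:], "k2pdfopt_run.log"))
```

Posting the resulting log along with the source file shows how the program was run and what it reported, which is the same information a screen shot of the conversion would carry.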