Old 01-16-2018, 01:46 PM   #1501
dhdurgee
Evangelist
Posts: 402
Karma: 1185420
Join Date: Jun 2010
Device: K3W, KT2
I just encountered a case where my current k2pdfopt command line is failing to work. I have been using "k2pdfopt -bpc 1 -m .11in,.04in,.14in,.4in -x -as -er 1 -ch 0.5" with pdf images of magazine articles, but this fails for a few of them. I have attached the one out of eighteen articles that for some reason is not being properly processed. I have tried a few minor tweaks, but I still get garbled output from k2pdfopt for this article.

Please help me get this one sorted out.

Dave
Attached Files
File Type: pdf Analog_2018-01-16-The_Dissonant_Note.pdf (792.2 KB, 9 views)
Old 01-16-2018, 04:26 PM   #1502
MarjaE
Addict
Posts: 319
Karma: 1548692
Join Date: Jun 2015
Device: Iriver Story HD
Thanks.

I finally got Tesseract working in English, but still can't get it working in the other languages I've downloaded. -ocr lan[guage] ignores the lan[guage] and does English. -ocrlang lan[guage] skips ocr.
Old 01-16-2018, 10:16 PM   #1503
willus
Fuzzball, the purple cat
Posts: 963
Karma: 7562459
Join Date: Jun 2011
Location: California
Device: Kindle 2, iPad
Quote:
Originally Posted by MarjaE View Post
Thanks.

I finally got Tesseract working in English, but still can't get it working in the other languages I've downloaded. -ocr lan[guage] ignores the lan[guage] and does English. -ocrlang lan[guage] skips ocr.
I just tried this (in Windows 10):

k2pdfopt -ocrlang chi_tra -ocr t mydoc.pdf

And it seems to work, telling me it selected Chinese. Are you sure you have the other language training files in place? Here is how my Tesseract Data folder looks:
Code:
DATE      TIME                    SIZE FILE
08/23/11  02:13p                   139 rus.cube.fold
08/23/11  02:13p                   317 rus.cube.params
08/23/11  02:13p               912,800 rus.cube.nn
08/23/11  02:13p                   278 rus.cube.lm
08/23/11  02:13p             7,064,074 rus.cube.word-freq
08/23/11  02:13p            15,241,687 rus.cube.size
10/08/12  03:42p            15,636,141 rus.traineddata
10/16/12  01:00p            39,973,777 chi_sim.traineddata
10/16/12  01:00p            54,349,418 chi_tra.traineddata
10/17/12  07:55a                   254 eng.cube.params
10/17/12  07:55a               857,304 eng.cube.nn
10/17/12  07:55a               171,918 eng.cube.bigrams
10/17/12  07:55a                   181 eng.cube.lm
10/17/12  07:55a                   996 eng.tesseract_cube.nn
10/17/12  07:55a             2,444,187 eng.cube.word-freq
10/17/12  07:55a            13,020,078 eng.cube.size
10/17/12  07:55a                    38 eng.cube.fold
09/01/13  12:25p            21,876,572 eng.traineddata
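
One way to rule out a missing training file is to check the Tesseract data folder for `<lang>.traineddata` before running k2pdfopt. The following is a minimal Python sketch, not part of k2pdfopt; the helper name `missing_traineddata` is made up for illustration, and it assumes the folder layout shown in the listing above (one `<lang>.traineddata` file per language).

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the language codes with no <lang>.traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    return [
        lang
        for lang in languages
        if not (folder / f"{lang}.traineddata").is_file()
    ]


# Example call (the folder path is hypothetical; use your own data folder):
# missing_traineddata(r"C:\tessdata", ["eng", "rus", "chi_tra"])
```

An empty list means every requested language has a training file in that folder. It says nothing about whether the file is valid or matches the Tesseract version in use.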

Last edited by willus; 01-16-2018 at 10:19 PM.
Old 01-16-2018, 11:42 PM   #1504
willus
Fuzzball, the purple cat
Quote:
Originally Posted by dhdurgee View Post
I just encountered a case where my current k2pdfopt command line is failing to work. I have been using "k2pdfopt -bpc 1 -m .11in,.04in,.14in,.4in -x -as -er 1 -ch 0.5" with pdf images of magazine articles, but this fails for a few of them. I have attached the one out of eighteen articles that for some reason is not being properly processed. I have tried a few minor tweaks, but I still get garbled output from k2pdfopt for this article.

Please help me get this one sorted out.

Dave
When debugging, it helps to run with the -sm (show markings) option to see how k2pdfopt is cutting up your PDF. In this case, -er 1 is the culprit: it amplifies a small mark between the columns on page 3, which prevents k2pdfopt from correctly finding the two columns. If you remove -er 1, it parses your document nicely.
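
The debugging step described here can be sketched as a small Python script that takes the option list from the quoted command, drops `-er 1`, and appends `-sm`. This is only an illustration: the helper `without_option` and the file name `article.pdf` are made up, and the script prints the command rather than running it.

```python
def without_option(args, flag, n_values=1):
    """Return a copy of args without flag and the n_values items that follow it."""
    result = []
    skip = 0
    for arg in args:
        if skip:
            skip -= 1
            continue
        if arg == flag:
            skip = n_values
            continue
        result.append(arg)
    return result


# Options from the command line quoted above.
original = ["-bpc", "1", "-m", ".11in,.04in,.14in,.4in",
            "-x", "-as", "-er", "1", "-ch", "0.5"]

# Drop "-er 1" and add -sm (show markings); article.pdf is a placeholder name.
debug_cmd = ["k2pdfopt"] + without_option(original, "-er") + ["-sm", "article.pdf"]
print(" ".join(debug_cmd))
```

Comparing the marked-up output with and without `-er 1` should show whether the column gap on the problem page is being detected.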
Old 01-17-2018, 12:53 AM   #1505
MarjaE
Addict
P.S. After more testing, it works on some files, but not others. I know there are a couple file-by-file bugs that can derail tesseract; for example, it won't work with unstated resolution.

P.P.S. I've tried running them through k2pdfopt twice, the first time to set a resolution, and the second to ocr. No luck.

Last edited by MarjaE; 01-17-2018 at 02:03 AM.
Old 01-17-2018, 10:26 AM   #1506
willus
Fuzzball, the purple cat
Quote:
Originally Posted by MarjaE View Post
P.S. After more testing, it works on some files, but not others. I know there are a couple file-by-file bugs that can derail tesseract; for example, it won't work with unstated resolution.

P.P.S. I've tried running them through k2pdfopt twice, the first time to set a resolution, and the second to ocr. No luck.
Can you post or PM me some examples of what you are talking about, and a screen shot of k2pdfopt converting the file? Does it claim that it correctly loaded the language, at least?

Resolution should not be an issue with k2pdfopt, so long as it is loading and processing your file without complaining, and the size of the source document looks reasonable (see attached). K2pdfopt sends the words to Tesseract one at a time to be converted by OCR, and it does not include a resolution when it does this.
Attached Thumbnails
screenshot.png (7.6 KB, 10 views)
Old 01-17-2018, 12:18 PM   #1507
dhdurgee
Evangelist
Quote:
Originally Posted by willus View Post
When debugging, it helps to run with the -sm (show markings) option to see how k2pdfopt is cutting up your PDF. In this case, -er 1 is the culprit: it amplifies a small mark between the columns on page 3, which prevents k2pdfopt from correctly finding the two columns. If you remove -er 1, it parses your document nicely.
Interesting, I would never have suspected -er 1 as causing the problem. I include that as sometimes the scans in the PDFs are faint and hard to read without it. Is there also an option that could counteract this by filtering out such scanning noise prior to parsing the document?

Dave
Old 01-17-2018, 09:43 PM   #1508
MarjaE
Addict
Here, volume 1 of Soviet Russia gives spectacularly bad ocr results in English, and the 1st issue of volume 3 completely failed to ocr:

https://www.marxists.org/history/usa/pubs/srp/index.htm

Here, volumes 1 and 2 completely failed to ocr in Russian:

http://militera.lib.ru/h/antonov-ovs...a01/index.html
Old 01-18-2018, 12:08 AM   #1509
willus
Fuzzball, the purple cat
Quote:
Originally Posted by dhdurgee View Post
Interesting, I would never have suspected -er 1 as causing the problem. I include that as sometimes the scans in the PDFs are faint and hard to read without it. Is there also an option that could counteract this by filtering out such scanning noise prior to parsing the document?

Dave
The -de option is supposed to do this, but it is not used by the algorithm that looks for gaps between two columns.
Old 01-18-2018, 12:19 AM   #1510
willus
Fuzzball, the purple cat
Quote:
Originally Posted by MarjaE View Post
Here, volume 1 of Soviet Russia gives spectacularly bad ocr results in English, and the 1st issue of volume 3 completely failed to ocr:

https://www.marxists.org/history/usa/pubs/srp/index.htm

Here, volumes 1 and 2 completely failed to ocr in Russian:

http://militera.lib.ru/h/antonov-ovs...a01/index.html
For the first case, you need to specify this option:

-ocrcol 2

E.g. k2pdfopt -mode copy -ocrcol 2 -ocr t myfile.pdf

That should get k2pdfopt to OCR a two-column document correctly when you only want OCR applied and no other conversion. I've attached the OCR result for page 100.

I'm not sure what's wrong with the Antonov documents. I tried OCR-ing in Russian and it did work, though with a number of mistakes.

k2pdfopt -mode copy -ocrlang rus -ocr t myfile.pdf

You can add the -p option to quickly test just one page of conversion, e.g.

k2pdfopt -mode copy -ocrlang rus -ocr t -p 95 myfile.pdf

I've attached this conversion as well (physical page 95). You might seriously consider getting Office 365 if you do this kind of thing a lot. I loaded one of the Russian volumes into MS Word and it did a remarkably good job converting it to Russian text. I've attached a screen shot of page 90 loaded into MS Word, with some text selected, along with a graphic of the same page directly from the original PDF file.
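
The single-page test commands above can be generated with a small helper when several files, pages, or languages need checking. This is a rough Python sketch, not part of k2pdfopt: the function name `ocr_test_command` is made up, it only uses the options already shown in this post, and combining `-ocrcol` with `-ocrlang` and `-p` in one command is an assumption. It builds and prints the command without running it.

```python
def ocr_test_command(pdf, page, lang=None, columns=None):
    """Build a k2pdfopt command line that OCRs one page in copy mode."""
    cmd = ["k2pdfopt", "-mode", "copy"]
    if lang is not None:
        cmd += ["-ocrlang", lang]          # e.g. "rus"
    if columns is not None:
        cmd += ["-ocrcol", str(columns)]   # e.g. 2 for a two-column page
    cmd += ["-ocr", "t", "-p", str(page), pdf]
    return cmd


# Reproduces the Russian single-page test from this post.
print(" ".join(ocr_test_command("myfile.pdf", 95, lang="rus")))
```

Testing one page at a time keeps each attempt short, which makes it easier to compare the console output between a file that works and one that fails.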
Attached Thumbnails
screenshot.png (230.5 KB, 12 views)
screenshot_pdf.png (693.2 KB, 11 views)
Attached Files
File Type: pdf soviet_v1_p100_with_ocr.pdf (252.1 KB, 9 views)
File Type: pdf antonov_v1_p95_with_ocr.pdf (1.03 MB, 10 views)

Last edited by willus; 01-18-2018 at 12:24 AM.
Old 01-18-2018, 01:41 AM   #1511
MarjaE
Addict
I don't know what's going on, because when I try -ocrcol 2, I get a complete failure.
Old 01-18-2018, 09:39 PM   #1512
willus
Fuzzball, the purple cat
Quote:
Originally Posted by MarjaE View Post
I don't know what's going on, because when I try -ocrcol 2, I get a complete failure.
A screen shot of the conversion attempt would be very helpful.