Old 11-05-2018, 09:59 PM   #1606
polarisrising
Junior Member
Posts: 5
Karma: 42206
Join Date: Nov 2018
Device: kindle paperwhite 3
Quote:
Originally Posted by willus View Post
You might try adding -n- to turn off native mode and see if that works any better moving it into calibre. You can try it on just a few pages as a test, e.g. -p 10-20. I suggest this just because when you turn off native mode, k2pdfopt saves the PDF very differently--using bitmaps and its own k2pdfopt-generated OCR layer (extracted from the original OCR layer) rather than just "crop instructions" applied to the original source PDF.
This is getting very close. The images look great in the pdf and they highlight correctly. But, when I go to import the pdf to Calibre, there are two issues: If I import it without changing the settings, it imports the pdf with the images embedded, and no OCR text. If I select the option to not import the images from the pdf, then the pages are all blank.
Old 11-06-2018, 08:28 AM   #1607
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by polarisrising View Post
This is getting very close. The images look great in the pdf and they highlight correctly. But, when I go to import the pdf to Calibre, there are two issues: If I import it without changing the settings, it imports the pdf with the images embedded, and no OCR text. If I select the option to not import the images from the pdf, then the pages are all blank.
You may wish to try the -ocrout option which just dumps all of the OCR text to an ASCII (UTF-8) file:

-ocrout outfile.txt

You'll probably have to go through and clean it up a bit, but the OCR layer appears to be very good, so hopefully your editing will be minimal. I've attached the output from pages 20-25.
Attached Files
File Type: txt outfile.txt (26.0 KB, 204 views)
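
For the clean-up step mentioned above, here is a minimal Python sketch of one possible first pass over the text dump. It is not part of k2pdfopt; the function name `clean_ocr_text` and the heuristics (rejoin words hyphenated across a line break, treat blank lines as paragraph breaks) are assumptions for illustration. Genuine compounds that happen to break at the hyphen (such as "well-" / "known") will be joined incorrectly, so the result still needs a manual read-through.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy a plain-text OCR dump: rejoin hyphenated words, reflow paragraphs."""
    # Normalise line endings first.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    # Heuristic (assumption): only when lowercase letters sit on both sides.
    text = re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)
    # Blank lines are treated as paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Collapse remaining line breaks and runs of whitespace into single spaces.
        flat = " ".join(para.split())
        if flat:
            cleaned.append(flat)
    return "\n\n".join(cleaned)
```

It could be applied to the dump with something like `clean_ocr_text(open("outfile.txt", encoding="utf-8").read())`.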
Old 11-06-2018, 01:26 PM   #1608
kevlar64
Junior Member
Posts: 4
Karma: 42206
Join Date: Nov 2018
Device: Kindle Paperwhite
Hi! First post here. Amazing piece of software, I have to say. I am, however, having trouble getting Tesseract to work. Despite adding all of the files you show as necessary to the tesseract folder, I get the message 'could not find tesseract data'. Any idea what might be the problem?

Thanks
Old 11-06-2018, 07:48 PM   #1609
polarisrising
Junior Member
Posts: 5
Karma: 42206
Join Date: Nov 2018
Device: kindle paperwhite 3
Quote:
Originally Posted by kevlar64 View Post
Hi! First post here. Amazing piece of software, I have to say. I am, however, having trouble getting Tesseract to work. Despite adding all of the files you show as necessary to the tesseract folder, I get the message 'could not find tesseract data'. Any idea what might be the problem?

Thanks
What operating system are you using? How did you install Tesseract?
Old 11-06-2018, 07:51 PM   #1610
polarisrising
Junior Member
Posts: 5
Karma: 42206
Join Date: Nov 2018
Device: kindle paperwhite 3
Quote:
Originally Posted by willus View Post
You may wish to try the -ocrout option which just dumps all of the OCR text to an ASCII (UTF-8) file:

-ocrout outfile.txt

You'll probably have to go through and clean it up a bit, but the OCR layer appears to be very good, so hopefully your editing will be minimal. I've attached the output from pages 20-25.
That's great. Thank you so much for your help! It strips the formatting, but really, that shouldn't be too hard to patch back up. I really appreciate the help, and I'll try my best to help others in return.

Old 11-07-2018, 05:41 PM   #1611
ekinrot
Junior Member
Posts: 5
Karma: 37936
Join Date: Sep 2018
Device: kindle paperwhite 7th (5.10.1.1)
How can I fix these small sentences?
Attached Thumbnails: Untitled.png, Untitled2.png
Old 11-07-2018, 08:51 PM   #1612
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by ekinrot View Post
How to fix these small sentences
It would help if you could post the source or PM me the source PDF file.
Old 11-08-2018, 04:15 AM   #1613
ekinrot
Junior Member
Posts: 5
Karma: 37936
Join Date: Sep 2018
Device: kindle paperwhite 7th (5.10.1.1)
Ok my friend
Old 11-08-2018, 11:10 PM   #1614
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by ekinrot View Post
How to fix these small sentences
Any kind of intelligent formatting that k2pdfopt tries to do will likely not be very successful because of how diverse your source PDF is (some pages are 2 columns, some are not, and lots of pages have specially positioned text relative to a figure), so I'd recommend just a straight cropping of every page into 2 pages (2 columns) using the grid option, even when it will cut a figure in half:

k2pdfopt -grid 2x1 sourcefile.pdf
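
To make the "2 columns by 1 row" idea concrete, here is a minimal Python sketch of how a page could be divided into equal crop boxes. It only illustrates the geometry; it is not k2pdfopt's implementation, and the function name `grid_boxes`, the absence of any overlap between cells, and the left-to-right ordering are all assumptions.

```python
from typing import List, Tuple

# (left, top, right, bottom) in the same units as the page size
Box = Tuple[float, float, float, float]


def grid_boxes(width: float, height: float, cols: int, rows: int) -> List[Box]:
    """Split a page of the given size into cols x rows equal crop boxes."""
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")
    cell_w = width / cols
    cell_h = height / rows
    boxes = []
    # Row-major order: left to right, then top to bottom (an assumption).
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h))
    return boxes
```

For a 612 x 792 point page, `grid_boxes(612, 792, 2, 1)` gives a left half and a right half, which is why a figure spanning both columns ends up cut in two.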
Old 11-09-2018, 05:36 AM   #1615
kevlar64
Junior Member
Posts: 4
Karma: 42206
Join Date: Nov 2018
Device: Kindle Paperwhite
Quote:
Originally Posted by polarisrising View Post
What operating system are you using? How did you install Tesseract?
Windows 10. The FAQ gave the impression that it was not necessary to install Tesseract, and that only the training data files were necessary.

"NOTE! To use the Tesseract OCR engine built into k2pdfopt, you only have to install the Tesseract language training file for your language (see example below for English). You do not need to install the Tesseract engine! You can install multiple language files if you want to be able to OCR documents in different languages."


Have I missed something?
Old 11-09-2018, 05:46 PM   #1616
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by kevlar64 View Post
Windows 10. The FAQ gave the impression that it was not necessary to install Tesseract, and that only the training data files were necessary.

"NOTE! To use the Tesseract OCR engine built into k2pdfopt, you only have to install the Tesseract language training file for your language (see example below for English). You do not need to install the Tesseract engine! You can install multiple language files if you want to be able to OCR documents in different languages."


Have I missed something?
That is correct. Did you read all the way through the OCR help page? It's important that you set the TESSDATA_PREFIX environment variable to point to the correct folder where you have stored the training files. I just now clarified the environment variable instructions--they had some mistakes.
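
As a quick sanity check on the environment variable, here is a minimal Python sketch that lists the `*.traineddata` files visible under `TESSDATA_PREFIX`. The function name `find_traineddata` is hypothetical, and the sketch looks in two places (the folder itself and a `tessdata` subfolder) purely as an assumption, since the expected layout is described on the OCR help page rather than here.

```python
import os
from pathlib import Path
from typing import Dict, List, Optional


def find_traineddata(prefix: Optional[str] = None) -> Dict[str, List[str]]:
    """Report which *.traineddata files are visible under TESSDATA_PREFIX.

    Two candidate layouts are checked (an assumption): files directly in the
    prefix folder, and files in a "tessdata" subfolder of it.
    """
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    base = Path(prefix)
    if not base.is_dir():
        raise RuntimeError(f"TESSDATA_PREFIX is not a folder: {base}")
    found: Dict[str, List[str]] = {}
    for candidate in (base, base / "tessdata"):
        if candidate.is_dir():
            names = sorted(p.name for p in candidate.glob("*.traineddata"))
        else:
            names = []
        found[str(candidate)] = names
    return found
```

Calling `find_traineddata()` in the same session that launches k2pdfopt shows whether the variable is set at all and whether a language file such as `eng.traineddata` sits where it points. An empty list for both locations suggests the path, not the OCR engine, is the problem.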

Last edited by willus; 11-09-2018 at 05:57 PM.
Old 11-12-2018, 09:39 AM   #1617
kevlar64
Junior Member
Posts: 4
Karma: 42206
Join Date: Nov 2018
Device: Kindle Paperwhite
Quote:
Originally Posted by willus View Post
That is correct. Did you read all the way through the OCR help page? It's important that you set the TESSDATA_PREFIX environment variable to point to the correct folder where you have stored the training files. I just now clarified the environment variable instructions--they had some mistakes.
Hey,

Yes, I read through all of it, and I followed the instructions exactly to the best of my knowledge. Any idea what might be the problem? Let me know if screenshots of anything in particular would be useful.
Old 11-14-2018, 09:52 PM   #1618
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by kevlar64 View Post
Yes, I read through all of it, and I followed the instructions exactly to the best of my knowledge. Any idea what might be the problem? Let me know if screenshots of anything in particular would be useful.
Are you comfortable running things from the command line? (That gives me an idea to put a Tesseract diagnostic into the GUI...)
Old 11-18-2018, 05:38 AM   #1619
gg4u
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
Tesseract 4.0.0 - environment variable cannot find tessdata (mac)

Hello everybody. Willus, thank you so much for taking the time to help with this.

I'm a Mac user, still on 10.9, and I had to install Tesseract via brew.
Tesseract version: 4.0.0.

The tessdata folder is:

/usr/local/Cellar/tesseract/4.0.0/share/tessdata/

I set the environment variable as:

export TESSDATA_PREFIX=/usr/local/Cellar/tesseract/4.0.0/share/

(I also tried it without the trailing slash:
export TESSDATA_PREFIX=/usr/local/Cellar/tesseract/4.0.0/share )

But I keep getting an error saying that k2pdfopt cannot pick up the tessdata files (I'm using the command line):

Initializing OCR for 4 threads xxxx
Could not find Tesseract data (env var TESSDATA_PREFIX = /usr/local/Cellar/tesseract/4.0.0/share/).
Using GOCR v0.50.

Note that the tessdata folder contains:
configs eng.traineddata osd.traineddata pdf.ttf tessconfigs

Maybe the data files changed between Tesseract 3 and 4?
Or am I mistyping something in the environment variable?

As a test, I extracted a TIFF file from a PDF with Ghostscript and ran Tesseract on it:
tesseract -l eng mypdf.tif mypdf

It works.

Can you help get k2pdfopt to recognise the Tesseract installation?
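
One way to rule out a mistyped or unexported variable is to pass it explicitly to the child process. The following is a minimal Python sketch under that assumption; `run_with_tessdata_prefix` is a hypothetical helper, and the probe command is just a second Python process that echoes the variable back.

```python
import os
import subprocess
import sys
from typing import List


def run_with_tessdata_prefix(cmd: List[str], prefix: str) -> subprocess.CompletedProcess:
    """Run cmd with TESSDATA_PREFIX set only in the child's environment."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["TESSDATA_PREFIX"] = prefix
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Self-check: a child Python process that prints the value it received.
probe = [sys.executable, "-c", "import os; print(os.environ.get('TESSDATA_PREFIX'))"]
result = run_with_tessdata_prefix(probe, "/usr/local/Cellar/tesseract/4.0.0/share/")
print(result.stdout.strip())
```

Replacing `probe` with the usual k2pdfopt command line would show whether the same "Could not find Tesseract data" message appears even when the variable is guaranteed to reach the process. If it does, the cause lies in the data files or their location rather than in the shell setup.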
Old 11-18-2018, 09:40 AM   #1620
gg4u
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
Tesseract 4.0.0 - environment variable cannot find tessdata (mac) (2)

Hi Willus,

I tried installing the tessdata v3.05 files from:

https://github.com/tesseract-ocr/tessdata

It works. It's processing now; I'll check the result when it finishes, but at least it is working.

Could you tell me which files I need to keep to process English?

Would you consider updating to Tesseract 4.0?

I looked at the git repos for k2pdfopt, but:
- I could not compile it because I'm missing a header file: k2pdfopt.h
- I don't know enough C or Tesseract to modify your wrapper :/