Old 11-05-2018, 10:59 PM   #1606
polarisrising
Junior Member
Posts: 5
Karma: 42206
Join Date: Nov 2018
Device: kindle paperwhite 3
Quote:
Originally Posted by willus View Post
You might try adding -n- to turn off native mode and see if that works any better moving it into calibre. You can try it on just a few pages as a test, e.g. -p 10-20. I suggest this just because when you turn off native mode, k2pdfopt saves the PDF very differently--using bitmaps and its own k2pdfopt-generated OCR layer (extracted from the original OCR layer) rather than just "crop instructions" applied to the original source PDF.
This is getting very close. The images look great in the pdf and they highlight correctly. But, when I go to import the pdf to Calibre, there are two issues: If I import it without changing the settings, it imports the pdf with the images embedded, and no OCR text. If I select the option to not import the images from the pdf, then the pages are all blank.
Old 11-06-2018, 09:28 AM   #1607
willus
Fuzzball, the purple cat
Posts: 1,013
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by polarisrising View Post
This is getting very close. The images look great in the pdf and they highlight correctly. But, when I go to import the pdf to Calibre, there are two issues: If I import it without changing the settings, it imports the pdf with the images embedded, and no OCR text. If I select the option to not import the images from the pdf, then the pages are all blank.
You may wish to try the -ocrout option which just dumps all of the OCR text to an ASCII (UTF-8) file:

-ocrout outfile.txt

You'll probably have to go through and clean it up a bit, but the OCR layer appears to be very good, so hopefully your editing will be minimal. I've attached the output from pages 20-25.
Attached Files
File Type: txt outfile.txt (26.0 KB, 8 views)
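For the cleanup pass on the dumped text, a minimal Python sketch follows. It assumes the file is UTF-8 plain text in which blank lines separate paragraphs and words may be hyphenated across line ends. The function name `reflow_ocr_text` is hypothetical and not part of k2pdfopt, and the real output may be laid out differently, so treat this as a starting point only.

```python
import re


def reflow_ocr_text(text: str) -> str:
    """Join hard-wrapped OCR lines into paragraphs (illustrative only)."""
    # Normalize line endings so the regexes below see only "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Re-join words hyphenated across a line break: "exam-\nple" -> "example".
    # Assumption: a hyphen directly before a newline is a soft hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    paragraphs = []
    # Assumption: one or more blank lines separate paragraphs.
    for block in re.split(r"\n\s*\n", text):
        # Collapse the remaining line breaks and runs of spaces.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "This is an exam-\nple of OCR text\nwrapped hard.\n\nSecond para-\ngraph."
    print(reflow_ocr_text(sample))
```

Genuinely hyphenated compounds that happen to break at a line end would be joined too, so a manual read-through is still needed afterwards.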
Old 11-06-2018, 02:26 PM   #1608
kevlar64
Junior Member
Posts: 3
Karma: 42206
Join Date: Nov 2018
Device: Kindle Paperwhite
Hi! First post here. Amazing piece of software, I have to say. I am, however, having trouble getting Tesseract to work. Despite adding all of the files you show as necessary to the tesseract folder, I get the message 'could not find tesseract data'. Any idea what might be the problem?

Thanks
Old 11-06-2018, 08:48 PM   #1609
polarisrising
Junior Member
Posts: 5
Karma: 42206
Join Date: Nov 2018
Device: kindle paperwhite 3
Quote:
Originally Posted by kevlar64 View Post
Hi! First post here. Amazing piece of software, I have to say. I am, however, having trouble getting Tesseract to work. Despite adding all of the files you show as necessary to the tesseract folder, I get the message 'could not find tesseract data'. Any idea what might be the problem?

Thanks
What operating system are you using? How did you install Tesseract?
Old 11-06-2018, 08:51 PM   #1610
polarisrising
Junior Member
Posts: 5
Karma: 42206
Join Date: Nov 2018
Device: kindle paperwhite 3
Quote:
Originally Posted by willus View Post
You may wish to try the -ocrout option which just dumps all of the OCR text to an ASCII (UTF-8) file:

-ocrout outfile.txt

You'll probably have to go through and clean it up a bit, but the OCR layer appears to be very good, so hopefully your editing will be minimal. I've attached the output from pages 20-25.
That's great. Thank you so much for your help! It strips the formatting, but really, that shouldn't be too hard to patch back up. I really appreciate the help, and I'll try my best to help others in return.

Old 11-07-2018, 06:41 PM   #1611
ekinrot
Junior Member
Posts: 5
Karma: 37936
Join Date: Sep 2018
Device: kindle paperwhite 7th (5.10.1.1)
How can I fix these small sentences?
Attached Thumbnails
Untitled.png (88.3 KB, 16 views)
Untitled2.png (77.6 KB, 10 views)
Old 11-07-2018, 09:51 PM   #1612
willus
Fuzzball, the purple cat
Posts: 1,013
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by ekinrot View Post
How can I fix these small sentences?
It would help if you could post the source PDF file or PM it to me.
Old 11-08-2018, 05:15 AM   #1613
ekinrot
Junior Member
Posts: 5
Karma: 37936
Join Date: Sep 2018
Device: kindle paperwhite 7th (5.10.1.1)
Ok my friend
Old 11-09-2018, 12:10 AM   #1614
willus
Fuzzball, the purple cat
Posts: 1,013
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by ekinrot View Post
How can I fix these small sentences?
Any kind of intelligent formatting that k2pdfopt tries to do will likely not be very successful because your source PDF is so varied (some pages are 2 columns, some are not, and lots of pages have specially positioned text relative to a figure). I'd recommend just a straight cropping of every page into 2 pages (2 columns) using the grid option, even when it will cut a figure in half:

k2pdfopt -grid 2x1 sourcefile.pdf
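
To picture what a straight grid crop does, here is a small Python sketch of the geometry only. It is not k2pdfopt's implementation. The function name `grid_boxes` and the left-to-right, top-to-bottom ordering are assumptions made for illustration, and `2x1` is read as 2 columns by 1 row, as described above.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def grid_boxes(width: float, height: float, cols: int, rows: int) -> List[Box]:
    """Split a page of the given size into cols x rows equal crop boxes."""
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")
    cell_w = width / cols
    cell_h = height / rows
    boxes = []
    # Assumed ordering: across each row first, then down the page.
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h))
    return boxes


if __name__ == "__main__":
    # A US Letter page in points (612 x 792), cut into 2 columns and 1 row.
    for box in grid_boxes(612, 792, cols=2, rows=1):
        print(box)
```

Because the cut is purely positional, anything that straddles the middle of the page, such as a wide figure, ends up split between the two output pages.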
Old 11-09-2018, 06:36 AM   #1615
kevlar64
Junior Member
Posts: 3
Karma: 42206
Join Date: Nov 2018
Device: Kindle Paperwhite
Quote:
Originally Posted by polarisrising View Post
What operating system are you using? How did you install Tesseract?
Windows 10. The FAQ gave the impression that it was not necessary to install tesseract, and that only the training data files were necessary.

"NOTE! To use the Tesseract OCR engine built into k2pdfopt, you only have to install the Tesseract language training file for your language (see example below for English). You do not need to install the Tesseract engine! You can install multiple language files if you want to be able to OCR documents in different languages."


Have I missed something?
Old 11-09-2018, 06:46 PM   #1616
willus
Fuzzball, the purple cat
Posts: 1,013
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by kevlar64 View Post
Windows 10. The FAQ gave the impression that it was not necessary to install tesseract, and that only the training data files were necessary.

"NOTE! To use the Tesseract OCR engine built into k2pdfopt, you only have to install the Tesseract language training file for your language (see example below for English). You do not need to install the Tesseract engine! You can install multiple language files if you want to be able to OCR documents in different languages."


Have I missed something?
That is correct. Did you read all the way through the OCR help page? It's important that you set the TESSDATA_PREFIX environment variable to point to the correct folder where you have stored the training files. I just now clarified the environment variable instructions--they had some mistakes.
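
One way to check the environment variable is a short Python sketch like the one below. It only reports whether `TESSDATA_PREFIX` is set and where `.traineddata` files are visible. It looks both in the folder itself and in a `tessdata` subfolder because which of the two is expected is an assumption here, so follow the OCR help page for the authoritative layout. The function name `find_traineddata` is hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, List, Optional


def find_traineddata(prefix: Optional[str]) -> Dict[str, List[str]]:
    """Return the .traineddata file names found under prefix, keyed by folder."""
    found: Dict[str, List[str]] = {}
    if not prefix:
        return found  # Variable unset or empty: nothing to search.
    base = Path(prefix)
    # Assumption: files may sit in the folder itself or in a tessdata subfolder.
    for folder in (base, base / "tessdata"):
        if folder.is_dir():
            names = sorted(p.name for p in folder.glob("*.traineddata"))
            if names:
                found[str(folder)] = names
    return found


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX")
    print("TESSDATA_PREFIX =", prefix)
    results = find_traineddata(prefix)
    if not results:
        print("No .traineddata files found; check the variable and the folder.")
    for folder, names in results.items():
        print(folder, "->", ", ".join(names))
```

On Windows, a variable set through the system settings dialog is only seen by programs started afterwards, so run the check from a newly opened command prompt.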

Last edited by willus; 11-09-2018 at 06:57 PM.
Old Yesterday, 10:39 AM   #1617
kevlar64
Junior Member
Posts: 3
Karma: 42206
Join Date: Nov 2018
Device: Kindle Paperwhite
Quote:
Originally Posted by willus View Post
That is correct. Did you read all the way through the OCR help page? It's important that you set the TESSDATA_PREFIX environment variable to point to the correct folder where you have stored the training files. I just now clarified the environment variable instructions--they had some mistakes.
Hey,

Yes, I read through all of it, and I followed the instructions exactly to the best of my knowledge. Any idea what might be the problem? Let me know if screenshots of anything in particular would be useful.