Old 11-18-2018, 03:30 PM   #1621
willus
Fuzzball, the purple cat
Posts: 1,020
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by gg4u View Post
Hi Willus,

I tried to install tessdata v.3.05 from:

https://github.com/tesseract-ocr/tessdata

It works. It is processing now; I'll check the result when it finishes, but at least it is working.

Could you tell me which files I need to keep to process the English (eng) language?

Would you consider updating to Tesseract v4.0?

I looked at the git repos for k2pdfopt, but:
- I could not compile it because I am missing a header file: k2pdfopt.h
- I don't know much C or Tesseract, so I can't make modifications to your wrapper :/
I am hoping to eventually compile w/Tesseract 4.0.0. It was just officially released only three weeks ago (Oct 29, 2018). I don't recommend trying to build k2pdfopt yourself unless you are pretty adventurous. It has a lot of dependencies.

For Tesseract 3.05, you need these files in your data folder:

eng.cube.params
eng.cube.nn
eng.cube.bigrams
eng.cube.lm
eng.tesseract_cube.nn
eng.cube.word-freq
eng.cube.size
eng.cube.fold
eng.traineddata
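
A minimal sketch in Python of how the list above could be used to decide which files in a tessdata folder to keep for English. The file names are copied from the list; the function name `split_tessdata` and the idea of checking a folder this way are illustrative assumptions, not part of k2pdfopt or Tesseract.

```python
# File names copied from the list above (Tesseract 3.05, English).
REQUIRED_ENG_FILES = {
    "eng.cube.params",
    "eng.cube.nn",
    "eng.cube.bigrams",
    "eng.cube.lm",
    "eng.tesseract_cube.nn",
    "eng.cube.word-freq",
    "eng.cube.size",
    "eng.cube.fold",
    "eng.traineddata",
}


def split_tessdata(filenames):
    """Sort file names into (keep, removable, missing) for English OCR.

    keep      -- names present that are on the required list
    removable -- names present that are not on the required list
    missing   -- required names that are not present
    """
    names = list(filenames)  # allow generators to be passed in
    keep = sorted(n for n in names if n in REQUIRED_ENG_FILES)
    removable = sorted(n for n in names if n not in REQUIRED_ENG_FILES)
    missing = sorted(REQUIRED_ENG_FILES - set(names))
    return keep, removable, missing
```

The names could come from something like `os.listdir()` on the data folder. The function only reports; it deletes nothing.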
Old 11-19-2018, 10:53 AM   #1622
gg4u
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
Oh, thank you Willus,

keeping only the eng files will free up some disk space.

Would you suggest how to make the best use of k2pdfopt?

I'd like to reflow a PDF of scanned images into an EPUB containing figures and chapters.

k2pdfopt seems to detect where the images are. I processed the original PDF into an OCRed version, and the characters are blurred.

For comparison, I tried Ghostscript and Tesseract:
from PDF to TIFF, then from TIFF to TXT.

Here the results were quite good, but I lose all the figures and the chapter markup.

As the final result for the written text, I would like EPUB or MOBI (sharp rendering of characters), not PDF, but still with the figures and a TOC.

Is there perhaps another format besides TXT that Tesseract can export to and that will keep the images (RTF)?

I could eventually mark the TOC manually. Which markup is correct?

What steps should I take to convert a PDF into an EPUB containing images and markup?

I also shared this thread: https://www.mobileread.com/forums/sh...d.php?t=312652

Can I also ask how you approached the problem of detecting figures in a PDF? I'm interested in the problem solving.
Old 11-22-2018, 09:35 AM   #1623
willus
Fuzzball, the purple cat
Posts: 1,020
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by gg4u View Post
Oh, thank you Willus,

keeping only the eng files will free up some disk space.

Would you suggest how to make the best use of k2pdfopt?

I'd like to reflow a PDF of scanned images into an EPUB containing figures and chapters.

k2pdfopt seems to detect where the images are. I processed the original PDF into an OCRed version, and the characters are blurred.

For comparison, I tried Ghostscript and Tesseract:
from PDF to TIFF, then from TIFF to TXT.

Here the results were quite good, but I lose all the figures and the chapter markup.

As the final result for the written text, I would like EPUB or MOBI (sharp rendering of characters), not PDF, but still with the figures and a TOC.

Is there perhaps another format besides TXT that Tesseract can export to and that will keep the images (RTF)?

I could eventually mark the TOC manually. Which markup is correct?

What steps should I take to convert a PDF into an EPUB containing images and markup?

I also shared this thread: https://www.mobileread.com/forums/sh...d.php?t=312652

Can I also ask how you approached the problem of detecting figures in a PDF? I'm interested in the problem solving.
I've collected my ideas on converting PDFs on a web page that I've had up for a while. There is no magic solution, other than maybe trying MS Word if you have access to it. I don't really detect figures; I just look for places where I don't see the gaps that would occur between normal rows of text. If I don't find a gap for more than a given span (~1.5 inches), I consider that a figure. Very simplistic.
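
The heuristic described above can be sketched roughly as follows. This is an illustration of the idea only, not k2pdfopt's actual code: it assumes the page has already been reduced to one boolean per pixel row (True if the row contains any dark pixels), and the names `find_figure_spans`, `row_has_ink`, and `max_text_span_in` are hypothetical.

```python
def find_figure_spans(row_has_ink, dpi, max_text_span_in=1.5):
    """Return (start, end) row ranges that contain no blank row for longer
    than max_text_span_in inches. 'end' is exclusive.

    Normal text shows a blank gap between rows of text every line or so.
    A long run of rows with no such gap is treated as a figure.
    """
    limit_rows = max_text_span_in * dpi
    spans = []
    run_start = None
    # A trailing False acts as a sentinel so the last run is always closed.
    for i, ink in enumerate(list(row_has_ink) + [False]):
        if ink and run_start is None:
            run_start = i                      # a run of inked rows begins
        elif not ink and run_start is not None:
            if i - run_start > limit_rows:     # no gap for too long: figure
                spans.append((run_start, i))
            run_start = None                   # gap found, reset
    return spans
```

With `dpi=10`, for example, any run of more than 15 consecutive inked rows is reported, while shorter runs separated by blank rows are treated as ordinary text lines.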
Old 11-23-2018, 06:27 PM   #1624
kevlar64
Junior Member
Posts: 4
Karma: 42206
Join Date: Nov 2018
Device: Kindle Paperwhite
Quote:
Originally Posted by willus View Post
Are you comfortable running things from the command line? (That gives me an idea to put a Tesseract diagnostic into the GUI...)
I'm certainly willing to give it a go! Is there a guide for that?
Old 11-23-2018, 06:46 PM   #1625
gg4u
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
Great!

Maybe I'm still missing a step.

Reading the options:
https://www.willus.com/k2pdfopt/help/options.shtml

I can't find how to finish transforming the OCRed PDF into MOBI format.

I processed a PDF of pure images (the written text could not be selected),
and I now have a new PDF, which looks like this:

https://imgur.com/UG5wKoa

The text can be selected, but it is still a PDF whose text quality could be improved.

I tried to convert this one to MOBI with calibre, but the output remains identical:
the characters are not clean, and the words are not selectable as "real fonts".

By comparison, if I use only Tesseract, I get a final TXT file. I can resize and select the font on the e-reader, but I lose the images, photos, tables, etc.

Can you show me the steps to convert a PDF to MOBI, to get the experience of an e-book?
(not PDF as the final output)
Old 11-24-2018, 03:02 PM   #1626
willus
Fuzzball, the purple cat
Posts: 1,020
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by msh2050 View Post
any update?
It dawned on me that the -vs option conflicts with the -bp option in what you were trying to do. You used -bp 3 to try and put a 3-inch gap between each source page in your output PDF, but the -vs option limits how large gaps can be in the destination document, and it defaults to 0.25 inches max. So if you add this to your options:

-vs 3

The -bp 3 option will do what you want. I should still have -bp 3 override this, but that's a quick fix for you.
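
As a small sketch of the point above, the two options can be kept in step by deriving both from one value. The `-bp` and `-vs` options and their values come from the post; the function name `k2pdfopt_args`, the executable name `k2pdfopt`, and the placement of the source file in the argument list are assumptions for illustration.

```python
def k2pdfopt_args(src_pdf, page_gap_in=3):
    """Build an argument list that requests a gap of page_gap_in inches
    between source pages (-bp) and raises the maximum allowed vertical
    gap (-vs) to the same value so the page gap is not clipped.
    """
    gap = str(page_gap_in)
    # Assumed executable name and argument order; adjust for your install.
    return ["k2pdfopt", src_pdf, "-bp", gap, "-vs", gap]
```

The resulting list could be passed to `subprocess.run()`; any other options you already use would need to be appended.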
Old 11-24-2018, 03:05 PM   #1627
willus
Fuzzball, the purple cat
Posts: 1,020
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by gg4u View Post
Can you show me the steps to convert a PDF to MOBI, to get the experience of an e-book?
(not PDF as the final output)
This is really not my area of expertise. The primary goal of k2pdfopt is to convert PDF to PDF, not PDF to epub or mobi. Even though I realize it can be a part of that process, I have no experience using it that way.
Old 11-27-2018, 07:18 AM   #1628
Umesh
Junior Member
Posts: 1
Karma: 42206
Join Date: Nov 2018
Device: Ipad
Using text to speech on converted files

Hello Willus, thanks for creating this wonderful piece of software. I use it quite a lot to convert my PDFs. However, I do face one problem: when I try to use Foxit PDF to read some of my PDFs aloud, it reads them in a haphazard way, with a lot of text being repeated. When I tried to select text in the converted PDF, it showed that although the text is now in a single column, the ghost of the previous two-column file is somehow visible to my Foxit reader, which also reads it aloud, resulting in repetition. Does native PDF output have something to do with it? What settings would work best if I want to use a text reader on my converted files? Please advise.
Attached image: image.jpg (813.3 KB)

Last edited by issybird; 11-28-2018 at 08:30 AM. Reason: Image thumbnailed.
Old 11-28-2018, 09:42 AM   #1629
willus
Fuzzball, the purple cat
Posts: 1,020
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by Umesh View Post
Hello Willus, thanks for creating this wonderful piece of software. I use it quite a lot to convert my PDFs. However, I do face one problem: when I try to use Foxit PDF to read some of my PDFs aloud, it reads them in a haphazard way, with a lot of text being repeated. When I tried to select text in the converted PDF, it showed that although the text is now in a single column, the ghost of the previous two-column file is somehow visible to my Foxit reader, which also reads it aloud, resulting in repetition. Does native PDF output have something to do with it? What settings would work best if I want to use a text reader on my converted files? Please advise.
Try adding the -ppgs option or clicking on the "post-process with ghostscript" checkbox if using the MS Windows GUI. You'll need to install Ghostscript if you haven't yet.
Code:
Command Line Options
--------------------
-ppgs[-]          Post process [do not post process] with ghostscript.  This
                  will take the final PDF output and process it using
                  ghostscript's pdfwrite device (assuming ghostscript is
                  available).  A benefit to doing this is that all "invisible"
                  and/or overlapping text regions (outside cropping areas) get
                  completely removed, so that text selection capability is
                  improved.  The actual ghostscript command used is:
                  gs -dSAFER -dBATCH -q -dNOPAUSE -sDEVICE=pdfwrite
                     -dPDFSETTINGS=/prepress -sOutputFile=<outfile>
                     <srcfile>
                  The default is not to post process with ghostscript.
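
For reference, the Ghostscript command quoted in the help text can be assembled as an argument list, for example to run the same post-processing by hand on an existing PDF. The flags are copied from the help text above; the function name `ghostscript_pdfwrite_args` and the `gs_exe` parameter are assumptions for illustration (the executable is often named differently on Windows).

```python
def ghostscript_pdfwrite_args(src_pdf, out_pdf, gs_exe="gs"):
    """Build the pdfwrite command shown in the k2pdfopt help text.

    src_pdf -- existing PDF to process (<srcfile> in the help text)
    out_pdf -- path of the rewritten PDF (<outfile> in the help text)
    gs_exe  -- Ghostscript executable name (assumed to be on PATH)
    """
    return [
        gs_exe,
        "-dSAFER",
        "-dBATCH",
        "-q",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        f"-sOutputFile={out_pdf}",
        src_pdf,
    ]
```

Passing the list to `subprocess.run(args, check=True)` would execute it, provided Ghostscript is installed.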
Old 12-06-2018, 11:19 AM   #1630
mauricebis
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2016
Device: kindle
can't get tesseract to work on linux

Hello,

Thanks for k2pdfopt. I tried to use the OCR option on Linux, but it doesn't seem happy with the path for finding tessdata, even though I extracted the files from tesseract-ocr-3.02.fra.tar.gz into the directory mentioned. Do I need to put other files there as tessdata? Thanks.
Old 12-06-2018, 03:35 PM   #1631
mauricebis
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2016
Device: kindle
I'm replying to myself since I found the issue: there should be a tessdata subdirectory under the path prefix.
Old 12-06-2018, 09:28 PM   #1632
willus
Fuzzball, the purple cat
Posts: 1,020
Karma: 8416109
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by mauricebis View Post
I'm replying to myself since I found the issue: there should be a tessdata subdirectory under the path prefix.
On-line help page fixed. I'm confusing myself--the new version I'm working on for Tesseract 4.0 does not require the tessdata subfolder. Sorry about that.
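
A quick way to check which of the two layouts discussed here is in place is to look for the language file both in a tessdata subfolder and directly under the prefix. This is a hedged sketch: the function name `find_traineddata` is hypothetical, and it only inspects the file system; it does not reproduce how k2pdfopt or Tesseract actually resolve the path.

```python
from pathlib import Path


def find_traineddata(prefix, lang="eng"):
    """Return the path of <lang>.traineddata under prefix, or None.

    Checks the layout with a tessdata subfolder first, then the layout
    with the file directly under the prefix.
    """
    prefix = Path(prefix)
    candidates = [
        prefix / "tessdata" / f"{lang}.traineddata",  # subfolder layout
        prefix / f"{lang}.traineddata",               # flat layout
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

For the French data mentioned above, `find_traineddata(prefix, "fra")` returning None would suggest the files were extracted to a location neither layout expects.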