11-18-2018, 02:30 PM | #1621 | |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
For Tesseract 3.0.5, you need these files in your data folder:
eng.cube.params
eng.cube.nn
eng.cube.bigrams
eng.cube.lm
eng.tesseract_cube.nn
eng.cube.word-freq
eng.cube.size
eng.cube.fold
eng.traineddata
|
11-19-2018, 09:53 AM | #1622 |
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
|
Oh, thank you Willus,
keeping only the eng files will free up some disk space.

Could you suggest how to make the best use of k2pdfopt? I'd like to reflow a PDF of scanned images into an EPUB that contains figures and chapters. k2pdfopt seems to detect where the images are. I processed the original PDF into an OCRed version, but the characters are blurred. For comparison, I tried Ghostscript and Tesseract: from PDF to TIFF, then from TIFF to TXT. There the results were quite good, but I lose all the figures and the chapter markup.

As the final result for the written text, I would like an EPUB or MOBI (with sharp rendering of characters), not a PDF, but still with the figures and a TOC. Is there perhaps a format other than TXT that Tesseract can export to and that keeps the images (RTF)? I could eventually mark up the TOC manually; what is the correct markup? What steps should I take to convert a PDF into an EPUB containing images and markup?

I also shared this thread: https://www.mobileread.com/forums/sh...d.php?t=312652

May I also ask how you approached the problem of detecting figures in a PDF? I'm interested in the problem solving.
|
11-22-2018, 08:35 AM | #1623 | |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
11-23-2018, 05:27 PM | #1624 |
Junior Member
Posts: 4
Karma: 42206
Join Date: Nov 2018
Device: Kindle Paperwhite
|
|
11-23-2018, 05:46 PM | #1625 |
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
|
Great!
Maybe I'm still missing a step. Reading the options at https://www.willus.com/k2pdfopt/help/options.shtml I can't find how to finish transforming the OCRed PDF into MOBI format. I processed a PDF of pure images (the written text could not be selected), and I now have a new PDF, which looks like this: https://imgur.com/UG5wKoa The text can be selected, but it is still a PDF, and the quality of the text could be improved. I tried to convert this one to MOBI with calibre, but the output remains identical: the characters are not clean, and the words are not selectable as "real fonts". By comparison, if I only use Tesseract, I get a final TXT file; I can resize the text and select the font on the e-reader, but I lose the images, photos, tables, etc. Can you show me the steps to convert a PDF to MOBI, to get the experience of an ebook (not a PDF as the final output)?
|
11-24-2018, 02:02 PM | #1626 |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
It dawned on me that the -vs option conflicts with the -bp option in what you were trying to do. You used -bp 3 to try and put a 3-inch gap between each source page in your output PDF, but the -vs option limits how large gaps can be in the destination document, and it defaults to 0.25 inches max. So if you add this to your options:
-vs 3
the -bp 3 option will do what you want. I should still have -bp 3 override this, but that's a quick fix for you.
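For anyone scripting this, here is a minimal Python sketch of the workaround described above: it builds a k2pdfopt argument list in which -vs is set to the same value as -bp, so the vertical-gap limit does not clip the page gap. The helper name `k2pdfopt_gap_args`, the file name `input.pdf`, and the placement of the options before the file name are assumptions for illustration; only the -bp and -vs options themselves come from the post.

```python
import shlex


def k2pdfopt_gap_args(src_pdf, gap_inches=3.0):
    """Build a k2pdfopt command line that keeps a gap between source pages.

    Hypothetical helper: -bp requests the gap and -vs raises the maximum
    vertical gap to the same value, as suggested in the post.
    """
    gap = f"{gap_inches:g}"  # 3.0 -> "3", 2.5 -> "2.5"
    return ["k2pdfopt", "-bp", gap, "-vs", gap, src_pdf]


# Show the command as it would be typed in a shell.
print(shlex.join(k2pdfopt_gap_args("input.pdf")))
```

The list can be passed to `subprocess.run` as-is once k2pdfopt is on the PATH; any other options you already use would need to be appended to it.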
11-24-2018, 02:05 PM | #1627 |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
This is really not my area of expertise. The primary goal of k2pdfopt is to convert PDF to PDF, not PDF to epub or mobi. Even though I realize it can be a part of that process, I have no experience using it that way.
|
11-27-2018, 06:18 AM | #1628 |
Nameless Being
|
Using text to speech on converted files
Hello Willus, thanks for creating this wonderful piece of software. I use it quite a lot to convert my PDFs. However, I do face one problem: when I try to use Foxit PDF to read some of my PDFs aloud, it reads them in a haphazard way, with a lot of text being repeated. When I tried to select text in the converted PDF, it showed that although the text is now in a single column, the ghost of the previous two-column file is somehow still visible to my Foxit Reader, which reads it aloud as well, resulting in repetition. Does native PDF output have something to do with it? What settings would work best if I want to use a text reader on my converted files? Please advise.
Last edited by issybird; 11-28-2018 at 07:30 AM. Reason: Image thumbnailed. |
11-28-2018, 08:42 AM | #1629 | |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
Code:
Command Line Options
--------------------
-ppgs[-]  Post process [do not post process] with ghostscript. This will take the final PDF output and process it using ghostscript's pdfwrite device (assuming ghostscript is available). A benefit to doing this is that all "invisible" and/or overlapping text regions (outside cropping areas) get completely removed, so that text selection capability is improved. The actual ghostscript command used is:
              gs -dSAFER -dBATCH -q -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile=<outfile> <srcfile>
          The default is not to post process with ghostscript.
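If you would rather run that Ghostscript step yourself on an already converted file, the command quoted in the help text can be assembled as below. This is only a sketch: the function names and the example file names are made up for illustration, and it assumes the Ghostscript executable is called `gs` and is on the PATH (on Windows it usually has a different name).

```python
import shutil
import subprocess


def ghostscript_pdfwrite_args(src_pdf, out_pdf):
    """Return the gs command from the -ppgs help text as an argument list."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-q", "-dNOPAUSE",
        "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/prepress",
        f"-sOutputFile={out_pdf}", src_pdf,
    ]


def postprocess_with_ghostscript(src_pdf, out_pdf):
    """Run the command if gs is available; return False if it is not."""
    if shutil.which("gs") is None:  # assumption: the executable is named "gs"
        return False
    subprocess.run(ghostscript_pdfwrite_args(src_pdf, out_pdf), check=True)
    return True


# Example with hypothetical file names:
# postprocess_with_ghostscript("book_k2opt.pdf", "book_k2opt_clean.pdf")
```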
|
12-06-2018, 10:19 AM | #1630 |
Junior Member
Posts: 4
Karma: 42208
Join Date: Oct 2016
Device: kindle
|
can't get tesseract to work on linux
Hello,
Thanks for k2pdfopt. I tried to use the OCR option on Linux, but it doesn't seem to be happy with the path for finding tessdata, even though I extracted the files from tesseract-ocr-3.02.fra.tar.gz into the mentioned directory. Do I need to put other files in tessdata? Thanks.
12-06-2018, 02:35 PM | #1631 |
Junior Member
Posts: 4
Karma: 42208
Join Date: Oct 2016
Device: kindle
|
I reply to myself since I found the issue. There should be a subdirectory /tessdata under the path prefix.
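In case it helps others hitting the same thing, here is a small Python sketch that checks the layout described above: given the path prefix, it looks for a `tessdata` subdirectory and lists the language files inside it. The function name and the example prefix are hypothetical, and matching on the `*.traineddata` suffix is an assumption about how the language files are named.

```python
from pathlib import Path


def find_language_files(prefix):
    """List *.traineddata files under <prefix>/tessdata.

    Returns None when the tessdata subdirectory itself is missing, which is
    the situation described in the post above.
    """
    tessdata = Path(prefix) / "tessdata"
    if not tessdata.is_dir():
        return None
    return sorted(p.name for p in tessdata.glob("*.traineddata"))


# Example with a hypothetical prefix:
# print(find_language_files("/usr/local/share"))
```

A result of None means the subdirectory needs to be created; an empty list means the directory exists but no language files were found in it.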
|
12-06-2018, 08:28 PM | #1632 | |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
12-27-2018, 12:39 PM | #1633 |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
k2pdfopt v2.50 released
K2pdfopt v2.50 is released. The major enhancement in this release is compiling in the Tesseract v4.0.0 library. Most other third-party libraries have also been updated to recent releases. See details at the web site.
|
12-30-2018, 04:10 PM | #1634 |
Junior Member
Posts: 1
Karma: 42208
Join Date: Dec 2018
Device: KT
|
Problem with version 2.50
Hi Willus
For some reason the 64-bit build of version 2.50 is not working. I tried it on 64-bit Windows 10 and it does nothing; I also tried from the command line with -gui-, and again nothing happens. The 32-bit build of version 2.50 works OK, and so does the 64-bit build of version 2.42. I re-downloaded it, this time from willus.org (instead of willus.com), and re-verified the SHA-256 checksum, but it still does not work. By the way, out of curiosity, what does the fast preview option do? I've done previews with it selected and unselected and couldn't perceive any difference.
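For reference, the SHA-256 check mentioned above can be done with Python's standard library as sketched here. The function names and the example file name are placeholders, and the published checksum has to be copied from the download page; none is reproduced here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Example with a hypothetical file name and a placeholder checksum:
# matches_published("k2pdfopt.exe", "<checksum from the download page>")
```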
12-30-2018, 07:21 PM | #1635 | |
Enthusiast
Posts: 27
Karma: 122330
Join Date: Sep 2017
Device: ipad , Kindle PW3
|
Quote:
Thanks for the help, it works now... that's great... Please note that when I try to run the OCR, the program closes as soon as it starts to use Tesseract (I tried with and without the GUI, and I tried the 64-bit, 32-bit, and old-CPU versions). Please find attached the cmd error that I get before it closes. Regards
Last edited by msh2050; 12-30-2018 at 07:25 PM. Reason: adding more info
|