05-11-2013, 03:41 PM | #421 |
Junior Member
Posts: 5
Karma: 5998
Join Date: Oct 2011
Device: Kindle 3
|
Tesseract math
I'm using k2pdfopt to convert a large mathematical text. On the Tesseract download page, I noticed a file "tesseract-ocr-3.02.equ.tar.gz" which says it's a "Math / equation detection module for Tesseract 3.02." This sounds like it would help to OCR the math part correctly. The majority of the text is English. Is there some way to get the OCR engine to use this, in combination with the English training data?
|
05-11-2013, 06:20 PM | #422 | |
Fuzzball, the purple cat
Posts: 1,282
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
05-13-2013, 03:00 PM | #423 |
Junior Member
Posts: 5
Karma: 5998
Join Date: Oct 2011
Device: Kindle 3
|
Thanks for answering. On reflection, I guess I'm unlikely to search for a formula, rather than text, so it doesn't really matter much. I also learned that because of Tesseract's linear design, it can't handle a lot of math notation (fractions, radicals, superscripts, subscripts, matrices, cases...) regardless of training data.
By the way, it takes about 12 hours to OCR this document, which seems kind of silly when there is already a hidden text layer. Since it includes the location data, it seems like it might be possible to keep track of which words go with each chunk while you're slicing up the pages. Have you considered doing that? |
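To make the idea above concrete, here is a minimal sketch of how words from a hidden text layer could be matched to page chunks by position. It is not how k2pdfopt or MuPDF work internally; `Box`, `Word`, and `assign_words_to_chunks` are hypothetical names invented for this illustration, and it assumes each word's bounding box and each cropped chunk are already known in the same page coordinates.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains_point(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    """One word from the hidden text layer, with its location on the page."""
    text: str
    box: Box


def assign_words_to_chunks(words: List[Word], chunks: List[Box]) -> Dict[int, List[str]]:
    """Map each chunk index to the words whose centre point lies inside it.

    A word goes to the first chunk containing its centre; words that fall
    outside every chunk (for example in a cropped margin) are dropped.
    """
    result: Dict[int, List[str]] = {i: [] for i in range(len(chunks))}
    for word in words:
        cx = (word.box.left + word.box.right) / 2
        cy = (word.box.top + word.box.bottom) / 2
        for i, chunk in enumerate(chunks):
            if chunk.contains_point(cx, cy):
                result[i].append(word.text)
                break
    return result
```

Using the centre point rather than the full box keeps a word in a single chunk even when a cut line grazes its ascenders or descenders. A real implementation would also have to shift each word's coordinates into the output page after the chunk is moved and scaled.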
05-13-2013, 10:25 PM | #424 | |
Fuzzball, the purple cat
Posts: 1,282
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
I don't know enough about how MuPDF parses PDF streams to keep track of which characters are placed where--it will take some education on my part. Certainly sounds feasible--I'll add it to my wish list. Is it possible for you to use native output at all, or do you definitely need text re-flow? |
|
05-19-2013, 10:03 PM | #425 |
Junior Member
Posts: 5
Karma: 5998
Join Date: Oct 2011
Device: Kindle 3
|
The original file is DjVu, for which native output isn't supported. I actually tried to convert it to a PDF while preserving the hidden text, so that I could then use k2pdfopt in native mode to pretty it up for my Kindle. I used djvu2hocr (from the ocrodjvu package) to extract the text layer. Then I should be able to use either hocr2pdf, from ExactImage, or PDFBeads to merge it back with the images. On my MacBook, hocr2pdf produces a 1.4 MB PDF which freezes Adobe Reader and looks like gibberish in Preview.
I finally did succeed with pdfbeads, but it wasn't entirely straightforward.
On the other hand, if I reflow, k2pdfopt mangles a lot of the math formulas. I realize this might not be a high priority, but I thought I'd report it in case it would be easy or interesting to fix.
For instance, this formula: [image] ends up like this: [image]
Here's a formula with superscripts and subscripts that get unaligned: [image] ends up as [image]
There are also some issues with inline text, when math stuff overlaps a line. For instance, the bottom of the fraction 7/2 here: [image] gets cut off and ends up floating beneath the word "unit" here: [image] This particular issue (usually with a "2") happens a lot in this book, I noticed.
A second issue to report: while processing a PDF file produced by k2pdfopt with Ghostscript, I get hundreds of these errors:
Code:
**** Unknown operator: 'inf'
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
Code:
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> K2pdfopt v1.65 <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
Sorry if I'm giving you trouble; I like the software a lot! It's attractive for mathematical use since it uses the original images, so that strange symbols and letters from lots of alphabets are always preserved. Thanks for making it. |
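For anyone trying to narrow down which pages trigger the Ghostscript message above, here is a rough sketch that scans a PDF's content streams for bare `inf` or `nan` tokens. The thread does not establish where the 'inf' comes from; the guess here is that a non-finite number was written where a numeric operand belongs. `find_nonfinite_tokens` is a hypothetical helper, not part of k2pdfopt or Ghostscript, and it assumes streams are either uncompressed or Flate-compressed and that no text strings happen to contain those words.

```python
import re
import zlib
from typing import List, Tuple

# Naive stream finder: good enough for a quick diagnostic, not a PDF parser.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# Spellings a C-style formatter may emit for non-finite floats (assumption).
NONFINITE = {b"inf", b"infinity", b"nan"}


def find_nonfinite_tokens(pdf_bytes: bytes) -> List[Tuple[int, bytes]]:
    """Return (stream_index, token) pairs for every inf/nan-like token."""
    hits: List[Tuple[int, bytes]] = []
    for index, match in enumerate(STREAM_RE.finditer(pdf_bytes)):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing bytes after the zlib data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # assume the stream was stored uncompressed
        for token in data.split():
            if token.lower().lstrip(b"+-") in NONFINITE:
                hits.append((index, token))
    return hits


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for stream_index, token in find_nonfinite_tokens(handle.read()):
            print(f"stream {stream_index}: suspicious token {token!r}")
```

Run it as `python script.py output.pdf`, where both names are placeholders. The stream index only counts streams in file order, so mapping a hit back to a page number still needs a proper PDF tool.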
05-23-2013, 09:08 AM | #426 |
Junior Member
Posts: 1
Karma: 10
Join Date: May 2013
Device: kindle
|
optimises only part of a book
Hi, when I use k2pdfopt to optimise a full e-book, it reports that everything was done, but when I open the file, I find that only part of the book was optimised; the rest of it is impossible to open (it says error trying to read document, or something similar). It happens every time I use it. If I only do it for a small page range, it works fine. Does anyone know what the problem is? Thank you.
|
05-23-2013, 06:46 PM | #427 |
Fuzzball, the purple cat
Posts: 1,282
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
@kundor-- Can you post an example k2pdfopt file with inf's and the source file and command that created it?
|
05-23-2013, 06:48 PM | #428 | |
Fuzzball, the purple cat
Posts: 1,282
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
05-23-2013, 11:08 PM | #429 | |||
Fuzzball, the purple cat
Posts: 1,282
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
Quote:
Quote:
|
|||
05-24-2013, 03:56 PM | #430 | |
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad
|
Quote:
Reflow also. https://www.mobileread.com/forums/sho....php?p=2466450 You can also convert the DjVu to an image PDF and then, after k2pdfopt, use ABBYY FineReader, Acrobat, etc. to OCR that k2pdfopt image PDF (in text-under-image mode). OCR should take about an hour for a detailed pass, or half an hour for a quick pass, on an average book. https://www.mobileread.com/forums/sho...&postcount=413 Last edited by markom; 05-24-2013 at 04:57 PM. |
|
05-26-2013, 02:52 PM | #431 |
Junior Member
Posts: 5
Karma: 5998
Join Date: Oct 2011
Device: Kindle 3
|
Hi Willus,
I've attached some djvu pages that contain the example formulas that got mangled (on the top of the first page, the bottom of the second, and the middle of the last page.) When I run k2pdfopt on just this selection, the problem with "inf"s does not occur. But when I run "k2pdfopt -ocr" on the full source, it does. I will send you a PM. |
05-26-2013, 07:45 PM | #432 | |
Fuzzball, the purple cat
Posts: 1,282
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
06-09-2013, 02:51 PM | #433 |
Enthusiast
Posts: 26
Karma: 11998
Join Date: Jun 2013
Location: UK
Device: Kindle Oasis
|
Hi, I have a Sony PRS-T2. I like the layout that k2pdfopt gives me by default, but I can't work out how to increase the font size of the output file. Also, the option to highlight my PDF seems to disappear after conversion.
Help appreciated. Thanks |
06-09-2013, 03:25 PM | #434 | |
Fuzzball, the purple cat
Posts: 1,282
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
06-09-2013, 04:34 PM | #435 |
Enthusiast
Posts: 26
Karma: 11998
Join Date: Jun 2013
Location: UK
Device: Kindle Oasis
|
Hi Willus,
Probably being thick. I had already read that numerous times before I posted! I understand the principle but can't figure out the command lines. Same with the OCR: I have downloaded the English file and set up the environment as in the how-to, but I get an error saying the file cannot be opened. I created a new environment variable under users. I didn't know what to put in the second box, so I just copied yours. After that I'm stumped. Also, with more than two options configured, the programme crashes. I have Windows Starter. I have already tried downloading and using the less aggressive version, with the same results. Maybe it has to do with my incorrect configuration? Now back to the latest release. |