k2pdfopt: optimizes PDFs for viewing on e-readers - Page 29

kundor · 05-11-2013, 03:41 PM

I'm using k2pdfopt to convert a large mathematical text. On the Tesseract download page, I noticed a file "tesseract-ocr-3.02.equ.tar.gz" which says it's a "Math / equation detection module for Tesseract 3.02." This sounds like it would help to OCR the math part correctly. The majority of the text is English. Is there some way to get the OCR engine to use this, in combination with the English training data?

willus · 05-11-2013, 06:20 PM

Quote:

Originally Posted by kundor

I'm using k2pdfopt to convert a large mathematical text. On the Tesseract download page, I noticed a file "tesseract-ocr-3.02.equ.tar.gz" which says it's a "Math / equation detection module for Tesseract 3.02." This sounds like it would help to OCR the math part correctly. The majority of the text is English. Is there some way to get the OCR engine to use this, in combination with the English training data?

I have no idea on this one--I suppose I need to do some homework on Tesseract and on whether there is a way to use multiple training files. You might try using the math training file and just see what you get.

kundor · 05-13-2013, 03:00 PM

Thanks for answering. On reflection, I guess I'm unlikely to search for a formula, rather than text, so it doesn't really matter much. I also learned that because of Tesseract's linear design, it can't handle a lot of math notation (fractions, radicals, superscripts, subscripts, matrices, cases...) regardless of training data.

By the way, it takes about 12 hours to OCR this document, which seems kind of silly when there is already a hidden text layer. Since it includes the location data, it seems like it might be possible to keep track of which words go with each chunk while you're slicing up the pages. Have you considered doing that?
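The idea of carrying the hidden text layer along with each cropped region can be sketched roughly as follows. This is only an illustration in Python, not how k2pdfopt or MuPDF actually work: the `Box`, `Word`, and `assign_words_to_chunks` names are hypothetical, and it assumes the text layer has already been extracted as words with page-coordinate bounding boxes and that the chunks are axis-aligned rectangles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


@dataclass(frozen=True)
class Word:
    """One word from the hidden text layer with its location."""
    text: str
    box: Box


def assign_words_to_chunks(words, chunks):
    """Map each chunk index to the words whose center falls inside it.

    A word goes to the first chunk containing its center point; words
    outside every chunk are dropped.
    """
    result = {i: [] for i in range(len(chunks))}
    for word in words:
        cx = (word.box.left + word.box.right) / 2
        cy = (word.box.top + word.box.bottom) / 2
        for i, chunk in enumerate(chunks):
            if chunk.contains(cx, cy):
                result[i].append(word.text)
                break
    return result


# Two side-by-side chunks cut from one line of a page.
words = [
    Word("unit", Box(10, 10, 40, 20)),
    Word("length", Box(60, 10, 100, 20)),
]
chunks = [Box(0, 0, 50, 30), Box(50, 0, 120, 30)]
assignment = assign_words_to_chunks(words, chunks)
```

Using the word's center rather than its full box means a word that straddles a cut is assigned to exactly one chunk; a real implementation would also need to translate each word's coordinates into the output page.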

willus · 05-13-2013, 10:25 PM

Quote:

Originally Posted by kundor

Thanks for answering. On reflection, I guess I'm unlikely to search for a formula, rather than text, so it doesn't really matter much. I also learned that because of Tesseract's linear design, it can't handle a lot of math notation (fractions, radicals, superscripts, subscripts, matrices, cases...) regardless of training data.

By the way, it takes about 12 hours to OCR this document, which seems kind of silly when there is already a hidden text layer. Since it includes the location data, it seems like it might be possible to keep track of which words go with each chunk while you're slicing up the pages. Have you considered doing that?

I kind of wondered why you would want to OCR equations, but I figured maybe you wanted to search for certain symbols.

I don't know enough about how MuPDF parses PDF streams to keep track of which characters are placed where--it will take some education on my part. Certainly sounds feasible--I'll add it to my wish list. Is it possible for you to use native output at all, or do you definitely need text re-flow?

kundor · 05-19-2013, 10:03 PM

The original file is DjVu, for which native output isn't supported. I actually tried to convert it to a PDF while preserving the hidden text, so that I could then use k2pdfopt in native mode to pretty it up for my Kindle. I used djvu2hocr (from the ocrodjvu package) to extract the text layer. Then I should be able to use either Hocr2PDF, from ExactImage, or PDFBeads to merge it back with the images. On my MacBook, hocr2pdf produces a 1.4 MB PDF which freezes Adobe Reader and looks like gibberish in Preview.
I finally did succeed with pdfbeads, but it wasn't entirely straightforward.

On the other hand, if I reflow, k2pdfopt mangles a lot of the math formulas. I realize this might not be a high priority, but I thought I'd report it in case it would be easy or interesting to fix.
For instance, this formula:

ends up like this:

Here's a formula with superscripts and subscripts that get unaligned:

ends up as

There are also some issues with inline text, when math stuff overlaps a line.
For instance, the bottom of the fraction 7/2 here:

gets cut off and ends up floating beneath the word "unit" here:

This particular issue (usually with a "2") happens a lot in this book, I noticed.

A second issue to report: while processing a PDF file produced by k2pdfopt with Ghostscript, I get hundreds of these errors:

Code:

   **** Unknown operator: 'inf'
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.

It ends up by saying:

Code:

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> K2pdfopt v1.65 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

So, as instructed, I'm notifying you! A little searching turned up this bug, where the Ghostscript developers say that some floating point value is being written to the PDF, but "inf" is not valid in the PDF format, even if the floating point is INF.

Sorry if I'm giving you trouble—I like the software a lot! It's attractive for mathematical use since it uses the original images, so that strange symbols and letters from lots of alphabets are always preserved. Thanks for making it.

jokodzuna · 05-23-2013, 09:08 AM

Hi, when I use k2pdfopt to optimise a full e-book, it reports that everything was done, but when I open the file, I find that only part of the book was optimised; the rest of it is impossible to open (it says "error trying to read document" or something similar). It happens every time I use it. If I only do it for a small page range, it works fine. Does anyone know what the problem is? Thank you.

willus · 05-23-2013, 06:46 PM

@kundor-- Can you post an example k2pdfopt file with inf's and the source file and command that created it?

willus · 05-23-2013, 06:48 PM

Quote:

Originally Posted by jokodzuna

Hi, when I use k2pdfopt to optimise a full e-book, it reports that everything was done, but when I open the file, I find that only part of the book was optimised; the rest of it is impossible to open (it says "error trying to read document" or something similar). It happens every time I use it. If I only do it for a small page range, it works fine. Does anyone know what the problem is? Thank you.

Can you post one of these "full e-books"?

willus · 05-23-2013, 11:08 PM

Quote:

Originally Posted by kundor

The original file is DjVu, for which native output isn't supported. I actually tried to convert it to a PDF while preserving the hidden text, so that I could then use k2pdfopt in native mode to pretty it up for my Kindle. I used djvu2hocr (from the ocrodjvu package) to extract the text layer. Then I should be able to use either Hocr2PDF, from ExactImage, or PDFBeads to merge it back with the images. On my MacBook, hocr2pdf produces a 1.4 MB PDF which freezes Adobe Reader and looks like gibberish in Preview.
I finally did succeed with pdfbeads, but it wasn't entirely straightforward.

Thank you for the links. I wasn't aware of these applications.

Quote:

Originally Posted by kundor

On the other hand, if I reflow, k2pdfopt mangles a lot of the math formulas. I realize this might not be a high priority, but I thought I'd report it in case it would be easy or interesting to fix.
For instance, this formula:

ends up like this:

Here's a formula with superscripts and subscripts that get unaligned:

ends up as

There are also some issues with inline text, when math stuff overlaps a line.
For instance, the bottom of the fraction 7/2 here:

gets cut off and ends up floating beneath the word "unit" here:

This particular issue (usually with a "2") happens a lot in this book, I noticed.

Please attach a couple of example DjVu pages if you can. There are probably some settings adjustments that can be made.

Quote:

Originally Posted by kundor

A second issue to report: while processing a PDF file produced by k2pdfopt with Ghostscript, I get hundreds of these errors:

Code:

   **** Unknown operator: 'inf'
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.

It ends up by saying:

Code:

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> K2pdfopt v1.65 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

So, as instructed, I'm notifying you! A little searching turned up this bug, where the Ghostscript developers say that some floating point value is being written to the PDF, but "inf" is not valid in the PDF format, even if the floating point is INF.

Again, please attach an example of the source file and command options that cause the generation of the bad PDF file if you can.

markom · 05-24-2013, 03:56 PM

Quote:

Originally Posted by kundor

I'm using k2pdfopt to convert a large mathematical text. On the Tesseract download page, I noticed a file "tesseract-ocr-3.02.equ.tar.gz" which says it's a "Math / equation detection module for Tesseract 3.02." This sounds like it would help to OCR the math part correctly. The majority of the text is English. Is there some way to get the OCR engine to use this, in combination with the English training data?

Have you tried kindlepdfviewer already? It reads DjVu and offers fit-to-document-width (or height), fit-to-content-width (or height) in portrait and landscape, and two-point cropping.

Reflow also.

https://www.mobileread.com/forums/sho....php?p=2466450

You can also convert the DjVu to an image-only PDF, run k2pdfopt on it, and then use ABBYY FineReader, Acrobat, etc. to OCR the k2pdfopt output (in text-under-image mode).

OCR should take about an hour in detailed mode, or half an hour in quick mode, for an average book.

https://www.mobileread.com/forums/sho...&postcount=413

kundor · 05-26-2013, 02:52 PM

Hi Willus,
I've attached some djvu pages that contain the example formulas that got mangled (on the top of the first page, the bottom of the second, and the middle of the last page.)
When I run k2pdfopt on just this selection, the problem with "inf"s does not occur. But when I run "k2pdfopt -ocr" on the full source, it does. I will send you a PM.

willus · 05-26-2013, 07:45 PM

Quote:

Originally Posted by kundor

...
A second issue to report: while processing a PDF file produced by k2pdfopt with Ghostscript, I get hundreds of these errors:

Code:

   **** Unknown operator: 'inf'
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.

@Kundor sent me a test case and I have been able to reproduce this bug, which is a problem in the algorithm where k2pdfopt selects the OCR font size (so it only happens when OCR is turned on). I will fix it in the next release.
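For anyone writing PDF content streams themselves, a guard along the following lines avoids emitting `inf`. It is a minimal Python sketch, not k2pdfopt's actual code or fix: the `pdf_number` helper and the `/F1` font name are hypothetical. It relies on the fact that PDF real numbers must be plain decimals, so `inf`, `nan`, and exponent notation are all invalid.

```python
import math


def pdf_number(value: float, places: int = 4) -> str:
    """Format a number as a plain-decimal PDF numeric token.

    Raises ValueError for inf/nan instead of writing an invalid token,
    so a bad upstream calculation (e.g. a division by zero while picking
    a font size) is caught before it reaches the file.
    """
    if not math.isfinite(value):
        raise ValueError(f"cannot write non-finite number to PDF: {value!r}")
    # Fixed-point formatting never produces exponent notation.
    text = f"{value:.{places}f}"
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    if text in ("-0", ""):
        text = "0"
    return text


def set_font_operator(size: float) -> str:
    """Build a 'Tf' (set font and size) operator; '/F1' is a placeholder name."""
    return f"/F1 {pdf_number(size)} Tf"
```

Calling `set_font_operator(float("inf"))` raises a `ValueError` rather than producing a stream that Ghostscript would reject with "Unknown operator: 'inf'".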

curiouscat · 06-09-2013, 02:51 PM

Hi, I have a Sony PRS-T2. I like the layout that k2pdfopt gives me by default, but I can't work out how to increase the font size of the output file. Also, the option to highlight my PDF seems to disappear after conversion.

Help appreciated. Thanks

willus · 06-09-2013, 03:25 PM

Quote:

Originally Posted by curiouscat

Hi, I have a Sony PRS-T2. I like the layout that k2pdfopt gives me by default, but I can't work out how to increase the font size of the output file. Also, the option to highlight my PDF seems to disappear after conversion.

Help appreciated. Thanks

Try spending a few minutes reading the k2pdfopt FAQ page. See the second question and the second-to-last question. If you can't figure things out after that, post again.

curiouscat · 06-09-2013, 04:34 PM

Hi Willus,

Probably being thick. I had already read that numerous times before I posted! I understand the principle but can't figure out the command lines. Same with the OCR: I have downloaded the English file and set up the environment variable as described in the how-to, but I get an error saying the file cannot be opened. I created a new environment variable under users. I didn't know what to put in the second box, so I just copied yours. After that I'm stumped.

Also, with more than two options configured, the programme crashes. I have Windows Starter. I have already tried downloading and using the less aggressive version, with the same results. Maybe it has to do with my incorrect configuration? I'm now back on the latest release.
