Old 05-11-2013, 03:41 PM   #421
kundor
Junior Member
Posts: 5
Karma: 5998
Join Date: Oct 2011
Device: Kindle 3
Tesseract math

I'm using k2pdfopt to convert a large mathematical text. On the Tesseract download page, I noticed a file "tesseract-ocr-3.02.equ.tar.gz" which says it's a "Math / equation detection module for Tesseract 3.02." This sounds like it would help to OCR the math part correctly. The majority of the text is English. Is there some way to get the OCR engine to use this, in combination with the English training data?
Old 05-11-2013, 06:20 PM   #422
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by kundor View Post
I'm using k2pdfopt to convert a large mathematical text. On the Tesseract download page, I noticed a file "tesseract-ocr-3.02.equ.tar.gz" which says it's a "Math / equation detection module for Tesseract 3.02." This sounds like it would help to OCR the math part correctly. The majority of the text is English. Is there some way to get the OCR engine to use this, in combination with the English training data?
I have no idea on this one--I suppose I need to do some homework on Tesseract and if there is a way to use multiple training files. You might try using the math training file and just see what you get.
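For what it's worth, Tesseract 3.02 appears to accept several training files at once when they are joined with "+" in the -l option. The sketch below only builds such a command line; the file names are placeholders, and whether loading "equ" next to "eng" improves anything is an assumption that would need testing. As far as I can tell, the equation detector is switched on separately through Tesseract's textord_equation_detect parameter.

```python
def tesseract_args(image_path, output_base, languages=("eng", "equ")):
    """Build a tesseract command line that loads several traineddata files.

    Assumption: each name in `languages` has a matching .traineddata file
    in the tessdata directory (eng.traineddata, equ.traineddata, ...).
    """
    if not languages:
        raise ValueError("at least one language is required")
    # Tesseract combines languages with '+', for example "-l eng+equ".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Placeholder file names, for illustration only.
args = tesseract_args("page-001.png", "page-001")
print(" ".join(args))
```

The resulting list can be passed to subprocess.run() once Tesseract and both training files are installed.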
Old 05-13-2013, 03:00 PM   #423
kundor
Junior Member
Posts: 5
Karma: 5998
Join Date: Oct 2011
Device: Kindle 3
Thanks for answering. On reflection, I guess I'm unlikely to search for a formula, rather than text, so it doesn't really matter much. I also learned that because of Tesseract's linear design, it can't handle a lot of math notation (fractions, radicals, superscripts, subscripts, matrices, cases...) regardless of training data.

By the way, it takes about 12 hours to OCR this document, which seems kind of silly when there is already a hidden text layer. Since it includes the location data, it seems like it might be possible to keep track of which words go with each chunk while you're slicing up the pages. Have you considered doing that?
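To make the idea concrete, here is a minimal sketch of how words from an existing text layer could be matched to the regions a page is sliced into. It is not how k2pdfopt works internally; the Box and Word types and the "centre point decides" rule are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


@dataclass(frozen=True)
class Word:
    text: str
    box: Box


def assign_words(words, regions):
    """Map each region index to the words whose box centre lies inside it."""
    assigned = {i: [] for i in range(len(regions))}
    for word in words:
        cx = (word.box.x0 + word.box.x1) / 2
        cy = (word.box.y0 + word.box.y1) / 2
        for i, region in enumerate(regions):
            if region.contains(cx, cy):
                assigned[i].append(word.text)
                break  # a word belongs to at most one region
    return assigned


# Two stacked regions on a hypothetical 100 x 200 page.
regions = [Box(0, 0, 100, 100), Box(0, 100, 100, 200)]
words = [Word("unit", Box(10, 10, 40, 20)), Word("cell", Box(10, 110, 40, 120))]
print(assign_words(words, regions))
```

Each region could then carry its words along when it is moved or scaled, so the text would not have to be recognised again.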
Old 05-13-2013, 10:25 PM   #424
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by kundor View Post
Thanks for answering. On reflection, I guess I'm unlikely to search for a formula, rather than text, so it doesn't really matter much. I also learned that because of Tesseract's linear design, it can't handle a lot of math notation (fractions, radicals, superscripts, subscripts, matrices, cases...) regardless of training data.

By the way, it takes about 12 hours to OCR this document, which seems kind of silly when there is already a hidden text layer. Since it includes the location data, it seems like it might be possible to keep track of which words go with each chunk while you're slicing up the pages. Have you considered doing that?
I kind of wondered why you would want to OCR equations, but I figured maybe you wanted to search for certain symbols.

I don't know enough about how MuPDF parses PDF streams to keep track of which characters are placed where--it will take some education on my part. Certainly sounds feasible--I'll add it to my wish list. Is it possible for you to use native output at all, or do you definitely need text re-flow?
Old 05-19-2013, 10:03 PM   #425
kundor
Junior Member
Posts: 5
Karma: 5998
Join Date: Oct 2011
Device: Kindle 3
The original file is DjVu, for which native output isn't supported. I actually tried to convert it to a PDF while preserving the hidden text, so that I could then use k2pdfopt in native mode to pretty it up for my Kindle. I used djvu2hocr (from the ocrodjvu package) to extract the text layer. In principle I should then be able to use either hocr2pdf, from ExactImage, or PDFBeads to merge it back with the images. On my MacBook, hocr2pdf produces a 1.4 MB PDF which freezes Adobe Reader and looks like gibberish in Preview.
I finally did succeed with pdfbeads, but it wasn't entirely straightforward.
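In case it helps anyone following the same route, the hOCR file is just HTML in which each element carries a "bbox x0 y0 x1 y1" entry in its title attribute. The sketch below pulls out the words and their boxes. It assumes that words are marked with the class ocrx_word and that word spans contain no nested spans, which may not hold for every hOCR producer.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) pairs from ocrx_word elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans are not nested inside each other.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


sample = (
    '<span class="ocr_line" title="bbox 10 10 200 30">'
    '<span class="ocrx_word" title="bbox 10 10 60 30">unit</span> '
    '<span class="ocrx_word" title="bbox 70 10 120 30">cell</span>'
    "</span>"
)
print(parse_hocr_words(sample))
```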



On the other hand, if I reflow, k2pdfopt mangles a lot of the math formulas. I realize this might not be a high priority, but I thought I'd report it in case it would be easy or interesting to fix.
For instance, this formula:
[image: original formula]
ends up like this:
[image: the same formula after reflow]
Here's a formula with superscripts and subscripts that get unaligned:
[image: original formula]
ends up as
[image: the same formula after reflow]
There are also some issues with inline text, when math stuff overlaps a line.
For instance, the bottom of the fraction 7/2 here:
[image: original line containing 7/2]
gets cut off and ends up floating beneath the word "unit" here:
[image: reflowed output]
This particular issue (usually with a "2") happens a lot in this book, I noticed.



A second issue to report: while processing a PDF file produced by k2pdfopt with Ghostscript, I get hundreds of these errors:
Code:
   **** Unknown operator: 'inf'
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.
It ends up by saying:
Code:
   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> K2pdfopt v1.65 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.
So, as instructed, I'm notifying you! A little searching turned up this bug, where the Ghostscript developers say that some floating point value is being written to the PDF, but "inf" is not valid in the PDF format, even if the floating point is INF.
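To illustrate the kind of guard the Ghostscript developers' comment points to, here is a small sketch of a number formatter that refuses to emit non-finite values. It is only an illustration in Python, not k2pdfopt's actual code; the function name and the four-digit precision are my own choices.

```python
import math


def pdf_number(value, digits=4):
    """Format a float as a PDF numeric operand.

    PDF numbers are plain integers or fixed-point reals, so "inf", "nan"
    and exponent notation are rejected or avoided here.
    """
    if not math.isfinite(value):
        raise ValueError(f"cannot write {value!r} as a PDF number")
    text = f"{value:.{digits}f}"
    if "." in text:
        # Trim trailing zeros, then a dangling decimal point.
        text = text.rstrip("0").rstrip(".")
    return "0" if text in ("", "-0") else text


print(pdf_number(12.5), pdf_number(3.0))
```

Raising an error at write time makes the faulty calculation visible immediately, instead of producing a file that only some readers complain about.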



Sorry if I'm giving you trouble—I like the software a lot! It's attractive for mathematical use since it uses the original images, so that strange symbols and letters from lots of alphabets are always preserved. Thanks for making it.
Old 05-23-2013, 09:08 AM   #426
jokodzuna
Junior Member
Posts: 1
Karma: 10
Join Date: May 2013
Device: kindle
optimises only part of a book

Hi, when I use k2pdfopt to optimise a full e-book, it reports that everything was done, but when I open the file, I find that only part of the book was optimised; the rest of it is impossible to open (it says "error trying to read document" or something similar). It happens every time I use it. If I only do it for a small page range, it works fine. Does anyone know what the problem is? Thank you.
Old 05-23-2013, 06:46 PM   #427
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
@kundor-- Can you post an example k2pdfopt file with inf's and the source file and command that created it?
Old 05-23-2013, 06:48 PM   #428
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by jokodzuna View Post
Hi, when I use k2pdfopt to optimise a full e-book, it reports that everything was done, but when I open the file, I find that only part of the book was optimised; the rest of it is impossible to open (it says "error trying to read document" or something similar). It happens every time I use it. If I only do it for a small page range, it works fine. Does anyone know what the problem is? Thank you.
Can you post one of these "full e-books"?
Old 05-23-2013, 11:08 PM   #429
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by kundor View Post
The original file is DjVu, for which native output isn't supported. I actually tried to convert it to a PDF while preserving the hidden text, so that I could then use k2pdfopt in native mode to pretty it up for my Kindle. I used djvu2hocr (from the ocrodjvu package) to extract the text layer. In principle I should then be able to use either hocr2pdf, from ExactImage, or PDFBeads to merge it back with the images. On my MacBook, hocr2pdf produces a 1.4 MB PDF which freezes Adobe Reader and looks like gibberish in Preview.
I finally did succeed with pdfbeads, but it wasn't entirely straightforward.
Thank you for the links. I wasn't aware of these applications.

Quote:
Originally Posted by kundor View Post


On the other hand, if I reflow, k2pdfopt mangles a lot of the math formulas. I realize this might not be a high priority, but I thought I'd report it in case it would be easy or interesting to fix.
For instance, this formula:
[image: original formula]
ends up like this:
[image: the same formula after reflow]
Here's a formula with superscripts and subscripts that get unaligned:
[image: original formula]
ends up as
[image: the same formula after reflow]
There are also some issues with inline text, when math stuff overlaps a line.
For instance, the bottom of the fraction 7/2 here:
[image: original line containing 7/2]
gets cut off and ends up floating beneath the word "unit" here:
[image: reflowed output]
This particular issue (usually with a "2") happens a lot in this book, I noticed.
Please attach a couple example Djvu pages if you can. There are probably some settings adjustments that can be made.

Quote:
Originally Posted by kundor View Post


A second issue to report: while processing a PDF file produced by k2pdfopt with Ghostscript, I get hundreds of these errors:
Code:
   **** Unknown operator: 'inf'
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.
It ends up by saying:
Code:
   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> K2pdfopt v1.65 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.
So, as instructed, I'm notifying you! A little searching turned up this bug, where the Ghostscript developers say that some floating point value is being written to the PDF, but "inf" is not valid in the PDF format, even if the floating point is INF.
Again, please attach an example of the source file and command options that cause the generation of the bad PDF file if you can.
Old 05-24-2013, 03:56 PM   #430
markom
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad
Quote:
Originally Posted by kundor View Post
I'm using k2pdfopt to convert a large mathematical text. On the Tesseract download page, I noticed a file "tesseract-ocr-3.02.equ.tar.gz" which says it's a "Math / equation detection module for Tesseract 3.02." This sounds like it would help to OCR the math part correctly. The majority of the text is English. Is there some way to get the OCR engine to use this, in combination with the English training data?
Have you tried kindlepdfviewer yet? It reads DjVu and supports fit-to-document-width (or height) and fit-to-content-width (or height) in portrait and landscape, plus two-point cropping.

It also does reflow.

https://www.mobileread.com/forums/sho....php?p=2466450

You can also convert the DjVu to an image-only PDF, run k2pdfopt on it, and then use ABBYY FineReader, Acrobat, etc. to OCR the k2pdfopt output (in text-under-image mode).

OCR should take about an hour for a detailed pass, or half an hour for a quick pass, on an average book.

https://www.mobileread.com/forums/sho...&postcount=413

Last edited by markom; 05-24-2013 at 04:57 PM.
Old 05-26-2013, 02:52 PM   #431
kundor
Junior Member
Posts: 5
Karma: 5998
Join Date: Oct 2011
Device: Kindle 3
Hi Willus,
I've attached some djvu pages that contain the example formulas that got mangled (on the top of the first page, the bottom of the second, and the middle of the last page.)
When I run k2pdfopt on just this selection, the problem with "inf"s does not occur. But when I run "k2pdfopt -ocr" on the full source, it does. I will send you a PM.
Attached Files
File Type: zip Selection.djvu.zip (53.8 KB, 293 views)
Old 05-26-2013, 07:45 PM   #432
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by kundor View Post
...
A second issue to report: while processing a PDF file produced by k2pdfopt with Ghostscript, I get hundreds of these errors:
Code:
   **** Unknown operator: 'inf'
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.
@Kundor sent me a test case and I have been able to reproduce this bug, which is a problem in the algorithm where k2pdfopt selects the OCR font size (so it only happens when OCR is turned on). I will fix it in the next release.
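Until the fix is released, a rough way to check whether a given output file is affected is to scan its decoded content streams for non-numeric tokens such as "inf". The sketch below works on an already decompressed stream passed in as bytes. It skips simple (non-nested) literal strings so that ordinary text is not flagged, and it is a heuristic rather than a real PDF parser.

```python
import re

# Simple literal strings such as "(some text)"; nested parentheses are not handled.
LITERAL_STRING = re.compile(rb"\((?:\\.|[^\\()])*\)")
NON_FINITE = re.compile(rb"^[+-]?(?:inf(?:inity)?|nan)$", re.IGNORECASE)


def non_finite_tokens(stream):
    """Return tokens in a decoded content stream that look like inf or nan."""
    stripped = LITERAL_STRING.sub(b"()", stream)
    return [tok.decode("ascii") for tok in stripped.split() if NON_FINITE.match(tok)]


# Hypothetical stream with a bad font size operand.
stream = b"BT /F1 inf Tf 72 700 Td (no inf here) Tj ET"
print(non_finite_tokens(stream))
```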
Old 06-09-2013, 02:51 PM   #433
curiouscat
Enthusiast
Posts: 26
Karma: 11998
Join Date: Jun 2013
Location: UK
Device: Kindle Oasis
Hi, I have a Sony PRS-T2. I like the layout that k2pdfopt gives me by default but can't work out how to increase the font size of the output file. Also, the option to highlight my PDF seems to disappear after conversion.

Help appreciated. Thanks
Old 06-09-2013, 03:25 PM   #434
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by curiouscat View Post
Hi, I have a Sony PRS-T2. I like the layout that k2pdfopt gives me by default but can't work out how to increase the font size of the output file. Also, the option to highlight my PDF seems to disappear after conversion.

Help appreciated. Thanks
Try spending a few minutes reading the k2pdfopt FAQ page. See the second question and the second-to-last question. If you can't figure things out after that, post again.
Old 06-09-2013, 04:34 PM   #435
curiouscat
Enthusiast
Posts: 26
Karma: 11998
Join Date: Jun 2013
Location: UK
Device: Kindle Oasis
Hi Willus,

I'm probably being thick. I had already read that numerous times before I posted! I understand the principle but can't figure out the command lines. Same with the OCR: I have downloaded the English file and set up the environment variable as described in the how-to, but I get an error saying the file cannot be opened. I created a new environment variable under my user account. I didn't know what to put in the second box, so I just copied yours. After that I'm stumped.

Also, with more than two options configured, the programme crashes. I have Windows Starter. I have already tried downloading and using the less aggressive version, with the same results. Maybe it has to do with my incorrect configuration? I'm now back on the latest release.