05-13-2013, 10:25 PM   #424
willus
Fuzzball, the purple cat
Posts: 526
Karma: 2526455
Join Date: Jun 2011
Location: California
Device: Kindle 2, iPad
Quote:
Originally Posted by kundor
Thanks for answering. On reflection, I guess I'm unlikely to search for a formula, rather than text, so it doesn't really matter much. I also learned that because of Tesseract's linear design, it can't handle a lot of math notation (fractions, radicals, superscripts, subscripts, matrices, cases...) regardless of training data.

By the way, it takes about 12 hours to OCR this document, which seems kind of silly when there is already a hidden text layer. Since it includes the location data, it seems like it might be possible to keep track of which words go with each chunk while you're slicing up the pages. Have you considered doing that?
I kind of wondered why you would want to OCR equations, but I figured maybe you wanted to search for certain symbols.

I don't know enough about how MuPDF parses PDF streams to keep track of which characters are placed where--it will take some education on my part. Certainly sounds feasible--I'll add it to my wish list. Is it possible for you to use native output at all, or do you definitely need text re-flow?
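
For anyone curious what "keeping track of which words go with each chunk" might look like, here is a rough sketch in plain Python. It is not how k2pdfopt or MuPDF does anything. It assumes the hidden text layer has already been extracted into words with bounding boxes, and that each sliced chunk is a rectangle in the same page coordinates. The names `Box`, `Word`, and `assign_words` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        # Half-open on the right/bottom so a point on a shared edge
        # belongs to exactly one of two adjacent regions.
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


@dataclass(frozen=True)
class Word:
    """One word from the existing text layer plus its location (assumed input)."""
    text: str
    box: Box


def assign_words(words, regions):
    """Map each region index to the words whose center falls inside it."""
    assigned = {i: [] for i in range(len(regions))}
    for w in words:
        cx = (w.box.x0 + w.box.x1) / 2
        cy = (w.box.y0 + w.box.y1) / 2
        for i, r in enumerate(regions):
            if r.contains(cx, cy):
                assigned[i].append(w.text)
                break  # a word goes to the first region that contains it
    return assigned


# Made-up data: three words on a page that gets sliced into two chunks.
words = [
    Word("Hello", Box(10, 10, 50, 20)),
    Word("world", Box(60, 10, 100, 20)),
    Word("again", Box(10, 110, 50, 120)),
]
regions = [Box(0, 0, 200, 100), Box(0, 100, 200, 200)]
result = assign_words(words, regions)
```

Using the word's center rather than its full box means a word that straddles a cut still lands in exactly one chunk. Words that fall outside every region are simply dropped, which may or may not be what you want.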