Originally Posted by kundor
Thanks for answering. On reflection, I guess I'm unlikely to search for a formula, rather than text, so it doesn't really matter much. I also learned that because of Tesseract's linear design, it can't handle a lot of math notation (fractions, radicals, superscripts, subscripts, matrices, cases...) regardless of training data.
By the way, it takes about 12 hours to OCR this document, which seems kind of silly when there is already a hidden text layer. Since it includes the location data, it seems like it might be possible to keep track of which words go with each chunk while you're slicing up the pages. Have you considered doing that?
I kind of wondered why you would want to OCR equations, but I figured maybe you wanted to search for certain symbols.
I don't know enough about how MuPDF parses PDF streams to keep track of which characters are placed where--it will take some education on my part. Certainly sounds feasible--I'll add it to my wish list. Is it possible for you to use native output at all, or do you definitely need text re-flow?
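For what it's worth, here is a rough sketch of the bookkeeping side of that idea, just to make it concrete. It assumes the hidden text layer has already been read into a list of words with page coordinates; the tuple layout and the name words_in_chunk are made up for illustration and are not MuPDF's API. Each word is assigned to whichever sliced-out chunk contains the center of its box, and words keep the order they had in the text layer.

```python
from typing import List, Tuple

# Hypothetical layouts, assumed for this sketch only:
#   Box  = (x0, y0, x1, y1) in page coordinates
#   Word = (x0, y0, x1, y1, text) taken from the existing text layer
Box = Tuple[float, float, float, float]
Word = Tuple[float, float, float, float, str]


def words_in_chunk(words: List[Word], chunk: Box) -> List[str]:
    """Return the text of every word whose center lies inside the chunk."""
    cx0, cy0, cx1, cy1 = chunk
    picked = []
    for x0, y0, x1, y1, text in words:
        # Use the box center so a word straddling a cut goes to one chunk only.
        mid_x = (x0 + x1) / 2
        mid_y = (y0 + y1) / 2
        if cx0 <= mid_x < cx1 and cy0 <= mid_y < cy1:
            picked.append(text)
    return picked  # input order is preserved


# Made-up sample data: two words on one line, one word further down the page.
page_words = [
    (10, 10, 40, 20, "Hello"),
    (50, 10, 90, 20, "world"),
    (10, 110, 60, 120, "second"),
]
top_chunk = words_in_chunk(page_words, (0, 0, 100, 50))
bottom_chunk = words_in_chunk(page_words, (0, 100, 100, 150))
```

The hard part is still getting the word boxes out of the PDF in the first place; once they exist, carrying them along with each chunk looks like the easy bit.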