MobileRead Forums - View Single Post - commercial software for Kindle/mobi from scanned math texts

Tex2002ans · 12-09-2013, 06:01 PM

Quote:

Originally Posted by Hitch

[...] but MathML doesn't really save you on the creation, unless you disagree?

My "Step 1" in the Tutorial could be: LibreOffice Math, LaTeX, MathML, InDesign, Inkscape, codecogs, ...

There is an assortment of ways to digitize the formulas.... but yes, as you say, it would still require all the manpower to actually convert the formulas to their digital equivalents.

The important thing I see for these books is actually having all the formulas vectorized. This is a HUGE advantage over the long-term life of the book (compared to snapshots right out of a scan). Once you get the hard part done (getting it into an actual digital form), it would be easy FROM THAT POINT FORWARD to generate the formulas in whatever font/size/format you want.

You can see at the end of my tutorial, the comparison EPUB, and you can see that the digitally created images are WAY cleaner/nicer looking than the "straight from scan" formulas. Once you get a taste of the good stuff, you will never go back!

Having the formulas vectorized will only pay off more over the long-term life of the book: as EPUB3 becomes more standard (and MathML becomes more ubiquitous), as Amazon starts accepting SVG/MathML, as the next generation of ebook formats comes along, etc. etc.

Case #1: Let us say you take snapshots of the formulas right out of the scan. Now let us say these super high resolution Amazon devices come out (see Kindle HDX). All of these low resolution images might as well go right into the trash, because they appear as unreadable postage stamps. If you wanted to fix the book, you would have to pay someone to slog through the scans all over again and take higher resolution snapshots!

Case #2: You initially suffer through all that pain of conversion to vector. Now it is just a matter of going back to my vector files and exporting all the images at higher resolution (or exporting as MathML, or as SVG, or as format XYZ). Replace all images. BAM, new higher quality book, with barely any extra labor!
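To make Case #2 concrete, here is a rough sketch of what "re-export everything at a higher resolution" could look like as a batch job. The `Formula` class, the file names, and the DPI values are all made up for illustration. The command lines use the Inkscape 1.x option names (`--export-type`, `--export-filename`, `--export-dpi`); older 0.x releases use different options, so check against whatever version is installed. The script only builds the commands, it does not run them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Formula:
    """One vector formula. All fields are hypothetical example data."""
    name: str         # file stem, e.g. "eq-001" for eq-001.svg
    width_in: float   # intended display width in inches
    height_in: float  # intended display height in inches


def pixel_size(formula: Formula, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of the raster image at the given DPI."""
    return (round(formula.width_in * dpi), round(formula.height_in * dpi))


def export_command(formula: Formula, dpi: int) -> list[str]:
    """Build (but do not run) an Inkscape 1.x command line for a PNG export."""
    return [
        "inkscape",
        f"{formula.name}.svg",
        "--export-type=png",
        f"--export-filename={formula.name}-{dpi}dpi.png",
        f"--export-dpi={dpi}",
    ]


if __name__ == "__main__":
    formulas = [Formula("eq-001", 2.0, 0.5), Formula("eq-002", 3.0, 1.0)]
    for dpi in (150, 300):  # assumed "old device" and "new device" targets
        for f in formulas:
            print(pixel_size(f, dpi), " ".join(export_command(f, dpi)))
```

The point is that the expensive step (the vector source) is done once; a new target resolution is just a new number in the loop.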

Quote:

Originally Posted by Hitch

Well, that's why I think even discussing MathML is a bit...fruitless.

Indeed.. it is one of those things where they could:

Case #1: Pay the Indian conversion company (pennies) to just do the crappy images and get hideous subpar output.

Case #2: Pay a nice chunk of change for conversion to vector formulas.

Vectorizing gives you the advantage in the present of having much cleaner images... but you won't really see the serious payoffs of vectorizing until MathML/vector support becomes more ubiquitous... and devices with higher and higher resolution come out.

I tend to think nice and long-term. Pay more up front to have it done RIGHT, and you will pay much less for maintenance in the long-run.

Vectorizing everything also gets you part of the way there if you ever want to go BACKWARDS from EPUB/HTML -> Print. This is an area I am currently researching. So if you wanted to print a new edition of the book, this would be the way to go!

I would NEVER touch an actual math book that was scanned though... I value my time/hair too highly. I am ok with pulling my hair out on the very occasional book with 10 formulas.

Quote:

Originally Posted by Hitch

(I don't get to play with MathML much, as we never, ever get source, so...)

This is what really baffles me! You are telling me your company has never received InDesign/Quark/LaTeX/WordPerfect files? I mean, these are publishers coming TO YOU for conversion... you would think they have all of the source documents.

I recently was able to convert a book that was designed in WordPerfect (I contacted the author, who contacted the publisher, who gave me the source documents). I was able to open up the .wp in LibreOffice, use my RegexFu to convert it to super clean HTML, and I was able to get the EPUB up and running in no time!
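For anyone curious what that kind of regex cleanup looks like, here is a minimal sketch. These are NOT the actual regexes used on that book; the patterns and the sample markup are assumptions about the sort of cruft a word-processor HTML export tends to contain (inline styles, `<font>`/`<span>` wrappers, empty paragraphs). Regex on HTML is fragile, so treat this as a first pass before checking by hand.

```python
import re


def clean_html(raw: str) -> str:
    """Strip common word-processor export cruft from an HTML fragment."""
    # Unwrap every <font> and <span> tag, keeping the text inside.
    html = re.sub(r"</?(?:font|span)\b[^>]*>", "", raw)
    # Drop presentational attributes from the tags that remain.
    html = re.sub(r'\s+(?:style|class|lang)="[^"]*"', "", html)
    # Remove paragraphs that are empty or hold only a <br> or &nbsp;.
    html = re.sub(r"<p>\s*(?:&nbsp;|<br\s*/?>)?\s*</p>", "", html)
    # Collapse runs of spaces/tabs into a single space.
    html = re.sub(r"[ \t]{2,}", " ", html)
    return html.strip()


if __name__ == "__main__":
    sample = (
        '<p class="western" style="margin-bottom: 0in">'
        '<font face="Times"><span lang="en-US">Hello  world</span></font>'
        "</p><p><br/></p>"
    )
    print(clean_html(sample))
```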

So if a publisher comes to you with a new book they just designed in InDesign/Quark/WordPerfect, do you ask them to export as EPUB/HTML? Or do they just hand over the PDF and you work from there?

I was HORRIFIED when I learned that many of these conversion companies accept PDF ONLY! The conversion process would be:

Method #1: InDesign -> PDF -> OCR (errors/typos introduced) -> EPUB -> clean up.

Instead of doing the sensible thing when publishing a perfectly new book:

Method #2: InDesign/Source Document -> HTML/EPUB -> clean up.

So before I came on the scene to convert EPUBs for our teeny weeny non-profit publisher... they were paying to get EPUBs full of needless typos/mistakes!!!

I imagine lots of other smaller publishers (non-profits, and hell, even small for-profits) are in the same situation! Think of all that manpower being wasted! And OCR just introduces so many needless errors! I mean, WHY WOULD YOU WASTE TIME OCRing SOMETHING WHEN YOU ALREADY HAVE THE PURELY DIGITAL SOURCE DOCUMENTS!!!!
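A back-of-the-envelope sketch of why the OCR detour in Method #1 hurts: even a very good OCR engine leaves a pile of errors across a whole book. The character count and the accuracy figure below are assumed numbers for illustration, not measurements from any real book or OCR tool.

```python
def expected_ocr_errors(char_count: int, char_accuracy: float) -> float:
    """Expected number of wrong characters for a given per-character accuracy."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    return char_count * (1.0 - char_accuracy)


if __name__ == "__main__":
    book_chars = 500_000   # assumed size of a mid-length book
    accuracy = 0.999       # assumed per-character OCR accuracy
    errors = expected_ocr_errors(book_chars, accuracy)
    print(f"Method #1 (via OCR): about {round(errors)} errors to hunt down")
    print("Method #2 (from source): no OCR step, so none of these errors")
```

Under those assumptions that is on the order of hundreds of typos that somebody has to find and fix, and Method #2 never introduces them in the first place.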

Quote:

Originally Posted by Hitch

I'd say that it's possible that some of the equations were output from MathML...but many of the tabular equations are, I believe, embedded in there as images. Perhaps those were exported from MathML as-is? Don't know. I'd be very interested, Tex, in your thoughts. But I guarantee you, this entire book cost them every penny, in manhours in admin and labor, of $20K.

I don't know, I wish I worked behind the scenes at some large publisher, so I could see how things are done at the big boys.

I rarely get to work on new books; most of my work is on old scans (or PDFs which were created in the last 20 years, but the digital source is gone/lost in the abyss). But when I do work from a purely digital source, I whip those things out within a few hours, and think how wonderful my life would be without having to work backwards from the dreaded PDF.

From what I gather, a larger publisher WOULD be designing these documents with the long term in mind, so they would put all their formulas in LaTeX or MathML or SVG or PDF or EPS or AI. Then they have a system in place to just auto-export everything.

But yeah, the amount of money/manpower spent actually typesetting/designing these books is immense. And then us converters just get the little penny scraps (as you say, for cheaper than a dinner for two).

Quote:

Originally Posted by Hitch

Yes. We've made books with 468 images, and those weren't teeny-weeny things, and they were helpfully provided by the client. Still...craploads of manhours. Craploads. And we spend forever doing various compression algorithms, etc., trying to make it "fit" in 20MB at Nook and 50 at Amazon. FOR-ever.

Heh, heh... And introducing an outsider to the workflow, ugh... who knows how they generated the images, and of course, after you are at image #30, you notice something wrong with the way they exported, and you have to go back and redo all the images! I shudder to think how that would be handled with an outsider.
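On the size limits in Hitch's quote, a quick sketch of the arithmetic: how many bytes each image gets on average if 468 images have to fit under a 20MB or 50MB cap. It assumes 1 MB = 1024 * 1024 bytes and, by default, that the images are the only thing in the file; `reserved_bytes` is a hypothetical knob for setting aside room for text, fonts, and CSS.

```python
def per_image_budget(limit_mb: int, image_count: int, reserved_bytes: int = 0) -> int:
    """Average bytes available per image under a total file-size cap."""
    available = limit_mb * 1024 * 1024 - reserved_bytes
    if image_count <= 0 or available <= 0:
        raise ValueError("need at least one image and some space left over")
    return available // image_count


if __name__ == "__main__":
    for store, limit_mb in (("Nook", 20), ("Amazon", 50)):
        budget = per_image_budget(limit_mb, 468)
        print(f"{store}: {budget} bytes per image on average")
```

With 468 images that works out to a little under 45,000 bytes each at 20MB, which is why the compression fiddling eats so much time.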

I am reminded of a long-running project I have going on with a Quarterly Journal we publish. I have been asking for all of the source files every quarter (so they don't get lost down the memory hole; I want to save any future selves from having to suffer through working backwards from a PDF!).

I have all of the tables/charts/graphs as .ai ... but Inkscape doesn't import/export them properly (the kerning in some of the fonts is off; the charts aren't bad, but it is especially noticeable in the Tables... I think it might be as simple as a font issue).

I don't have access to Illustrator, so I asked for some help from an amazing MR user... and he was kind enough to spend his time to export them for me! I was so ecstatic, and then after I took a closer look, I noticed that all of the table captions had the text included in the image.... I felt absolutely HORRIBLE.