commercial software for Kindle/mobi from scanned math texts

xristy · 12-05-2013, 04:15 AM

Hello,

There are a number of Kindle math books - from Dover, for example, that appear to have been converted from scans to Kindle/mobi.

Usually the display equations are images, and inline math content is sometimes converted to some font and sometimes appears as images.

I am wondering what sort of commercial software is used to produce such eBooks. In other words, how do companies that produce eBooks for a living generate these sorts of eBooks from "legacy" texts?

There are not so many ePubs - and I haven't yet seen any Dover eBooks other than Kindle from Amazon.

Thanks

Hitch · 12-09-2013, 02:42 AM

Quote:

Originally Posted by xristy

Hello,

There are a number of Kindle math books - from Dover, for example, that appear to have been converted from scans to Kindle/mobi.

Usually the display equations are images, and inline math content is sometimes converted to some font and sometimes appears as images.

I am wondering what sort of commercial software is used to produce such eBooks. In other words, how do companies that produce eBooks for a living generate these sorts of eBooks from "legacy" texts?

There are not so many ePubs - and I haven't yet seen any Dover eBooks other than Kindle from Amazon.

Thanks

Well, I hate to be the bearer of bad tidings,

but there's no such thing as "commercial software" that is used to produce such ebooks. What produces such ebooks is either scanning (from print) or OCR (from PDFs) or both (print), and/or human labor. When we make books like that, we have to take screenshots of every single equation and create it as an image. Embedding the font is useless, for Amazon, because none of the millions of K7-Kindles out there will render it. Therefore, all mathematical equations, to work on any Amazon device, have to be images.

That isn't just true of Amazon, either; the big Algebra book that everybody goes on about, that Apple points to as the darling golden-haired child of iAuthor? Every single mathematical formula in there is an image. Every single one. Book must have cost upwards of $20K to make, I'd guess, in (offshore) manhours alone.

Most of these books are shopped to India, to take advantage of the cheap labor resources there, because these types of books are so labor-intensive (all the screenshots). There are some programs that make it faster to do, but it still all has to be done by hand, basically. (Making decisions about which screenshots, what size each is, making them consistent so that they size the same, etc.)

I don't know of any other way to make them. Some of the guys here will speak-a da Latex, but at the end of the day, once you get to ePUB or MOBI, the only reliable way to display mathematical symbols/equations, etc., is by using images. Still.

Hitch

Toxaris · 12-09-2013, 04:12 AM

You can also create equations in SVG, but I am not aware if that is supported on the Kindle versions.

Tex2002ans · 12-09-2013, 01:54 PM

Quote:

Originally Posted by xristy

There are a number of Kindle math books - from Dover, for example, that appear to have been converted from scans to Kindle/mobi.

Mind linking to an example?

Quote:

Originally Posted by xristy

I am wondering what sort of commercial software is used to produce such eBooks. In other words, how do companies that produce eBooks for a living generate these sorts of eBooks from "legacy" texts?

As was stated, most of this is made by hand. If it is built from an actual SCAN, the only way you can do it is extremely labor intensive snapshots (as Hitch mentioned).

I showed how I recreate A FEW (and I stress A FEW) formulas in books:

https://www.mobileread.com/forums/sho...d.php?t=223254

In that thread, Toxaris also pointed to this site which is not as customizable as my method, but is a lot less labor intensive:

https://www.mobileread.com/forums/sho...99&postcount=5

If the conversion company is deriving the book from an actual SOURCE document, the conversion would be infinitely easier.... but still a complete pain.

The current state of the EPUB/MOBI market is not designed well for extremely complex mathematical books.

Quote:

Originally Posted by xristy

There are not so many ePubs - and I haven't yet seen any Dover eBooks other than Kindle from Amazon.

EPUB would be able to handle this more easily than Amazon. You would at least be able to use SVG or MathML... but again, if you are working from SCANS, you have to manually recreate all those formulas.

MathML is not supported on many devices at all, and only a few readers can use it (See MathJax: http://www.mathjax.org/resources/epub-readers/) and good luck trying to sell that anywhere besides your own dedicated store.

Quote:

Originally Posted by Toxaris

You can also create equations in SVG, but I am not aware if that is supported on the Kindle versions.

Older Kindles can't handle SVG, and according to the Kindle Guidelines when you use SVG, you need to include an image fallback. See Section 8.4.2:

kindlegen.s3.amazonaws.com/AmazonKindlePublishingGuidelines.pdf

I think that sort of defeats a lot of the purpose, since you would have to double-up on every single formula in the book (SVG + PNG/JPG). With such an image heavy book, I think this would start cutting into those distribution fees that Amazon/other stores charge for really large books.

Quote:

Originally Posted by Hitch

That isn't just true of Amazon, either; the big Algebra book that everybody goes on about, that Apple points to as the darling golden-haired child of iAuthor? Every single mathematical formula in there is an image. Every single one. Book must have cost upwards of $20K to make, I'd guess, in (offshore) manhours alone.

Mind posting a link to this sample?

I haven't messed around with such a complex book before, but I believe if you are a conversion house, and get the ACTUAL source documents from the publisher (let's say it was written in LaTeX), it would be a pain to transfer to EPUB, but nowhere near as horrendous as working backwards from a scan!

If the company actually wanted you to recreate a math book backwards from a scan... may the gods help you!!! It takes me forever just to handle a book with maybe 20 formulas... handling one with HUNDREDS?

Hitch · 12-09-2013, 03:24 PM

Quote:

Originally Posted by Tex2002ans

As was stated, most of this is made by hand. If it is built from an actual SCAN, the only way you can do it is extremely labor intensive snapshots (as Hitch mentioned).

Yes.

Quote:

I showed how I recreate A FEW (and I stress A FEW) formulas in books:

https://www.mobileread.com/forums/sho...d.php?t=223254

In that thread, Toxaris also pointed to this site which is not as customizable as my method, but is a lot less labor intensive:

https://www.mobileread.com/forums/sho...99&postcount=5

That thread alone will be enough to put anybody off. I love ya, BUT...

Quote:

If the conversion company is deriving the book from an actual SOURCE document, the conversion would be infinitely easier.... but still a complete pain.

The current state of the EPUB/MOBI market is not designed well for extremely complex mathematical books.

EPUB would be able to handle this more easily than Amazon. You would at least be able to use SVG or MathML... but again, if you are working from SCANS, you have to manually recreate all those formulas.

MathML is not supported on many devices at all, and only a few readers can use it (See MathJax: http://www.mathjax.org/resources/epub-readers/) and good luck trying to sell that anywhere besides your own dedicated store.

Well, that's why I think even discussing MathML is a bit...fruitless. I am not criticizing; what you are saying is true. If the source material is true source, great, but MathML doesn't really save you on the creation, unless you disagree? (I don't get to play with MathML much, as we never, ever get source, so...)

Quote:

Older Kindles can't handle SVG, and according to the Kindle Guidelines when you use SVG, you need to include an image fallback. See Section 8.4.2:

kindlegen.s3.amazonaws.com/AmazonKindlePublishingGuidelines.pdf

I think that sort of defeats a lot of the purpose, since you would have to double-up on every single formula in the book (SVG + PNG/JPG). With such an image heavy book, I think this would start cutting into those distribution fees that Amazon/other stores charge for really large books.

Exactly. It's useless for this type of application, and what's the point? It's silliness, unless, like K7-K8, they can choose to 'serve' the correct version only and not charge the publisher, but I can't see how that can occur. Unless they only accrue the SVG to the K8 document. Nah. I don't believe that.

Quote:

Mind posting a link to this sample?

Hmmm...lemme see. I don't have a link, TBH, because I'm not a big iBooks user, but I have the sample on my iPad...it's "McGraw Hill's Algebra 1, Aligned to the Core" book on iBooks. I mean, listen, don't get me wrong, for a textbook, it's fabulous, but, I took that and...well, let's say, I examined it. And I played with iAuthor, trying to cogitate upon how they made it. I had a request to make a similar book--for Amazon. Suffice to say, the reply I gave the guy didn't make him happy, as I had to explain what had to be done.

I'd say that it's possible that some of the equations were output from MathML...but many of the tabular equations are, I believe, embedded in there as images. Perhaps those were exported from MathML as-is? Don't know. I'd be very interested, Tex, in your thoughts. But I guarantee you, this entire book cost them every penny, in manhours in admin and labor, of $20K.

Quote:

I haven't messed around with such a complex book before, but I believe if you are a conversion house, and get the ACTUAL source documents from the publisher (let's say it was written in LaTeX), it would be a pain to transfer to EPUB, but nowhere near as horrendous as working backwards from a scan!

Of course. Not fun in either direction.

Quote:

If the company actually wanted you to recreate a math book backwards from a scan... may the gods help you!!! It takes me forever just to handle a book with maybe 20 formulas... handling one with HUNDREDS?

Yes. We've made books with 468 images, and those weren't teeny-weeny things, and they were helpfully provided by the client. Still...craploads of manhours. Craploads. And we spend forever doing various compression algorithms, etc., trying to make it "fit" in 20MB at Nook and 50 at Amazon. FOR-ever.
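A rough sketch of the size check described above, in Python. The 20 MB and 50 MB limits are the ones quoted in the post; the function name, the 60 KB per-image size, and the choice to count images only are assumptions made for illustration.

```python
# Store limits quoted in the post above, in megabytes.
STORE_LIMITS_MB = {"Nook": 20, "Amazon": 50}

def check_budget(image_sizes_kb, limits_mb=STORE_LIMITS_MB):
    """Return {store: (fits, overage_mb)} for a list of image sizes in KB.

    Only the image payload is counted; text, CSS and fonts are ignored,
    which is a simplifying assumption of this sketch.
    """
    total_mb = sum(image_sizes_kb) / 1024
    report = {}
    for store, limit in limits_mb.items():
        over = max(0.0, total_mb - limit)
        report[store] = (total_mb <= limit, round(over, 2))
    return report

if __name__ == "__main__":
    # Hypothetical book: 468 images at 60 KB each.
    sizes = [60] * 468
    for store, (fits, over) in check_budget(sizes).items():
        print(store, "fits" if fits else f"over by {over} MB")
```

Running a check like this after each compression pass shows how far the book still is from each store's ceiling before anything gets uploaded.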

Hitch

Tex2002ans · 12-09-2013, 06:01 PM

Quote:

Originally Posted by Hitch

[...] but MathML doesn't really save you on the creation, unless you disagree?

My "Step 1" in the Tutorial could be: LibreOffice Math, LaTeX, MathML, InDesign, Inkscape, codecogs, ...

There is an assortment of ways to digitize the formulas.... but yes, as you say, it would still require all the manpower to actually convert the formulas to their digital equivalents.

The important thing I see for these books is actually having all the formulas vectorized. This is a HUGE advantage in the long-term of the book (instead of snapshots right out of a scan). Once you get the hard part done (getting it into an actual digital form), it would be easy FROM THAT POINT FORWARD to generate the formulas in whatever font/size/format you want.

You can see at the end of my tutorial, the comparison EPUB, and you can see that the digitally created images are WAY cleaner/nicer looking than the "straight from scan" formulas. Once you get a taste of the good stuff, you will never go back!

Having the formulas vectorized will only get better for the long-term of the book. When EPUB3 becomes more standard (MathML becomes more ubiquitous), Amazon starts accepting SVG/MathML, the next generation ebook formats come along, etc. etc.

Case #1: Let us say you take snapshots of the formulas right out of the scan. Now let us say these super high resolution Amazon devices come out (see Kindle HDX). All of these low resolution images might as well go right into the trash, because they appear as unreadable postage stamps. If you then wanted to fix the book, you would have to pay to have someone slog through the scans all over again, and take higher resolution snapshots!

Case #2: You initially suffer through all that pain of conversion to vector. Now you just go back to the vector files and export all the images at higher resolution (or export as MathML, or export as SVG, or export as format XYZ). Replace all images. BAM, new higher quality book, with barely any extra labor!
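The "postage stamp" effect in Case #1 is plain arithmetic, which a short Python sketch can make concrete. The 300-pixel snapshot width and the two screen densities (167 and 323 pixels per inch) are assumed values for illustration, and the sketch assumes the reading system shows images at one image pixel per screen pixel, which not every device does.

```python
import math

def physical_width_in(pixels, ppi):
    """Width in inches of an image shown 1:1 on a screen of the given density."""
    return pixels / ppi

def pixels_needed(width_in, ppi):
    """Pixels required to cover width_in inches at the given density."""
    return math.ceil(width_in * ppi)

if __name__ == "__main__":
    snapshot_px = 300             # hypothetical width of a formula cut from a scan
    old_ppi, new_ppi = 167, 323   # assumed screen densities, for illustration only

    # The same bitmap shrinks on the denser screen.
    print(round(physical_width_in(snapshot_px, old_ppi), 2), "in on the older screen")
    print(round(physical_width_in(snapshot_px, new_ppi), 2), "in on the newer screen")

    # Pixels a re-export needs to keep the original physical width.
    target_in = physical_width_in(snapshot_px, old_ppi)
    print(pixels_needed(target_in, new_ppi), "px needed on the newer screen")
```

With a vector master, that last number is just an export setting. With a snapshot from a scan, the extra pixels do not exist and somebody has to go back to the page.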

Quote:

Originally Posted by Hitch

Well, that's why I think even discussing MathML is a bit...fruitless.

Indeed.. it is one of those things where they could:

Case #1: Pay the Indian conversion company (pennies) to just do the crappy images and get hideous subpar output.

Case #2: Pay a nice chunk of change for conversion to vector formulas.

Vectorizing gives you the advantage in the present of having much cleaner images... but you won't really see the serious payoffs of vectorizing until MathML/vector support becomes more ubiquitous... and devices that come out that are higher and higher resolution.

I tend to think nice and long-term. Pay more up front to have it done RIGHT, and you will pay much less for maintenance in the long-run.

Vectorizing everything also gets you part of the way there if you ever want to go BACKWARDS from EPUB/HTML -> Print. This is an area where I am currently researching. So if you wanted to print a new edition of the book, this would be the way to go!

I would NEVER touch an actual math book that was scanned though... I value my time/hair too highly. I am ok with pulling my hair out on the very occasional book with 10 formulas.

Quote:

Originally Posted by Hitch

(I don't get to play with MathML much, as we never, ever get source, so...)

This is what really baffles me! You are telling me your company has never received InDesign/Quark/LaTeX/WordPerfect files? I mean, these are publishers coming TO YOU for conversion... you would think they have all of the source documents.

I recently was able to convert a book that was designed in WordPerfect (I contacted the author, who contacted the publisher, who gave me the source documents). I was able to open up the .wp in LibreOffice, use my RegexFu to convert it to super clean HTML, and I was able to get the EPUB up and running in no time!
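For a flavor of what regex cleanup of exported HTML can look like, here is a minimal Python sketch. These are not the actual expressions used on that book; the patterns and the sample markup are assumptions, picked because word-processor exports tend to bury the text in inline styles, spans and empty paragraphs.

```python
import re

def clean_html(html):
    """Strip common word-processor cruft from exported HTML."""
    # Drop inline style/class/lang attributes that exporters attach everywhere.
    html = re.sub(r'\s+(?:style|class|lang)="[^"]*"', "", html)
    # Unwrap <span> and <font> tags, keeping their text.
    html = re.sub(r"</?(?:span|font)\b[^>]*>", "", html)
    # Remove paragraphs that contain only whitespace or &nbsp;.
    html = re.sub(r"<p>(?:\s|&nbsp;)*</p>\s*", "", html)
    # Collapse runs of spaces and tabs.
    html = re.sub(r"[ \t]{2,}", " ", html)
    return html.strip()

if __name__ == "__main__":
    raw = ('<p class="western" style="margin:0"><span lang="en-US">Hello  '
           '<font face="Times">world</font></span></p>\n'
           '<p class="western">&nbsp;</p>')
    print(clean_html(raw))
```

A real cleanup would keep the classes that carry meaning (block quotes, headings, italics) and drop only the noise, so the first pattern in particular is far too blunt to use unmodified.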

So if a publisher comes to you with a new book they just designed in InDesign/Quark/WordPerfect, do you request them to export as EPUB/HTML? Or just hand over as the PDF and you work from there?

I was HORRIFIED when I learned that many of these conversion companies accept PDF ONLY! The conversion process would be:

Method #1: InDesign -> PDF -> OCR (errors/typos introduced) -> EPUB -> clean up.

Instead of doing the sensible thing when publishing a perfectly new book:

Method #2: InDesign/Source Document -> HTML/EPUB -> clean up.

So before I came on the scene to convert EPUBs for our teeny weeny non-profit publishing... instead, they were paying to get EPUBs full of needless typos/mistakes!!!

I imagine lots of other smaller publishers (non-profits, and hell, even small for-profits) are in the same situation! Think of all that manpower being wasted! And OCR just introduces so many needless errors! I mean, WHY WOULD YOU WASTE TIME OCRing SOMETHING WHEN YOU ALREADY HAVE THE PURELY DIGITAL SOURCE DOCUMENTS!!!!

Quote:

Originally Posted by Hitch

I'd say that it's possible that some of the equations were output from MathML...but many of the tabular equations are, I believe, embedded in there as images. Perhaps those were exported from MathML as-is? Don't know. I'd be very interested, Tex, in your thoughts. But I guarantee you, this entire book cost them every penny, in manhours in admin and labor, of $20K.

I don't know, I wish I worked behind the scenes at some large publisher, so I could see how things are done at the big boys.

I rarely get to work on new books, most of my work is working on old scans (or PDFs which were created in the last 20 years, but the digital source is gone/lost in the abyss). But when I do work from a purely digital source, I whip those things out within a few hours, and think how wonderful my life would be without having to work backwards from the dreaded PDF.

From what I gather, a larger publisher WOULD be designing these documents with the long term in mind, so they would put all their formulas in LaTeX or MathML or SVG or PDF or EPS or AI. Then they have a system in place to just auto-export everything.

But yeah, the amount of money/manpower spent actually typesetting/designing these books is immense. And then us converters just get the little penny scraps (as you say, for cheaper than a dinner for two).

Quote:

Originally Posted by Hitch

Yes. We've made books with 468 images, and those weren't teeny-weeny things, and they were helpfully provided by the client. Still...craploads of manhours. Craploads. And we spend forever doing various compression algorithms, etc., trying to make it "fit" in 20MB at Nook and 50 at Amazon. FOR-ever.

Heh, heh... And introducing an outsider to the workflow, ugh... who knows how they generated the images, and of course, after you are at image #30, you notice something wrong with the way they exported, and you have to go back and redo all the images! I shudder to think how that would be handled with an outsider.

I am reminded about a long-running project I have going on with a Quarterly Journal we publish. I have been asking for all of the source files every quarter (so they don't get lost down the memory hole, I want to save any future selves from having to suffer through working backwards from a PDF!).

I have all of the tables/charts/graphs as .ai ... but Inkscape doesn't import/export them properly (the kerning in some of the fonts is off, the charts aren't bad, but this is especially noticeable in the Tables... think it might just be as easy as a font situation).

I don't have access to Illustrator, so I asked for some help from an amazing MR user... and he was kind enough to spend his time to export them for me! I was so ecstatic, and then after I took a closer look, I noticed that all of the table captions had the text included in the image.... I felt absolutely HORRIBLE.

[Attached images: Keeler_Table1.png (46.8 KB); [Captioned]Keeler_Table1.png (86.6 KB)]

xristy · 12-10-2013, 05:26 AM

Quote:

Originally Posted by Tex2002ans

Mind linking to an example?

Kline, Morris; Calculus: An Intuitive and Physical Approach

xristy · 12-10-2013, 06:13 AM

I suppose it's nice that some people in India get some work out of laboriously generating these ePub and mobi files.

I actually didn't realize how the "professionals" were creating these eBooks.

My concern is that the market is being flooded with garbage. It's really a shame that PDF as a format takes such a bad rap for eBooks when in fact it is actually far superior for technical materials (I realize that ePub / mobi are great for mass market fiction and lots of non-fiction which I'm sure is like 95% of the sales.) PDFs work great on an iPad with, for example, Goodreader and I expect there are excellent PDF readers on Android tablets as well. There's certainly no problem with PDF access on a PC.

I appreciate that there are limitations to some of the Kindle and other eBook readers but it is really unfortunate to destroy the typesetting of technical books for the sake of annoying the authors and the users. There are numerous situations in which Amazon pulls eBooks owing to shoddy preparation and there are plenty of negative reviews of Kindle versions.

I only use PDFs which limits availability since there is a mind set to dump mobi / ePubs that are poorly prepared and that apparently really can't use appropriate technologies such as SVG or MathML.

Creating good quality OCR'd PDFs via scanning of books for which source is not available is tractable for me with Acrobat X so I assume there are better tools available to commercial outfits.

Is there no way to promote the availability of PDFs for technical materials? Aren't PDFs routinely created as part of the workflow for physical printing? Are the sales of poorly formatted mobi / ePub technical books really sufficient to warrant the effort expended in producing them? Wouldn't it be cheaper by far to forego producing mobi / ePub versions and just offer much lower cost PDFs for sale?

Toxaris · 12-10-2013, 06:20 AM

The major gripe with PDF is that it is not suitable for e-books due to the fixed pages. It is an intermediate format for printing in my opinion. You cannot search (normally), change the font-size and so on. That makes it not so suitable for a lot of manuals/guides/technical documents.

Tex2002ans · 12-10-2013, 10:29 AM

Quote:

Originally Posted by xristy

I suppose it's nice that some people in India get some work out of laboriously generating these ePub and mobi files.

Sure, sure, but I see crap like the InDesign -> PDF -> OCR -> EPUB as a TON of wasted manpower. Makes me want to pull my hair out! (As a programmer, I like to remove cruft).

Quote:

Originally Posted by xristy

My concern is that the market is being flooded with garbage.

Part of the reason why I hopped in on all these conversions. I was sick and tired of the typos/crappy spaghetti code.

And the problem will only be worse when the next book formats come out, a lot of the previous spaghetti code will just be ASKING for trouble (and there will be so many people paying to have their ebooks REDONE).

Like that image problem, the more that the Kindle HDX and higher resolution devices spread, the more people will start complaining about tiny thumbnail images.

Quote:

Originally Posted by xristy

It's really a shame that PDF as a format takes such a bad rap for eBooks when in fact it is actually far superior for technical materials (I realize that ePub / mobi are great for mass market fiction and lots of non-fiction which I'm sure is like 95% of the sales.) PDFs work great on an iPad with, for example, Goodreader and I expect there are excellent PDF readers on Android tablets as well. There's certainly no problem with PDF access on a PC.

Toxaris hits the nail on the head. PDF is built for PRINT. It was designed for a certain page size, and it was built to be a completely fixed format and LOOK THE SAME NO MATTER THE DEVICE. This means that the device you want to read on "MUST BE THIS LARGE". So a PDF designed for a book-sized page, read on a smaller device (like an ereader or a phone), gives a horrible experience.

Since your medium is FIXED (you know the page size, you know the margins, you know the font-size, you know the fonts being used, ...), you can get away with doing more complex typography. But once you start changing page sizes/changing things around, you will have to redo a lot of those typographical tweaks.

PDFs are also horrible for vision impaired users (Large Print Edition of books, IF the company decides to make them (which most don't)), and pretty poor for readers who are blind.

eBooks are a lot like HTML, it works on any size screen, reflows, any font-size, any font, any colors, any margins, etc. etc. It is about how the READER is most comfortable reading.

There is the whole "fixed-format EPUB" thing (which will get you closer to what you want, like with PDF), but they are a GIANT pain in the butt to make, and are really just geared towards ONE DEVICE only. So have fun moving that Kindle Fire fixed-format book to another Kindle. (Also, fixed-format EPUBs/MOBIs are very hard or impossible to get in the stores.).

Quote:

Originally Posted by xristy

I only use PDFs which limits availability since there is a mind set to dump mobi / ePubs that are poorly prepared and that apparently really can't use appropriate technologies such as SVG or MathML.

Indeed indeed... you need to pay for quality conversion, and many of these publishers/authors don't. They go with the cheapest.

And avoiding SVG/MathML for now is probably a good idea.... you would be cutting the potential readership down to a few % of the readers.

Good for future-proofing though (although knowing these damn math-books, they just come out with worthless new editions every year).

Quote:

Originally Posted by xristy

Creating good quality OCR'd PDFs via scanning of books for which source is not available is tractable for me with Acrobat X so I assume there are better tools available to commercial outfits.

I posted an outline of my PDF -> OCR -> EPUB method a while back:

https://www.mobileread.com/forums/sho...9&postcount=10

I use Finereader (one of the most accurate OCR programs), and this is pretty much what you have to do.

It requires just a lot of human labor to fix it up. For example, you can go take a look at archive.org and see how books look "just through OCR" (I laugh when people say "hey look, there is an EPUB of this book already on archive.org!").

Toxaris has his Word Tools which also might help speed up a lot of the book conversion (if you use Microsoft Word). But pretty much all of us just have to OCR -> many hours of human eyes/formatting -> high quality output.
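The "human eyes" step cannot be automated away, but a script can point those eyes at the likeliest trouble spots. A minimal Python sketch, assuming a handful of heuristic patterns that are illustrative only (they are not taken from Finereader, Word Tools, or any other tool mentioned here):

```python
import re

# Patterns that often indicate OCR damage; a heuristic list assumed for this sketch.
SUSPICIOUS = [
    re.compile(r"[A-Za-z]\d|\d[A-Za-z]"),    # digit fused into a word: "c1ear"
    re.compile(r"[a-z][A-Z]"),               # stray capital mid-word: "tHe"
    re.compile(r"""[^\w\s.,;:!?'"()-]"""),   # odd symbol inside a token: "wor|d"
]

def flag_tokens(text):
    """Return (line_number, token) pairs a human proofreader should look at."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            if any(p.search(token) for p in SUSPICIOUS):
                hits.append((lineno, token))
    return hits

if __name__ == "__main__":
    sample = "The c1ear result\nis tHe same wor|d over."
    for lineno, token in flag_tokens(sample):
        print(f"line {lineno}: {token}")
```

Expect false positives ("3rd", "McGraw", anything with typographic quotes), and in a math book nearly every formula will trip the first pattern. The list only narrows down where to look; it does not fix anything.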

As to the conversion market:

You have volunteers doing this stuff on MobileRead/Project Gutenberg/elsewhere
- Mostly just tackling Public Domain works
You get people whose job it is to convert (like me)
- In my case, mostly Public Domain, but even all of our new books are CC3.0 (as close to Public Domain as possible)
- Luckily, I was able to convince them with reason/quality... A/B crap/amazing comparisons are fantastic for this.
You have companies who get paid to convert quality
- Like Hitch's company, who is a VERY minor portion of the conversion market
Companies who get paid to convert crap
- A much larger portion of the conversion market
Large/Small publishers
- (Some of whom go through the crap converters, or have in-house converters)
- In many of these cases though, they just have their typographer export directly from InDesign, and of course, this output isn't the greatest
  - They probably took an InDesign course on "how to export to 'EPUB' (iBooks)!!!" So everything that is non-iBooks gets garbage.
  - A lot of these old-timey typographers are probably so used to the "print world" too, and not used to HTML/CSS/reflowability.
Then you have the flood of self-published stuff
- (Run through Calibre, SmashWords, or some other automatic conversion), which probably dwarfs all the rest combined.

Quote:

Originally Posted by xristy

Is there no way to promote the availability of PDFs for technical materials? Aren't PDFs routinely created as part of the workflow for physical printing?

Nowhere to really sell the PDFs though besides your own site. I believe PDFs for technical documents can be sold through something like Nook Study:

http://www.barnesandnoble.com/nookstudy/index.asp

But that is probably only open to huge publishers.

So for now, you get physical print + EPUB (every other book store) + MOBI (Amazon). Everything else is a free-for-all on individual sites.

Quote:

Originally Posted by xristy

Are the sales of poorly formatted mobi / ePub technical books really sufficient to warrant the effort expended in producing them?

Yes, the sales are massive (and if they pay those Indian conversion companies, hmmm, let's say it can be anywhere from $.50-$5 per page). This is WAY less than what they are paying a typographer/editor/everyone else to actually typeset the book.

As Hitch said, the converters get PENNIES compared to the total cost of producing the book.

So they have sunk in tens of thousands of dollars in a book, they can sink in a few hundred dollars for some crappy conversion, and get a HUGE boost in sales for the Kindle Editions.
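To put numbers on that, a tiny Python sketch using the $.50-$5 per-page guess from above. The 400-page book is a made-up example, and the per-page rates are the rough guess from this post, not quoted prices.

```python
def conversion_cost_range(pages, low_per_page=0.50, high_per_page=5.00):
    """Return (low, high) conversion cost in dollars for a book of `pages` pages."""
    return pages * low_per_page, pages * high_per_page

if __name__ == "__main__":
    low, high = conversion_cost_range(400)   # hypothetical 400-page textbook
    print(f"${low:,.0f} to ${high:,.0f}")
```

Even the top of that range is small next to the tens of thousands already sunk into typesetting the print edition.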

Quote:

Originally Posted by xristy

Wouldn't it be cheaper by far to forego producing mobi / ePub versions and just offer much lower cost PDFs for sale?

People want to read these books on their multitude of different devices, and not be limited to PC/large tablets.

I assume most of these larger publishers also do sell the digital PDFs on their own stores (for example, all of these complex Math/Engineering books probably do sell PDFs right on the publisher's site).

But again, this is talking about the market for NEW books.

This is completely ignoring the entire hideous market of Scanned+Reprinted books. Those are another horrible beast altogether!

And scanned/reprinted MATH BOOKS... the horrors!

xristy · 12-10-2013, 11:04 AM

Quote:

Originally Posted by Toxaris

The major gripe with PDF is that it is not suitable for e-books due to the fixed pages. It is an intermediate format for printing in my opinion. You cannot search (normally), change the font-size and so on. That makes it not so suitable for a lot of manuals/guides/technical documents.

Quote:

Originally Posted by Tex2002ans

Toxaris hits the nail on the head. PDF is built for PRINT. It was designed for a certain page size, and it was built to be a completely fixed format and LOOK THE SAME NO MATTER THE DEVICE. This means that the device you want to read on "MUST BE THIS LARGE". So a PDF designed for a book-sized page, read on a smaller device (like an ereader or a phone), gives a horrible experience.

Since your medium is FIXED (you know the page size, you know the margins, you know the font-size, you know the fonts being used, ...), you can get away with doing more complex typography. But once you start changing page sizes/changing things around, you will have to redo a lot of those typographical tweaks.

PDFs are also horrible for vision impaired users (Large Print Edition of books, IF the company decides to make them (which most don't)), and pretty poor for readers who are blind.

Apparently neither of you have used a good PDF reader app on a tablet such as the iPad. You can certainly search normally, you can resize, you can reframe and so on.

In fact, one of the problems with the use of images for display equations, and in some cases inline equations, is that they don't resize when the font is resized in a Kindle or ePub. Further, the images have a white background, which works poorly when switching to an inverted or sepia background in a Kindle reader. So one of the major features of mobi / ePub, user-controlled reflowable text, is defeated by these very practices of converting materials using images for mathematical content.

My comments are concerning mathematical content and the non-use and non-support of MathML or similar. ePub with MathML on a supporting reader is quite reasonable but that's not going to happen with legacy books for which the most effective solution from the point of view of the user, is properly OCR'd PDF from scans.

I think this knee jerk response that PDF is for print and can only give a poor user experience on a tablet is indicative of lack of experience with tablets the size of the iPad and proper reader software.

Of course on a phone or a phablet - these are too small; however, they are also worthless for reading junk ePub or mobi formatted mathematical content which was my original topic.

There are certainly quite a number of authors who do their own typesetting in LaTeX and such and are quite disturbed at the loss of fidelity that occurs with current approaches to creating ePub and mobi versions from material that is laboriously typeset by the authors not the publishers.

I understand that publishers deal with typesetting text books for public schools and lower division college / university textbooks but most of the technical material from publishers such as Springer and Elsevier and their imprints is typeset by the authors.

Typesetting of technical material is part of the communication act and dumbing down the typesetting degrades the content.

Tex2002ans · 12-10-2013, 12:20 PM

Quote:

Originally Posted by xristy

Apparently neither of you has used a good PDF reader app on a tablet such as the iPad. You can certainly search normally, resize, reframe and so on.

But search in PDF is completely dependent on the text backend.

In the case of scanned text, this is typically OCRed automatically, and is extremely unreliable (see Archive.org).
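To make the point about the text backend concrete, here is a minimal Python sketch. The sample OCR string and the helper names `exact_find` and `fuzzy_find` are made up for illustration; they are not taken from any real PDF reader. It shows how an ordinary exact search misses a word that the OCR pass has misread ("rn" for "m" is a classic confusion), while an approximate match from the standard library's `difflib` can still locate it.

```python
import difflib


def exact_find(query, text_layer):
    """Return True if the query appears verbatim in the OCR text layer."""
    return query.lower() in text_layer.lower()


def fuzzy_find(query, text_layer, cutoff=0.7):
    """Return words from the text layer that approximately match the query."""
    words = text_layer.lower().split()
    return difflib.get_close_matches(query.lower(), words, n=3, cutoff=cutoff)


# Hypothetical OCR output: "Riemann" was misread as "Riernann".
ocr_text = "The Riernann integral of a bounded function"

print(exact_find("Riemann", ocr_text))   # the plain search misses the word
print(fuzzy_find("Riemann", ocr_text))   # the approximate search still finds it
```

Most PDF readers only do the first kind of search, so the quality of the hidden text layer decides whether a search hit is found at all.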

Also, if the PDF export itself was not done properly (for example, "tagged"), then the text is ALSO unreliable. (Hopefully all purely digital works are being exported properly, but not everyone knows every little inch of functionality in the tools they are using).

Or take the multitude of different PDF tools/converters/editors out there: who knows what is happening to the backend of the PDF? Even when editing something very minor in the PDF (let us say metadata), usually the entire file has to be rewritten.

Or let us say this is written in some older version of Quark (which you do not have installed any more), you open it up in InDesign, do a little tweak (add cover, add new Introduction, tweak the copyright page), and save again. The backend had the potential to get mangled in the process!

PDF is really built as a FINAL format, not an intermediary one. So if you do not have access to the actual source and RECREATE the PDF from there, you have the potential to introduce trouble.

If you are not the original author/publisher, you probably don't have access to this source... or even if you ARE, sometimes the source documents get lost in the abyss (computer crashes, no backups, publisher goes out of business, etc., etc.). Sometimes the dreaded PDF is all that is left! (ask me how I know)

Quote:

Originally Posted by xristy

I think this knee-jerk response that PDF is for print and can only give a poor user experience on a tablet is indicative of a lack of experience with tablets the size of the iPad and proper reader software.

Or we just had a taste of the "good stuff" (the EPUB kool-aid).

But I agree, very technical books are currently an area where ebooks are lacking. EPUB3 will help alleviate a little more of that pain (but still won't be perfect).

Quote:

Originally Posted by xristy

Of course on a phone or a phablet - these are too small; however, they are also worthless for reading junk ePub or mobi formatted mathematical content which was my original topic.

Reframing does help fit a PDF onto a smaller screen, and maybe turning to landscape might help fit the width of the text on a smaller screen... but compared to a quality EPUB on the same device?

Changing fonts?
Changing colors?
Changing font-size (NOT like a PDF where you zoom in/out)?

Hmmm... I don't know, I would need to see a math book that was quality converted to EPUB (which, as you can see, is an extremely tiny portion of the market). Maybe on MathJax's or Readium's site there are some quality examples.

Quote:

Originally Posted by xristy

There are certainly quite a number of authors who do their own typesetting in LaTeX and such and are quite disturbed at the loss of fidelity that occurs with current approaches to creating ePub and mobi versions from material that is laboriously typeset by the authors not the publishers.

[...]

Typesetting of technical material is part of the communication act and dumbing down the typesetting degrades the content.

EPUB3 might help out in a lot of this regard (allowing more advanced CSS)... but, there will still be a giant (and I mean GIANT) market of older devices out there. The sales on this will be pathetic (sort of like fixed-format EPUBs/MOBIs currently, or those "iBooks only" books).

Having the book already in LaTeX though is fantastic... that will definitely allow them to swap formats much more easily than going through a crappy converter... but the different device sizes/fonts/everything else come into play when converting to HTML, and will ruin a lot of the "typographical finesse" that comes with being able to fit the text to a known size beforehand.

Or they can try to create multiple PDFs to suit different devices (PDF generated using LaTeX for 7" tablets, 10" tablets, iPads, PCs, ...) But this goes back to trying to sell this. This seems to be just asking for customer support headaches.

People already get confused over something as basic as 3 formats: Print/EPUB/Kindle. Let alone introducing multiple PDF sizes to choose from.

xristy · 12-10-2013, 10:08 PM

So in summary:

1) There are lots of older devices in the market which will not display properly prepared eBooks (and there's not even a way to properly prepare in the Kindle format)

2) Publishers and distributors can prepare eBooks with non-scaling images for equations and uncaught spelling and other typographical errors introduced in the OCR conversion process

3) some money can be made selling junk and in the meantime not providing usable PDFs in the market.

Mediocrity reigns!

There's at least one author preparing both a full-sized and a tablet sized PDF set to avoid compromising on quality: The Feynman Lectures on Physics.

Maybe in a few more years ePub 3 with MathML will be properly supported. Currently Chrome and IE do not support MathML (Can I Use MathML?), so that leaves out a large chunk of the market. At least it is possible to generate credible ePub 3 with MathML from LaTeX: Recommended workflow? LaTeX -> ePub with math.
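For anyone who hasn't looked at MathML, here is a minimal Python sketch of what "an equation as markup rather than as an image" means. It builds presentation MathML for the equation x^2 + 1 = 0 using only the standard library. The function name `quadratic_mathml` and the choice of equation are just for illustration; this is not the output of any particular LaTeX converter.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C MathML specification.
MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def quadratic_mathml():
    """Build presentation MathML for the display equation x^2 + 1 = 0."""
    math = ET.Element("math", {"xmlns": MATHML_NS, "display": "block"})
    row = ET.SubElement(math, "mrow")
    # x squared: <msup> takes a base and a superscript.
    sup = ET.SubElement(row, "msup")
    ET.SubElement(sup, "mi").text = "x"   # identifier
    ET.SubElement(sup, "mn").text = "2"   # number
    ET.SubElement(row, "mo").text = "+"   # operator
    ET.SubElement(row, "mn").text = "1"
    ET.SubElement(row, "mo").text = "="
    ET.SubElement(row, "mn").text = "0"
    return ET.tostring(math, encoding="unicode")


print(quadratic_mathml())
```

Because the result is plain text inside the XHTML, a reader that supports MathML can scale it with the font size and recolor it for sepia or inverted themes, which a bitmap with a white background cannot do.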

Maybe if ePub 3 with MathML is supported then the hordes in India will be trained to re-write the equations using MathML - introducing more errors; or maybe someone will develop specialized math OCR to generate MathML.

I also observe that even if the OCR'd text layer in the PDF is poor, at least one still has the actual image of the text, and that is not always preserved in the ePub / mobi versions. It is worth the minor issues with PDFs to have books preserved in a portable manner.

Bottom line: Offer the PDFs for those who prefer to use them. The marginal cost of the PDFs is minimal.

Tex2002ans · 12-11-2013, 01:27 AM

Quote:

Originally Posted by xristy

There's at least one author preparing both a full-sized and a tablet sized PDF set to avoid compromising on quality: The Feynman Lectures on Physics.

Fantastic, but the entire problem here is having to pay multiple times for multiple editions of a book. (one price for tablet, one price for PC, etc. etc.) The entire advantage of the EPUB is that you get ONE FILE that can scale and be run on larger or smaller devices.

Maybe if these publishers charged a ONE TIME fee, and then you can get access to ALL PDFs, that would be fantastic.

I assume that is what sort of happens when you pay for these online digital access bundles for math books (they will offer you a PDF for Desktop, a PDF for tablet, etc., HTML on their site, flash based, etc. etc.).... but usually these things are rife with DRM protection. I was also never one for paying such outrageous fees for temporary nonsense! (I hate this whole idea of the "one-time" online use code as well). You can pay an outrageous fee for the physical book (which you can use forever), or you can pay the outrageous fee minus $60 or so, for some hideously locked down digital version of the same book (which you can use temporarily, or can't read wherever you want).

I tend to try to also avoid anything that forces you towards one device (for example, I completely avoid using any sort of iBooks specific code in my EPUBs). That will only bring trouble once those devices disappear (or in the case of iBooks, they will most likely update and break any complex code you were dependent on).

Side Note: This sort of reminds me of the entire digital/film divide! I watched a documentary called "Side by Side": http://sidebysidethemovie.com/

Film has been around for about a century, and even the oldest films can still be viewed using any film projector (think print books)... while digital movies have gone through so many different formats/storage mediums, and many of the devices used to read these movies have gone the way of the dodo! As a result, many of the purely digital movies/tv shows/culture have been lost down the drain as well! (think ebooks).

Quote:

Originally Posted by xristy

Maybe a few more years and ePub 3 with MathML will be properly supported - currently Chrome and IE do not support MathML (Can I Use MathML?)

Indeed.. MathML is in its infancy, and that will be getting better in the coming years.

Quote:

Originally Posted by xristy

At least it is possible to generate credible ePub 3 with MathML from LaTeX: Recommended workflow? LaTeX -> ePub with math.

Heh, that is actually what I was going to spend my time on this month, although going in the OPPOSITE direction. I am looking into EPUB -> LaTeX -> PDF. (Another reason why I am interested in vectorizing equations, and "HTMLizing" tables, so I can create high quality PDFs!).

Quote:

Originally Posted by xristy

Maybe if ePub 3 with MathML is supported then the hordes in India will be trained to re-write the equations using MathML - introducing more errors; or maybe someone will develop specialized math OCR to generate MathML.

You can only hope! But I don't even think that any sort of book with very complex equations would be very economical to convert from book scans (this book would have to be an older book that would still be expected to sell well, and still be relevant today, where the publisher doesn't have access to the original source).

Usually a lot of these older technical books have obsolete information, OR, they are already done better elsewhere, in an easier form. I am all about digitizing everything though... but for now, those would probably just have to be stuck as scanned PDFs.

MathOCR... now that would be something. OCR is an EXTREMELY hard problem, and adding all of these symbols on top/all over the place, I don't know how well it would work. I did a quick search and it seems like this might be one of the better solutions for that problem (but I expect it would still require a massive amount of human intervention):

http://www.inftyproject.org/en/softw...ml#InftyReader

Quote:

Originally Posted by xristy

I also observe that even if the OCR'd text layer in the PDF is poor, at least one still has the actual image of the text, and that is not always preserved in the ePub / mobi versions. It is worth the minor issues with PDFs to have books preserved in a portable manner.

Yep yep. Hopefully my work is good enough that I have an extremely low error rate. (Another reason why you want to pay for quality conversion and avoid those cheap guys).

That is part of the reason why we release everything as PDF/EPUB (sell MOBIs on Amazon), and sell physical books on Amazon/our store/elsewhere.

For newer books:

Physical
PDF right out of InDesign (that is used for the Print book).
EPUB
- Exported right out of InDesign, then I go in and do my Regexfu/magic on it.
MOBI (for sale on Amazon)
- Derived from the EPUB
HTML (every so often an entire chapter is posted on the site as a part of a "daily article")
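As a rough idea of what a regex cleanup pass over an InDesign-exported EPUB might look like, here is a minimal Python sketch. The three patterns and the function name `tidy_exported_xhtml` are purely hypothetical examples of common export clutter; the actual patterns used in the workflow above are not described in this thread.

```python
import re


def tidy_exported_xhtml(xhtml):
    """Apply a few illustrative regex cleanups to exported XHTML."""
    # Drop spans that contain no text at all.
    xhtml = re.sub(r"<span[^>]*>\s*</span>", "", xhtml)
    # Merge adjacent italic runs that the exporter split: <i>a</i><i>b</i> -> <i>ab</i>
    xhtml = re.sub(r"</i><i>", "", xhtml)
    # Collapse runs of spaces/tabs into a single space.
    xhtml = re.sub(r"[ \t]{2,}", " ", xhtml)
    return xhtml


sample = '<p>Human  <i>Act</i><i>ion</i><span class="x"></span></p>'
print(tidy_exported_xhtml(sample))
```

Regexes like these are only safe on markup whose quirks you already know; anything structural is better handled with a real XML parser.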

Our mentality is that digital books are COMPLEMENTARY goods to go along with the physical books (so we offer them for free). This gives MUCH larger exposure to the book than would otherwise occur, you can see EXACTLY what you will be getting if you purchase the physical book, and for us, we have found that our book sales have skyrocketed since digitizing/releasing the books.

I assume a for-profit publisher could do something similar (maybe a free digital book download along with the physical purchase, which I see some doing now! Or take something like Amazon Matchbook, where you get a huge discount (or free) on the digital version).

For older books:

PDF of the original scans + OCR backend (just like archive.org)
EPUB editions (when I get around to converting them).
- HTML (chapters posted right out of my EPUBs, again, most likely as a part of a "daily article")
- MOBI
Physical (Reprints):
- I don't deal with this, so I don't know many details. To my knowledge, the scans are just cleaned up (speckles removed, etc. etc.), some new intro matter is added, and it is sent for sale.

Quote:

Originally Posted by xristy

Bottom line: Offer the PDFs for those who prefer to use them. The marginal cost of the PDFs is minimal.

I agree.. for anything that has been made within the past two or three decades... The stuff is definitely in digital form SOMEWHERE! And if you already designed the damn digital files for print, why not make those available too?

But as I stated, so many of these source files are lost in the abyss!!! And it is disappointing that all this culture is locked in physical form (or crappy PDFs or scans). Which is why I am making it my goal to chip away at these books in my little corner/niche (non-fiction economics books).

Side Note: You also have publishers who don't want to release a lot of their backlog, because they believe it will compete with their new book sales. There are also a ton of out-of-print works which the publishers have zero intention of bringing back into print. There are a ton of books which we would like to reprint, but the copyright owner either can't be found (orphan works), or tries to demand outrageous license fees.

I once asked, "if we ever do pay for the license fees, would they help us by giving us the source files?" I was laughed at. So yeah, even if you are going through all the legitimate channels, I doubt that many of these publishers would give you access to the source to make your life easier (although maybe if you worked at a big publisher, things might be different). So you would still be relegated to working backwards from a scan/PDF/OCR.

The license fees most likely make it unprofitable for us to even offer a reprint/digital edition, so we almost never do it. This is a huge problem when you look at all the hundreds of thousands/millions of out-of-print books which are in the same exact situation.

xristy · 12-11-2013, 01:59 AM

Quote:

Originally Posted by Tex2002ans

Fantastic, but the entire problem here is having to pay multiple times for multiple editions of a book. (one price for tablet, one price for PC, etc. etc.) The entire advantage of the EPUB is that you get ONE FILE that can scale and be run on larger or smaller devices.

Maybe if these publishers charged a ONE TIME fee, and then you can get access to ALL PDFs, that would be fantastic.

O'Reilly is a publisher which does charge a one-time fee for the set of mobi/ePub/PDF/Daisy with no DRM! Certainly it is a pricing choice and nothing inherent in the use of multiple sized PDFs.

Quote:

Originally Posted by Tex2002ans

You can only hope! But I don't even think that any sort of book with very complex equations would be very economical to convert from book scans (this book would have to be an older book that would still be expected to sell well, and still be relevant today, where the publisher doesn't have access to the original source).

Ooops! I forgot to add

.

Quote:

Originally Posted by Tex2002ans

Yep yep. Hopefully my work is good enough that I have an extremely low error rate. (Another reason why you want to pay for quality conversion and avoid those cheap guys).

That is part of the reason why we release everything as PDF/EPUB (sell MOBIs on Amazon), and sell physical books on Amazon/our store/elsewhere.

As I have mentioned, I get very good results with Acrobat X and good quality scans.

I don't know what Archive.org is doing but their results are not very uplifting as far as OCR'd PDFs and searching.

Quote:

Originally Posted by Tex2002ans

Side Note: You also have publishers who don't want to release a lot of their backlog, because they believe it will compete with their new book sales. There are also a ton of out-of-print works which the publishers have zero intention of bringing back into print. There are a ton of books which we would like to reprint, but the copyright owner either can't be found (orphan works), or tries to demand outrageous license fees.

I once asked, "if we ever do pay for the license fees, would they help us by giving us the source files?" I was laughed at. So yeah, even if you are going through all the legitimate channels, I doubt that many of these publishers would give you access to the source to make your life easier (although maybe if you worked at a big publisher, things might be different). So you would still be relegated to working backwards from a scan/PDF/OCR.

The license fees most likely make it unprofitable for us to even offer a reprint/digital edition, so we almost never do it. This is a huge problem when you look at all the hundreds of thousands/millions of out-of-print books which are in the same exact situation.

Thanks for that observation.

xristy · 12-10-2013, 06:13 AM

interesting, more questions

I suppose it's nice that some people in India get some work out of laboriously generating these ePub and mobi files. I actually didn't realize how the "professionals" were creating these eBooks.

My concern is that the market is being flooded with garbage. It's really a shame that PDF as a format takes such a bad rap for eBooks when in fact it is actually far superior for technical materials (I realize that ePub / mobi are great for mass market fiction and lots of non-fiction which I'm sure is like 95% of the sales.)

PDFs work great on an iPad with, for example, Goodreader and I expect there are excellent PDF readers on Android tablets as well. There's certainly no problem with PDF access on a PC.

I appreciate that there are limitations to some of the Kindle and other eBook readers but it is really unfortunate to destroy the typesetting of technical books for the sake of annoying the authors and the users. There are numerous situations in which Amazon pulls eBooks owing to shoddy preparation and there are plenty of negative reviews of Kindle versions.

I only use PDFs which limits availability since there is a mind set to dump mobi / ePubs that are poorly prepared and that apparently really can't use appropriate technologies such as SVG or MathML.

Creating good quality OCR'd PDFs via scanning of books for which source is not available is tractable for me with Acrobat X so I assume there are better tools available to commercial outfits.

Is there no way to promote the availability of PDFs for technical materials? Aren't PDFs routinely created as part of the workflow for physical printing? Are the sales of poorly formatted mobi / ePub technical books really sufficient to warrant the effort expended in producing them? Wouldn't it be cheaper by far to forego producing mobi / ePub versions and just offer much lower cost PDFs for sale?




Toxaris · 12-09-2013, 04:12 AM

You can also create equations in SVG, but I am not aware if that is supported on the Kindle versions.

Toxaris · 12-10-2013, 06:20 AM

The major gripe with PDF is that it is not suitable for e-books due to the fixed pages. It is an intermediate format for printing in my opinion. You cannot search (normally), change the font-size and so on. That makes it not so suitable for a lot of manuals/guides/technical documents.