MobileRead Forums - View Single Post - Some perspective on why ebooks are so filled with errors.

Elfwreck · 10-25-2011, 10:50 AM

Quote:

Originally Posted by Hitch

And, just for s&g's, I've asked several clients if I may use some of their pages here, for demonstration purposes (I do not know if I will obtain permission)--these clients had scan & OCR. I'm asking them if I may post 1-2 original pages from a PDF, and the resulting RAW scanned output;

I have samples. I do ebook conversions of public domain work that Gutenberg doesn't have, and a few other things.

This is a page from "Tales of Hoffman - Trial of the Chicago 8 7", which is not in the public domain, but the majority of the text is, because trial transcripts are public domain. I figure that a page for educational purposes falls well within fair use, for the thirty words that may not be part of the transcript. (It's possible all of it is transcript.)

The PDF was scanned at 400dpi in Acrobat Pro (which isn't the best, but is tolerable); the Word doc is auto-read in FineReader 7, after removing the page number. For this one, read quality's great; line breaks and the separating asterisks are the big problem.
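For anyone who wants to script the line-break part rather than fix it by hand, here's a rough Python sketch. It is not what I actually do, just an illustration: `unwrap_ocr_lines` is a made-up name, and it assumes that blank lines mark real paragraph breaks, that a line of three or more asterisks is a section separator, and that a hyphen at the end of a line means a split word (which will be wrong for genuine hyphenated compounds, so the output still needs a proofread).

```python
import re


def unwrap_ocr_lines(text: str) -> str:
    """Join hard line breaks inside paragraphs of OCR output.

    Assumptions (hypothetical, adjust for your own scans):
    - a blank line is a real paragraph break
    - a line made only of 3+ asterisks is a section separator
    - a trailing hyphen means the word was split across lines
    """
    paragraphs = []  # finished paragraphs and separators
    current = []     # lines of the paragraph being assembled

    def flush():
        # Close the paragraph in progress, if there is one.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()
        elif re.fullmatch(r"(\*\s*){3,}", line):
            # Normalize any run of asterisks to one separator style.
            flush()
            paragraphs.append("* * *")
        elif current and current[-1].endswith("-"):
            # Rejoin a word split at the line end, dropping the hyphen.
            current[-1] = current[-1][:-1] + line
        else:
            current.append(line)
    flush()
    return "\n\n".join(paragraphs)
```

Feed it the raw text exported from the OCR program and it hands back one line per paragraph, with the separators on their own lines. It does nothing about misread characters; that still takes eyes on the scan.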

Second sample is from "Magic and Fetishism," a public domain work available through Archive.org.

This one has more obvious problems: extra punctuation caused by dots on the page, foreign words that are mostly misspelled, and punctuation that is often wrong. And this is a good, clear scan of text that isn't tightly condensed.

Next sample: from Inglis' "Principles of Secondary Education," another PD book. This one's a nightmare for conversion; lots of tiny text in charts & tables.

I don't do most of the corrections in Word; I do them in FineReader, where I can see the text next to the scans, but that's not always an option.