MobileRead Forums - View Single Post - Authors Guild and Google reach settlement: Millions of scanned books to be available.

DMcCunney · 10-29-2008, 10:40 AM

Quote:

Originally Posted by TallMomof2

A scanned page image is essentially a photograph of the page. Like any picture, it is not seen as text (characters) by the ebook program. What you have to do is run the scanned pages through an OCR program to convert the images to text, so that the content is treated as text instead of as an image. The "gotcha" is that conversion usually results in many errors that require a human to edit the text. I can't tell you how many ebooks I've read that are poorly converted scanned pages. And these are from legitimate publishers.

Precisely. No OCR program is perfect. Ligatures are a special problem, and multi-column layouts can throw off the OCR software bundled with things like home scanners. Higher-end professional gear does better, but it costs, and the output will still need editing and proofreading to get good copy.
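To make the ligature point concrete, here is a minimal sketch of the one piece of cleanup that can be automated: when the OCR output contains Unicode ligature code points (such as U+FB01 for "fi"), they can be expanded back to plain letters so the text searches and reflows normally. The name `expand_ligatures` and the mapping table are my own illustration, not part of any particular OCR package, and this does nothing for the harder case where the software misreads a ligature as some other character entirely. That still takes a proofreader.

```python
# Hypothetical cleanup step for OCR output (illustrative, not from any OCR tool).
# Maps the common Latin ligature code points to their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Build the translation table once; str.translate applies it in a single pass.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature code points in `text` with ordinary letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    sample = "An e\ufb03cient work\ufb02ow for every \ufb01le"
    print(expand_ligatures(sample))
```

An explicit table is used here rather than blanket Unicode normalization so that only ligatures are touched and other characters in the text are left alone.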

The publishers whose shoddy work you read skimped on or eliminated the editing step to cut costs.

(And that's just for texts in the Roman alphabet. If the original book was in some other script, all bets are off.)
______
Dennis