Improving "classic" text quality
While working on an ebook of O. Henry's The Four Million, I realized that the Gutenberg text has a number of undetected OCR errors. For example, in The Gift of the Magi, the Gutenberg text (along with about 30K web sites that have copied it!) has "bugs" like: "And them Della leaped up ..." (where "them" should be "then").
I found these bugs by doing my own OCR on a 1913 copy of the book, then running that text and the Gutenberg text through a diff processor. I expected to find OCR errors in my own text (and I did), but I was surprised to find roughly as many errors in the Gutenberg text. (There are also differences in spelling and punctuation that may simply reflect editorial changes in whatever physical edition was used for the Gutenberg work; I'm not counting those.)
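For anyone who wants to try the comparison, here is a minimal sketch in Python using the standard difflib module. It is not the diff processor I actually used, just one way the idea could look. The function names (tokenize, word_diffs) are made up for illustration, and the second sample string is the corrected reading of the line quoted above, standing in for a second OCR run. Comparing word by word rather than line by line keeps differences in line wrapping from swamping the real discrepancies.

```python
import difflib
import re


def tokenize(text):
    # Split on whitespace so that differences in line wrapping are ignored.
    return re.findall(r"\S+", text)


def word_diffs(text_a, text_b):
    """Return (a_words, b_words) pairs for every span where the texts disagree."""
    a = tokenize(text_a)
    b = tokenize(text_b)
    # autojunk=False: do not let difflib treat frequent words as junk.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


# Sample strings: the first is the reading quoted in this post, the second
# is the corrected reading, standing in for a second OCR run (assumption).
gutenberg = "And them Della leaped up"
my_ocr = "And then Della leaped up"

for left, right in word_diffs(gutenberg, my_ocr):
    print(f"{left!r} vs {right!r}")
```

Each reported pair still needs a human to decide which side is right (or whether it is just an editorial difference), so this only narrows down where to look.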
Has anyone else tried this technique (that is, comparing independent OCR runs of different editions of the same work) as a method for improving the quality of their texts?
And, how much does anyone care?!
Steve