MobileRead Forums - View Single Post

swr2408018 · 12-29-2010, 03:07 PM

I was working on an ebook of O. Henry's The Four Million and realized that the Gutenberg text has a number of undetected OCR errors. For example, in The Gift of the Magi, Gutenberg (along with about 30,000 web sites that have copied that text!) has "bugs" like: "And them Della leaped up ..." (where "them" should be "then"!)

I found these bugs by doing my own OCR on a 1913 copy of the book and running that text and the Gutenberg text through a diff tool. I was expecting to find OCR errors in my own text (and I did), but I was surprised to find roughly as many errors in the Gutenberg text. (There are also differences in spelling and punctuation that may simply reflect editorial changes in whatever physical edition was used for the Gutenberg text; I'm not counting those.)
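For anyone who wants to try the comparison, here is a minimal sketch in Python using the standard difflib module. It is not necessarily the diff tool used above; the function names (words, suspect_words) are made up for illustration. It compares the two texts word by word and ignores case and punctuation, on the assumption that those differences are mostly editorial rather than OCR errors.

```python
import difflib
import re


def words(text):
    # Lowercase and keep only letters, digits and straight apostrophes,
    # so punctuation and capitalization differences are not reported.
    return re.findall(r"[a-z0-9']+", text.lower())


def suspect_words(text_a, text_b):
    """Return (a_words, b_words) pairs wherever the two texts disagree."""
    a, b = words(text_a), words(text_b)
    # autojunk=False: do not let difflib treat very common words as junk
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    suspects = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            suspects.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return suspects


# Sample strings: the phrase quoted above, with and without the error.
gutenberg = "And them Della leaped up"
my_ocr = "And then Della leaped up"
for g, m in suspect_words(gutenberg, my_ocr):
    print(f"Gutenberg: {g!r}   my OCR: {m!r}")
```

The diff only shows where the two texts disagree, not which one is right, so each flagged pair still has to be checked against the page scan.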

Has anyone else tried this technique (that is, comparing independent OCR runs of different editions) as a way to improve the quality of their texts?

And, how much does anyone care?!

Steve

Thread: Improving "classic" text quality (post #1)