#1
Enthusiast
Posts: 35
Karma: 501
Join Date: Jul 2007
Device: PRS-500
Improving "classic" text quality
I was working on an ebook of O. Henry's The Four Million, and realized that the Gutenberg text has a number of undetected OCR errors. For example, in The Gift of the Magi, Gutenberg (along with about 30K web sites that have copied that text!) has "bugs" like: "And them Della leaped up ..." (where "them" should be "then"!).
I found these bugs by doing my own OCR on a 1913 copy of the book and running that text and the Gutenberg text through a diff processor. I was expecting to find OCR errors in my own text (and I did), but I was surprised to find roughly as many errors in the Gutenberg text. (There are also differences in spelling and punctuation that may simply reflect editorial changes in whatever physical edition was used for the Gutenberg work; I'm not counting those.)

Has anyone else tried this technique (that is, comparing different OCR runs of different copies) as a method for improving the quality of their texts? And, how much does anyone care?!

Steve
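The diff-processor step can be sketched in a few lines with Python's standard `difflib`; this is a minimal illustration of the idea, not the actual tool the poster used, and the word-level tokenization (simple whitespace split) is an assumption.

```python
import difflib

def ocr_diff(text_a, text_b):
    """Report word-level disagreements between two OCR runs of the same text.

    Each disagreement is a candidate OCR error in one run or the other;
    a human still has to check the printed page to decide which is right.
    """
    a, b = text_a.split(), text_b.split()
    diffs = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs

# The "them"/"then" bug from the post:
gutenberg = "And them Della leaped up like a little singed cat"
my_ocr    = "And then Della leaped up like a little singed cat"
print(ocr_diff(gutenberg, my_ocr))  # [('them', 'then')]
```

In practice you would run this over whole chapter files and skim the (usually short) list of disagreements, rather than rereading the entire book.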
#2
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Personally, I care a great deal, which is why I spend on average a couple of hours every evening proof-reading public domain texts (see my signature for what I'm proofing at the moment). The only way to proof properly, though, is to read the e-text and a printed text (or page scan) side by side and directly compare them.
#3
Enthusiast
Posts: 35
Karma: 501
Join Date: Jul 2007
Device: PRS-500
What's interesting about the dual (or more) comparison technique is that it demonstrably catches OCR errors that have escaped multiple passes of side-by-side ebook and physical book examination. Where it would fail, of course, is when both OCR outputs have the same error in the same place. At some point, simply doing additional OCR passes on additional copies of the text will have diminishing returns, but you may still not be down to zero defects.
It would be interesting to perform this experiment: find a text with lots of good scans at the Internet Archive, track the number of additional defects found with each additional comparison, and then engage multiple sets of eyes to look for remaining defects after that process is complete.

Steve
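With three or more scans, the pairwise diff generalizes to a per-word majority vote. The sketch below is a simplification: it assumes the runs tokenize to the same number of words (real scans would need alignment first, e.g. via `difflib`), and it also surfaces the failure mode described above, since positions where every run agrees are exactly the ones this method cannot catch.

```python
from collections import Counter

def consensus(runs):
    """Per-word majority vote across several OCR runs of the same text.

    Returns the voted text plus the word positions where every run agreed.
    A shared error at a unanimous position is invisible to this method.
    """
    tokens = [r.split() for r in runs]
    assert len({len(t) for t in tokens}) == 1, "runs must align word-for-word"
    voted, unanimous = [], []
    for i, words in enumerate(zip(*tokens)):
        winner, count = Counter(words).most_common(1)[0]
        voted.append(winner)
        if count == len(runs):
            unanimous.append(i)
    return " ".join(voted), unanimous

runs = [
    "And them Della leaped up",  # the Gutenberg error
    "And then Della leaped up",  # a second scan, correct here
    "And then Della leaped up",  # a third scan, also correct
]
text, same = consensus(runs)
print(text)  # And then Della leaped up
```

Tracking how many positions leave the `unanimous` list with each added scan would give exactly the diminishing-returns curve the experiment above is after.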