#1
Enthusiast
Posts: 35
Karma: 501
Join Date: Jul 2007
Device: PRS-500
Improving "classic" text quality
I was working on an ebook of O. Henry's The Four Million, and realized that the Gutenberg text has a number of undetected OCR errors. For example, in The Gift of the Magi, Gutenberg (along with about 30K web sites that have copied that text!) has "bugs" like: "And them Della leaped up ..." (where "them" should be "then"!).
I found these bugs by doing my own OCR on a 1913 copy of the book and running that text and the Gutenberg text through a diff processor. I was expecting to find OCR errors in my own text (and I did), but I was surprised to find roughly as many errors in the Gutenberg text. (There are also differences in spelling and punctuation that may simply reflect editorial changes in whatever physical edition was used for the Gutenberg work; I'm not counting those.)

Has anyone else tried this technique (that is, comparing different OCR runs of different copies) as a method for improving the quality of their texts? And, how much does anyone care?!

Steve
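The diff-processor step can be sketched in a few lines with Python's standard `difflib`; this is a minimal illustration of the idea, not the actual tool the poster used, and the word-level tokenization (simple whitespace split) is an assumption.

```python
import difflib

def ocr_diff(text_a, text_b):
    """Report word-level disagreements between two OCR runs of the same text.

    Each disagreement is a candidate OCR error in one run or the other;
    a human still has to check the printed page to decide which is right.
    """
    a, b = text_a.split(), text_b.split()
    diffs = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs

# The "them"/"then" bug from the post:
gutenberg = "And them Della leaped up like a little singed cat"
my_ocr    = "And then Della leaped up like a little singed cat"
print(ocr_diff(gutenberg, my_ocr))  # [('them', 'then')]
```

In practice you would run this over whole chapter files and skim the (usually short) list of disagreements, rather than rereading the entire book.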
#2
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Personally, I care a great deal, which is why I spend on average a couple of hours every evening proof-reading public domain texts (see my signature for what I'm proofing at the moment). The only way to proof properly, though, is to read the e-text and a printed text (or page scan) side by side and directly compare them.
#3
Enthusiast
Posts: 35
Karma: 501
Join Date: Jul 2007
Device: PRS-500
What's interesting about the dual (or more) comparison technique is that it demonstrably catches OCR errors that have escaped multiple passes of side-by-side ebook and physical book examination. Where it would fail, of course, is when both OCR outputs have the same error in the same place. At some point, simply doing additional OCR passes on additional copies of the text will have diminishing returns, but you may still not be down to zero defects.
It would be interesting to perform this experiment: find a text with lots of good scans at the Internet Archive, track the number of additional defects found with each additional comparison, and then engage multiple sets of eyes to look for remaining defects after that process is complete.

Steve
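With three or more scans, the pairwise diff generalizes to a per-word majority vote. The sketch below is a simplification: it assumes the runs tokenize to the same number of words (real scans would need alignment first, e.g. via `difflib`), and it also surfaces the failure mode described above, since positions where every run agrees are exactly the ones this method cannot catch.

```python
from collections import Counter

def consensus(runs):
    """Per-word majority vote across several OCR runs of the same text.

    Returns the voted text plus the word positions where every run agreed.
    A shared error at a unanimous position is invisible to this method.
    """
    tokens = [r.split() for r in runs]
    assert len({len(t) for t in tokens}) == 1, "runs must align word-for-word"
    voted, unanimous = [], []
    for i, words in enumerate(zip(*tokens)):
        winner, count = Counter(words).most_common(1)[0]
        voted.append(winner)
        if count == len(runs):
            unanimous.append(i)
    return " ".join(voted), unanimous

runs = [
    "And them Della leaped up",  # the Gutenberg error
    "And then Della leaped up",  # a second scan, correct here
    "And then Della leaped up",  # a third scan, also correct
]
text, same = consensus(runs)
print(text)  # And then Della leaped up
```

Tracking how many positions leave the `unanimous` list with each added scan would give exactly the diminishing-returns curve the experiment above is after.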