Old 12-29-2010, 03:07 PM   #1
swr2408018
Enthusiast
Posts: 35
Karma: 501
Join Date: Jul 2007
Device: PRS-500
Improving "classic" text quality

I was working on an ebook of O. Henry's The Four Million, and realized that the Gutenberg text has a number of undetected OCR errors. For example, in The Gift of the Magi, Gutenberg (along with about 30K web sites that have copied that text!) has "bugs" like: "And them Della leaped up ..." (where "them" should be "then"!)

I found these bugs by doing my own OCR on a 1913 copy of the book, and running that text and the Gutenberg text through a diff processor. I was expecting to find OCR errors in my own text (and I did), but I was surprised to find roughly as many errors in the Gutenberg text. (There are also differences in spelling and punctuation that may simply reflect editorial changes in whatever physical edition was used for the Gutenberg text; I'm not counting those.)
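As a rough illustration of the comparison described above, here is a minimal Python sketch using the standard-library difflib module. It is not the diff processor actually used for this book; the function names `words` and `word_diffs` are made up for the example, and the "my OCR" string is simply the corrected form of the quoted fragment. Case and punctuation are normalized away first, on the assumption that those differences are editorial and should not be counted.

```python
import difflib
import re


def words(text):
    # Lowercase and drop punctuation so that editorial differences in
    # case and punctuation are not reported as OCR errors.
    return re.findall(r"[a-z0-9']+", text.lower())


def word_diffs(text_a, text_b):
    """Return (a_words, b_words) pairs for every run of words that differs."""
    a, b = words(text_a), words(text_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


gutenberg = "And them Della leaped up"    # fragment quoted in the post
my_ocr = "And then Della leaped up,"      # hypothetical second OCR run
print(word_diffs(gutenberg, my_ocr))      # [('them', 'then')]
```

Each reported pair is only a candidate: the diff shows where the two texts disagree, not which one is right, so every hit still has to be checked against the page image.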

Has anyone else tried this technique (that is, comparing independent OCR runs made from different copies of the same work) as a method for improving the quality of their texts?

And, how much does anyone care?!

Steve
Old 12-29-2010, 03:48 PM   #2
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Personally I care a great deal, which is why I spend on average a couple of hours every evening proof-reading public domain texts (see my signature for what I'm proofing at the moment). The only way to properly proof, though, is to read the e-text and a printed text (or page scan) side-by-side and directly compare them.
Old 12-29-2010, 06:56 PM   #3
swr2408018
Enthusiast
What's interesting about the dual (or more) comparison technique is that it demonstrably catches OCR errors that have escaped multiple passes of side-by-side ebook and physical book examination. Where it would fail, of course, is when both OCR outputs have the same error in the same place. At some point, simply doing additional OCR passes on additional instances of the text will have diminishing returns, but you still may not be down to zero defects.

It would be interesting to perform this experiment: find a text with lots of good scans at the Internet Archive, then track the number of additional defects found with each additional comparison, and then engage multiple sets of eyes to look for remaining defects after that process is complete.
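The bookkeeping for such an experiment could be sketched as follows, again in Python with difflib. Everything here is an assumption for illustration: the function names are invented, and the sample strings are toy data, not real scans. The sketch counts, for each additional OCR "witness", how many word positions in the base text are flagged for the first time.

```python
import difflib
import re


def words(text):
    # Normalize case and punctuation, as in a plain word-level comparison.
    return re.findall(r"[a-z0-9']+", text.lower())


def disagreeing_positions(base, other):
    """Word indices in `base` where `other` (also a word list) differs."""
    matcher = difflib.SequenceMatcher(None, base, other, autojunk=False)
    positions = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # A pure insertion has i1 == i2; record it at the insertion point.
        positions.update(range(i1, max(i2, i1 + 1)))
    return positions


def new_candidates_per_witness(base_text, witness_texts):
    """For each witness in order, count base positions flagged for the first time."""
    base = words(base_text)
    seen = set()
    counts = []
    for witness in witness_texts:
        found = disagreeing_positions(base, words(witness))
        counts.append(len(found - seen))
        seen |= found
    return counts


# Toy data: the base text has two OCR errors ("brovvn", "ovcr").
base = "the quick brovvn fox jumps ovcr the lazy dog"
witnesses = [
    "the quick brown fox jumps ovcr the lazy dog",  # shares the "ovcr" error
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
]
print(new_candidates_per_witness(base, witnesses))  # [1, 1, 0]
```

The toy run shows both effects discussed above: the first witness cannot expose an error it shares with the base text, and the third witness adds nothing new. The counts are candidate positions only; an error common to every witness never shows up, which is where the extra sets of eyes come in.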

Steve