Old 06-15-2012, 09:50 AM   #1
Iznogood
Guru
Iznogood ought to be getting tired of karma fortunes by now.
 
 
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: iPad, Kindle Paperwhite
Tools and methodology for easier proof-reading

Hi

I have recently been experimenting a bit, trying to find tools to help ease the proof-reading phase of the conversion from paper books to epub.

When running OCR, there will always be errors, and every program has its own recognition algorithm. In other words, ABBYY FineReader is good at some things and OmniPage at others; OmniPage can correctly recognize text that FineReader gets wrong, and vice versa.

My idea is that by running the same scan through several OCR programs and comparing the output, I could automatically detect some of the errors. Of course, an ordinary diff will not be sufficient, because the markup will differ greatly, but I have found a program named HTML Match that can process HTML pages and show the differences between them (screenshot attached).
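To illustrate the idea (this is not HTML Match's algorithm, just a minimal sketch using Python's standard library): strip the markup from each OCR program's HTML output, then diff the resulting word sequences. Every place the two engines disagree is a likely OCR error worth a human look. The sample strings below are invented examples of a classic rn/m confusion.

```python
import difflib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, ignoring all markup."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)
    def text(self):
        return " ".join(self.parts)

def words(html):
    """Strip markup and split into a flat list of words."""
    p = TextExtractor()
    p.feed(html)
    return re.findall(r"\S+", p.text())

def ocr_disagreements(html_a, html_b):
    """Return (words_from_a, words_from_b) pairs where the two OCR runs differ."""
    a, b = words(html_a), words(html_b)
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]

# Hypothetical outputs: one engine read "burn", the other "bum" (rn/m confusion)
fr = "<p>He watched the fire burn slowly.</p>"
op = "<p>He watched the fire bum slowly.</p>"
for from_a, from_b in ocr_disagreements(fr, op):
    print(from_a, "vs", from_b)  # prints ['burn'] vs ['bum']
```

Diffing at the word level rather than the character level is deliberate: it survives the differing line breaks and hyphenation the two engines produce, which is exactly why a plain file diff fails here.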

I have also experimented with two editions of the same book; the two editions were published some years apart, set in slightly different fonts, and had other minor differences. I observed that phrases with errors in one edition could be correctly recognized in the other edition.

My theory from this is that by having multiple versions of the source, one can to a large extent detect errors automatically. Versions of the text can come from the following sources:
  • scans of a book
  • scans from another (identical) copy of the same book
  • scans from different editions of the book
  • raw scans or scans cleaned with e.g. ScanTailor
It is also possible to extend this list with versions of the epub from the darknet or from Project Gutenberg. It sounds a bit stupid to scan the book if you already have it from one of these two sources, but the two versions can be compared against each other to find the differences and correct the errors.
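With three or more versions from the sources above, the idea can go a step beyond detection: where the versions disagree, let them outvote each other. The sketch below is a simplification I'm assuming for illustration (it aligns everything against the first version and only votes on same-length substitutions); a real tool would need a proper multi-sequence alignment. The sample scans are invented.

```python
import difflib
from collections import Counter

def align_pair(ref, other):
    """Map each word index in `ref` to the matching word in `other` (or None)."""
    sm = difflib.SequenceMatcher(a=ref, b=other, autojunk=False)
    mapping = [None] * len(ref)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        # Accept exact matches and same-length substitutions (likely OCR slips)
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for k in range(i2 - i1):
                mapping[i1 + k] = other[j1 + k]
    return mapping

def majority_vote(versions):
    """Pick, for each word position, the reading most versions agree on.
    The first version serves as the alignment reference -- a simplification."""
    ref = versions[0]
    columns = [align_pair(ref, v) for v in versions[1:]]
    result = []
    for i, word in enumerate(ref):
        votes = Counter([word] + [col[i] for col in columns if col[i] is not None])
        result.append(votes.most_common(1)[0][0])
    return result

scans = [
    "the quick brown fox".split(),
    "the quick brovvn fox".split(),  # vv/w confusion in one scan
    "the quick brown fox".split(),
]
print(" ".join(majority_vote(scans)))  # prints: the quick brown fox
```

Positions where the vote is not unanimous are exactly the spots to flag for manual proofreading, so the same machinery covers both the automatic-correction and the detection cases.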

HTML Match has its flaws, and there are certainly weaknesses in its algorithm. Is anyone else "out there" using a similar method, or does anyone know of better tools for diffing HTML files than HTML Match?

I have been fantasizing about writing a similar program myself, correcting some of the errors in the algorithm used by HTML Match and possibly making it interactive. With HTML Match, I have to find the errors, then locate them in the original file and correct them there; it would be better to have a single program that does both.

Any tips on more suitable software or ways to detect OCR errors are most welcome.
Attached: htmldiff.jpg (screenshot of HTML Match showing the differences between two OCR outputs)

Last edited by Iznogood; 06-15-2012 at 09:52 AM.