Old 06-15-2012, 09:50 AM   #1
Iznogood
Guru
Iznogood ought to be getting tired of karma fortunes by now.
 
 
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: iPad, Kindle Paperwhite
Tools and methodology for easier proof-reading

Hi

I have recently been experimenting a bit, trying to find tools to help ease the proof-reading phase of the conversion from paper books to epub.

When running OCR, there will always be errors, and every program has its own recognition algorithm. In other words, ABBYY FineReader is good at some things and OmniPage at others; OmniPage can correctly recognize text that FineReader gets wrong, and vice versa.

My idea is that by running the same scan through several OCR programs and comparing the output, I could automatically detect some of the errors. Of course, an ordinary diff will not be sufficient, because the markup will differ greatly, but I have found a program named HTML Match that can process HTML pages and show the differences between them (screenshot attached).
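To illustrate the idea (this is not HTML Match's algorithm, just a minimal sketch using Python's standard library): strip the markup from each OCR program's HTML output, then diff the resulting word sequences. Every place the two engines disagree is a likely OCR error worth a human look. The sample strings below are invented examples of a classic rn/m confusion.

```python
import difflib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, ignoring all markup."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)
    def text(self):
        return " ".join(self.parts)

def words(html):
    """Strip markup and split into a flat list of words."""
    p = TextExtractor()
    p.feed(html)
    return re.findall(r"\S+", p.text())

def ocr_disagreements(html_a, html_b):
    """Return (words_from_a, words_from_b) pairs where the two OCR runs differ."""
    a, b = words(html_a), words(html_b)
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]

# Hypothetical outputs: one engine read "burn", the other "bum" (rn/m confusion)
fr = "<p>He watched the fire burn slowly.</p>"
op = "<p>He watched the fire bum slowly.</p>"
for from_a, from_b in ocr_disagreements(fr, op):
    print(from_a, "vs", from_b)  # prints ['burn'] vs ['bum']
```

Diffing at the word level rather than the character level is deliberate: it survives the differing line breaks and hyphenation the two engines produce, which is exactly why a plain file diff fails here.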

I have also experimented with two editions of the same book; the two editions were published some years apart, set in slightly different fonts, and had other minor differences. I observed that phrases with errors in one edition could be correctly recognized in the other edition.

My theory from this is that by having multiple versions of the source, one can to a large extent detect errors automatically. Versions of the text can come from the following sources:
  • scans of a book
  • scans from another (identical) copy of the same book
  • scans from different editions of the book
  • raw scans or scans cleaned with e.g. ScanTailor
It is also possible to extend this list with versions of the epub from the darknet or from Project Gutenberg. It sounds a bit stupid to scan the book if you already have it from one of these two sources, but the two versions can be compared against each other to find the differences and correct the errors.
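With three or more versions from the sources above, the idea can go a step beyond detection: where the versions disagree, let them outvote each other. The sketch below is a simplification I'm assuming for illustration (it aligns everything against the first version and only votes on same-length substitutions); a real tool would need a proper multi-sequence alignment. The sample scans are invented.

```python
import difflib
from collections import Counter

def align_pair(ref, other):
    """Map each word index in `ref` to the matching word in `other` (or None)."""
    sm = difflib.SequenceMatcher(a=ref, b=other, autojunk=False)
    mapping = [None] * len(ref)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        # Accept exact matches and same-length substitutions (likely OCR slips)
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for k in range(i2 - i1):
                mapping[i1 + k] = other[j1 + k]
    return mapping

def majority_vote(versions):
    """Pick, for each word position, the reading most versions agree on.
    The first version serves as the alignment reference -- a simplification."""
    ref = versions[0]
    columns = [align_pair(ref, v) for v in versions[1:]]
    result = []
    for i, word in enumerate(ref):
        votes = Counter([word] + [col[i] for col in columns if col[i] is not None])
        result.append(votes.most_common(1)[0][0])
    return result

scans = [
    "the quick brown fox".split(),
    "the quick brovvn fox".split(),  # vv/w confusion in one scan
    "the quick brown fox".split(),
]
print(" ".join(majority_vote(scans)))  # prints: the quick brown fox
```

Positions where the vote is not unanimous are exactly the spots to flag for manual proofreading, so the same machinery covers both the automatic-correction and the detection cases.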

HTML Match has its flaws, and there are certainly weaknesses in its algorithm. Is anyone else "out there" using a similar method, or does anyone know of better tools for diffing HTML files than HTML Match?

I have been fantasizing about writing a similar program myself, correcting some of the errors in the algorithm used by HTML Match and possibly making it interactive. With HTML Match, I have to find the errors, then locate them in the original file and correct them there; it would be better to have a single program that does both.

Any tips on more suitable software or ways to detect OCR errors are most welcome.
Attached: htmldiff.jpg (screenshot of HTML Match showing the differences between two OCR outputs)

Last edited by Iznogood; 06-15-2012 at 09:52 AM.