Old 06-15-2012, 09:50 AM   #1
Iznogood
Guru
Posts: 929
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
Tools and methodology for easier proof-reading

Hi

I have recently been experimenting a bit, trying to find tools to help ease the proof-reading phase of the conversion from paper books to epub.

When running OCR, there will always be errors, and every program has its own algorithm for recognizing text. In other words, ABBYY FineReader is good at some things, OmniPage is good at other things, and OmniPage can correctly recognize text that FineReader gets wrong.

My idea is that by running the same scan through several OCR programs, and comparing the output, I could automatically detect some of the errors. Of course, ordinary diff will not be sufficient, because the markup will differ greatly, but I have found a program named HTML Match that can process html pages and show the differences between them (screenshot attached).

I have also experimented with two editions of the same book; the two editions were published some years apart, set in slightly different fonts, and had other minor differences. I observed that phrases with errors in one edition could be correctly recognized in the other edition.

My theory from this is that by having multiple versions of the source, one can to a large extent detect errors automatically. Versions of text can come from the following sources:
  • scans of a book
  • scans from another (identical) copy of the same book
  • scans from different editions of the book
  • raw scans or scans cleaned with e.g. ScanTailor
It is also possible to extend this list to include versions of the epub from the darknet or from Project Gutenberg. It sounds a bit stupid to scan the book if you already have it from one of these two sources, but it is possible to compare these two formats against each other to find the differences and correct the errors.
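
As a rough illustration of the idea, here is a minimal Python sketch that compares two OCR outputs word by word and lists the places where they disagree. It assumes the markup has already been stripped so that both inputs are plain text. The function name `ocr_disagreements` and the sample strings are made up for the example, and the standard-library `difflib` stands in for HTML Match.

```python
import difflib


def ocr_disagreements(text_a: str, text_b: str):
    """Return (words_from_a, words_from_b) pairs where two OCR outputs differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False keeps difflib from ignoring very frequent words in long texts.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    suspects = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            suspects.append((" ".join(words_a[a1:a2]), " ".join(words_b[b1:b2])))
    return suspects


if __name__ == "__main__":
    # Invented sample strings standing in for the output of two OCR programs.
    first = "The quick brown fox jumps over the lazy dog"
    second = "The quick hrown fox jumps over the 1azy dog"
    for a, b in ocr_disagreements(first, second):
        print(f"{a!r} <-> {b!r}")
```

Every pair returned is a spot where at least one of the versions is probably wrong, so only those spots need to be checked against the scan. Errors that both programs make identically will of course not show up.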

HTML Match has its errors and there are certainly weaknesses in the algorithm. Is there anyone else "out there" using a similar method, or who knows of better tools for diffing HTML files than HTML Match?

I have been fantasizing about writing a similar program myself, but correcting some of the errors in the algorithm used by HTML Match and possibly making it interactive. With HTML Match, I have to find the errors, and then find them in the original file and correct them there. It would be better to have just one program to do this in.

Any tips on more suitable software or ways to detect OCR errors are most welcome.
Attached thumbnail: htmldiff.jpg

Last edited by Iznogood; 06-15-2012 at 09:52 AM.
Old 06-15-2012, 10:55 AM   #2
Doitsu
Wizard
Posts: 1,989
Karma: 4633978
Join Date: Dec 2010
Device: Kindle PW2
Fellow Norwegian MR member SBT posted a similar topic some time ago and came up with an interesting script himself. Unfortunately, you can only use it if you have a Linux machine or a Mac. (Windows users need to install Cygwin.)
Old 06-15-2012, 11:53 AM   #3
Iznogood
Guru
I run Ubuntu myself and Windows is "confined" to VirtualBox, so I will certainly take a look at his script. If I read his code correctly, he compares everything: HTML markup, CSS styles and HTML text. Since markup can differ without affecting the epub, I don't think an "ordinary" diff or diff3 will do what I want.
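
One way to sidestep the markup problem, sketched here in Python on the assumption that only the visible text matters: pull the words out of each HTML file with the standard-library `html.parser`, then diff the word lists. The names `TextOnly`, `html_words` and `text_differences` are invented for this example. This is not how SBT's script or HTML Match works.

```python
import difflib
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible words, skipping tags, attributes and <style>/<script> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def html_words(html: str) -> list:
    """Return the visible words of an HTML string, in document order."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.words


def text_differences(html_a: str, html_b: str) -> list:
    """Return (words_a, words_b) pairs where the visible text differs."""
    a, b = html_words(html_a), html_words(html_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]
```

With this, `<p class="a">The cat</p>` and `<div><span>The</span> cat</div>` compare as equal, because only the words are considered. One known weakness of the sketch: a word broken up by an inline tag (for example a styled initial letter) ends up as two tokens.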

Last edited by Iznogood; 06-15-2012 at 12:01 PM. Reason: typo
Old 06-16-2012, 12:40 AM   #4
Iznogood
Guru
After a bit more searching (and researching), I found a program called DiffMerge that can run a diff while ignoring tags and/or classes, depending on how it is configured. It can also merge three sources into one, and best of all, it's cross-platform and free(!).
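
The three-source merge can be pictured as a majority vote. The sketch below is only an illustration in Python, not a description of how DiffMerge works internally. It assumes the three versions have already been aligned into word lists of equal length (which is the hard part in practice), and `majority_merge` is a hypothetical name.

```python
from collections import Counter


def majority_merge(version_a, version_b, version_c):
    """Merge three pre-aligned word lists; report positions with no majority."""
    if not (len(version_a) == len(version_b) == len(version_c)):
        raise ValueError("word lists must be aligned to the same length")
    merged, unresolved = [], []
    for index, candidates in enumerate(zip(version_a, version_b, version_c)):
        word, votes = Counter(candidates).most_common(1)[0]
        if votes >= 2:
            # At least two of the three versions agree on this word.
            merged.append(word)
        else:
            # All three differ: keep the first version and flag it for a human.
            merged.append(candidates[0])
            unresolved.append(index)
    return merged, unresolved
```

The positions returned in `unresolved` are the ones that still need a look at the scan; everything else is settled by two versions outvoting the third.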
Old 06-16-2012, 03:19 AM   #5
HarryT
eBook Enthusiast
Posts: 64,005
Karma: 42472847
Join Date: Nov 2006
Location: UK
Device: PW2, iPad Retina Mini, iPhone 4, MS Surface Pro, Kobo H2O, N7
All these tools can help but, at the end of the day, there's no substitute for human proofreading. The only way to properly proofread a text is to have the computer screen alongside the printed book and read them in parallel, word by word, comma by comma.
Old 06-16-2012, 04:37 AM   #6
SBT
Fanatic
Posts: 505
Karma: 560981
Join Date: Sep 2010
Location: Norway
Device: prs-t1, phone/Cool Reader, tablet/BlueFire, Nook Simple
I do proofing broadly similar to norway1456. My workflow is roughly as follows:
  1. Download multiple versions of a book from the Internet Archive
    AND/OR
  2. Do two separate scans, 150 and 300 dpi is what I use.
  3. Use vimdiff for spotting differences and merging
  4. Put scan images and revised text side by side in an HTML file, import into LibreOffice, run spellcheck, and proofread, with particular attention to paragraphs, italics, and punctuation.
  5. Finally, add HTML code and run text through home-brewed scripts to create XHTML file and epub-file.
I use Adobe Acrobat X Pro; I haven't tried any others, but it seems to do a decent job.
vimdiff isn't exactly user friendly, but when you've learnt the key combinations, it's darn fast, and carpal tunnel friendly.
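
The side-by-side file from step 4 could be generated along these lines. This is a minimal Python sketch, not the home-brewed script from the workflow: the function name `side_by_side_html`, the table layout and the fixed image width are all assumptions, and how well LibreOffice imports the result has not been checked.

```python
from html import escape


def side_by_side_html(pages):
    """pages: iterable of (image_path, text) pairs -> one HTML document as a string."""
    rows = []
    for image_path, text in pages:
        rows.append(
            "<tr>"
            f'<td><img src="{escape(str(image_path), quote=True)}" width="600" alt="scan"></td>'
            # escape() keeps characters such as < and & in the OCR text from breaking the page.
            f"<td><pre>{escape(text)}</pre></td>"
            "</tr>"
        )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Proofing sheet</title></head><body>\n<table>\n"
        + "\n".join(rows)
        + "\n</table>\n</body></html>\n"
    )
```

Calling it with a list such as `[("page001.png", text_of_page_1), ...]` and writing the returned string to a file gives one table row per page, with the scan on the left and the revised text on the right.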
I try to eliminate trivial differences between the scanned texts before diffing, in particular different lengths in initial spaces. The following regexps handle this:
Code:
1,$s/^ *\([a-z]\)/\1/
1,$s/^    *\([A-Z"']\)/\t\1/
1,$s/^ \([^ ]\)/\1/
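
For readers who do not use vim, the three substitutions can be approximated in Python roughly as follows. This is a sketch based on reading the regexps as they appear here (the second one taken to mean "three or more leading spaces"), applied in the same order to each line. `normalize_indent` is a made-up name.

```python
import re


def normalize_indent(line: str) -> str:
    """Reduce trivial differences in leading whitespace before diffing."""
    # 1. Drop any leading spaces in front of a lowercase letter.
    line = re.sub(r"^ *([a-z])", r"\1", line)
    # 2. Three or more leading spaces before a capital or a quote mark
    #    become a single tab.
    line = re.sub(r"^ {3,}([A-Z\"'])", r"\t\1", line)
    # 3. Drop a single stray leading space.
    line = re.sub(r"^ ([^ ])", r"\1", line)
    return line
```

Running both OCR outputs through the same function before diffing means that indentation quirks no longer show up as differences, so the diff is left with the real disagreements.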
Old 06-16-2012, 10:58 AM   #7
mrmikel
Book Twiddler
Posts: 2,086
Karma: 1444487
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
Woe

Quote:
Originally Posted by HarryT
All these tools can help but, at the end of the day, there's no substitute for human proofreading. The only way to properly proofread a text is to have the computer screen alongside the printed book and read them in parallel, word by word, comma by comma.
This is the woe of the e-book creator. By the time you go through a book creating it, you have had enough of it for a long time unless you are especially fond of it.
Old 06-16-2012, 05:02 PM   #8
SBT
Fanatic
Quote:
Originally Posted by mrmikel
This is the woe of the e-book creator. By the time you go through a book creating it, you have had enough of it for a long time unless you are especially fond of it.
A good point. I'm deeply impressed by HarryT's dedication, but for myself I'm satisfied as long as the number of remaining errors does not noticeably mar the reading experience. (I know, 'noticeable' is an unknown variable for each reader...)
To achieve this, I've tried to organize the proofreading workflow so that I can read the book through for a final proofreading without being sickeningly familiar with its contents, and without having to stop at every other sentence to tag a mistaek. After all, I'm supposed to be doing this for fun ....
Old 06-16-2012, 09:24 PM   #9
pholy
Booklegger
Posts: 1,798
Karma: 7999034
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
Quote:
After all, I'm supposed to be doing this for fun ....
Ahh, there's the difference. Harry is doing it for posterity! I do admire his dedication.
Old 06-16-2012, 10:16 PM   #10
JSWolf
Resident Curmudgeon
Posts: 37,614
Karma: 18390312
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Sony Reader PRS-650, iPad, nook STR
The only way to do a PDF and OCR conversion is to include a full A/B comparison in the workflow; if you don't, don't bother doing it at all.
Old 06-17-2012, 04:10 PM   #11
Iznogood
Guru
I know that proof-reading is a necessity, and that it must be done thoroughly if it is to be any good at all. The formatting of the book must also be done manually. The OCR program is no good at wrapping special parts of the text so that they wrap decently at various font and screen sizes.

But as my countryman SBT points out, it should be done for fun, and therefore the more errors are auto detected, the less interruption in the reading experience while proof-reading, and the more fun it is.

Besides, as a software man, I know that there are, and always will be, bugs in any file, whether software code or HTML pages. While proof-reading, you find and correct maybe 98% of these, but the remaining 2% go undetected. If you use some tool to find errors and highlight them, you might be able to find 99% of the errors.
Old 06-18-2012, 01:08 AM   #12
PeterT
Taking a break; Fed up
Posts: 6,932
Karma: 43999669
Join Date: Nov 2007
Location: Toronto
Device: Wife: Touch, Arc, Vox Me: Nexus 7, Glo
It might be overkill, but Project Gutenberg has an associated project, "Distributed Proofreaders", at http://www.pgdp.net/c/

Their approach is to display on the screen the scanned page in image format, and the OCR'ed text. They do make their entire system available at http://sourceforge.net/projects/dproofreaders/

Someone might be interested in running their own personal DP website and using it to handle the OCR validation side; yes I realize that this would still leave the markup to be done separately.
Old 06-18-2012, 03:24 AM   #13
SBT
Fanatic
I've wondered what's the best way of handling words split over lines when proofing OCR texts.
I use sed to get all of the word on one line, and then do interactive search&replace in an editor to remove soft hyphens.
I also use sed to automatically detect chapter headings and any subtitles, page headers, page numbers, and paragraphs.
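
The first step, getting all of a split word onto one line, might look something like this in Python. It is a sketch of the general idea rather than the sed commands used here, which are not shown in the thread. The name `join_split_words` is invented, and the pattern assumes the second half of a split word starts with a lowercase letter. By default the hyphen is kept so that soft hyphens can still be reviewed interactively afterwards.

```python
import re

# A word fragment ending in "-" at the end of a line, followed by a
# lowercase continuation (plus any trailing punctuation) on the next line.
_SPLIT_WORD = re.compile(r"(\w+)-\n[ \t]*([a-z]\S*)[ \t]*\n?")


def join_split_words(text: str, drop_hyphen: bool = False) -> str:
    """Pull the second half of a line-split word up onto the first line."""
    joiner = "" if drop_hyphen else "-"
    return _SPLIT_WORD.sub(
        lambda m: f"{m.group(1)}{joiner}{m.group(2)}\n", text
    )
```

Keeping the hyphen matters because the script cannot tell a soft hyphen ("con-tinued") from a real one ("proof-reading"). That decision is left to the interactive search and replace.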
Old 06-18-2012, 05:32 AM   #14
Doitsu
Wizard
Quote:
Originally Posted by SBT
I also use sed to automatically detect chapter headings and any subtitles, page headers, page numbers, and paragraphs.
Could you please post your sed script(s)?
Old 06-18-2012, 06:20 AM   #15
Iznogood
Guru
I also use sed and/or other tools for regex search and replace, but my methods are based on "heuristics" rather than scripts, because the output from the OCR program depends on the input. So in my opinion, detection of chapters is best done manually in each case, but once you have seen the pattern in the HTML file, you can batch search and replace for elements such as chapters, page breaks, etc.
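
To give an idea of what such a batch search and replace could look like once the pattern is known, here is a small Python sketch. Both patterns are hypothetical and would have to be adapted to each book: it assumes chapter headings sit on a line of their own as "Chapter" plus a number or Roman numeral, and that a line containing only digits is a page number. `mark_up` is a made-up name.

```python
import re

# Hypothetical pattern: a line consisting only of "Chapter" plus a number
# or Roman numeral (case-insensitive).
CHAPTER_LINE = re.compile(
    r"^[ \t]*(chapter[ \t]+(?:\d+|[ivxlc]+))[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Hypothetical pattern: a line consisting only of digits, taken to be a page number.
PAGE_NUMBER_LINE = re.compile(r"^[ \t]*\d{1,4}[ \t]*\n", re.MULTILINE)


def mark_up(text: str) -> str:
    """Strip bare page-number lines, then wrap chapter lines in <h2> tags."""
    text = PAGE_NUMBER_LINE.sub("", text)
    return CHAPTER_LINE.sub(lambda m: f"<h2>{m.group(1)}</h2>", text)
```

A mention of "chapter 3" in the middle of a sentence is left alone, since the pattern only matches whole lines. Anything the patterns miss or catch wrongly still has to be fixed by hand.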
Tags
ocr, proof-reading