MobileRead Forums - View Single Post

Iznogood · 06-07-2011, 08:06 AM

This post is a little bit off topic, but I'll take the chance of hijacking my own thread to ask a few questions about the step after OCR, namely proofreading. If the moderators find this inappropriate, I ask for forgiveness and am fine with it being moved to a new thread.

I like ABBYY's interface for proofreading, where I can put the cursor in the text document and the program highlights the corresponding character in the scanned image, so that I can easily check whether ABBYY recognized that character correctly (image attached).

I have a few books that I really, really want to have in digital form, and I am willing to do a good deal of work to get the results as near to perfect as possible for these books. The first book is Death-Watch, which I am doing as an experiment.

My current workflow:

1. Scan the pages from the book, two pages per image file. I use ABBYY to do this, scanning in automatic mode with no preprocessing or page splitting allowed. I only scan the area of the flatbed the book occupies, so no trimming is necessary. I scan at 600 dpi grayscale. Covers and such are scanned at 600 dpi colour.
2. Store the images on disk as PNG.
3. Make a copy of the source images. Correct white balance and do manual preprocessing. This is done in Windows Picture Manager, since it can do this in batch. I do this on some pages to improve the chances of successful OCR.
4. Load the preprocessed images into FineReader. Preprocessing and image splitting are now allowed.
5. Run OCR.
6. Read through the recognized text and compare it with the source image in ABBYY's proofreader. I read through the entire book this way, correcting errors and inserting marks for later search and replace. (ABBYY does not support curly quotation marks, so I add an asterisk to mark which quotation mark is left and which is right, "quotation text" => *"quotation text"*, and later search and replace *" and "* with the appropriate curly quotation marks.) I know that this can be done with regex afterwards, but I feel more in control of the quotation marks this way.
7. Export each recognized page to its own HTML file. I then have a script that runs through all the files, inserts <span class="newpage" id="pageXXX"/> at every page break (not necessary, but I like it this way), and merges all the files into one big HTML file. After that, a batch search and replace cleans up quotation marks, unwanted markup, and similar.
8. Split the HTML into chapters, fix formatting, check that all paragraphs have been detected correctly (by comparing every paragraph against the source images), and complete the EPUB file.
9. Read the book on my reader, comparing it to the printed book. This is the best step of them all.
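
The merge in step 7 and the quotation-mark replacement from step 6 could look roughly like the following. This is only a minimal Python sketch, not the actual script: the function names `merge_pages` and `curl_marked_quotes` are made up for illustration, the pages are assumed to arrive as a list of HTML fragments in reading order, and `pageXXX` is assumed to mean a zero-padded three-digit page number.

```python
def merge_pages(pages):
    """Join per-page HTML fragments, putting a page-break marker before each one.

    `pages` is assumed to be a list of HTML strings, one per recognized page,
    already in reading order.
    """
    parts = []
    for number, body in enumerate(pages, start=1):
        # Assumption: "pageXXX" is a zero-padded, three-digit page number.
        parts.append(f'<span class="newpage" id="page{number:03d}"/>')
        parts.append(body.strip())
    return "\n".join(parts)


def curl_marked_quotes(text):
    """Replace the hand-inserted markers *" and "* with curly quotation marks."""
    text = text.replace('*"', "\u201c")   # *"  becomes a left double quote
    return text.replace('"*', "\u201d")   # "*  becomes a right double quote


if __name__ == "__main__":
    merged = merge_pages(['<p>*"Hello,"* he said.</p>', "<p>Next page.</p>"])
    print(curl_marked_quotes(merged))
```

Any straight quotation mark without an asterisk next to it is left untouched, so whatever was missed in step 6 is still easy to search for afterwards.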

My goal is to get the results as perfect and as close to the original as possible. Even the typos of the original and the page breaks (span class="newpage" with the property display:none) are lovingly preserved. And, as a nitpicker, I also require good markup, and no computer tool is good at applying proper tagging. ABBYY recognizes all headlines as <p>. To get them stored as <hx>, I have to do that manually, either in the ABBYY editor or in the HTML file. I prefer the latter.
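
For books where the headlines follow a fixed pattern, part of that manual retagging could in principle be scripted. The sketch below is only an illustration under loud assumptions: it assumes a chapter headline is a paragraph containing nothing but the word "Chapter" and an Arabic or Roman number, and it assumes level 2 is the right heading level. The name `promote_headings` is hypothetical, and any headline that does not fit the pattern still has to be fixed by hand.

```python
import re

# Assumption: a headline is a <p> (with or without attributes) that contains
# only "Chapter" followed by a number written in digits or Roman numerals.
HEADING = re.compile(
    r"<p(?:\s[^>]*)?>\s*(chapter\s+[0-9ivxlc]+)\s*</p>",
    re.IGNORECASE,
)


def promote_headings(html, level=2):
    """Rewrite matching paragraphs as <hN> elements, dropping <p> attributes."""
    return HEADING.sub(
        lambda m: f"<h{level}>{m.group(1)}</h{level}>",
        html,
    )


if __name__ == "__main__":
    sample = '<p class="x">CHAPTER IV</p>\n<p>It was a dark night.</p>'
    print(promote_headings(sample))
```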

Nevertheless, I find steps 1 to 9 above very time consuming. Steps 6 to 8 in particular seem unnecessarily redundant, as far as I can see. First I proofread the text, and then I have to go through it all again in the HTML file to ensure proper markup and formatting. I only do step 6 because it is much easier to do the comparison in ABBYY than to have two open windows on my desktop and manually compare the HTML code and the source image. More errors escape me with two windows.

I would like to do this in one step: export directly to HTML, split the HTML code into chapters, create the .ncx and .opf files, cover files and so on, and only then proofread the HTML code against the source image.
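
The chapter split and the start of the .opf file could be sketched like this. It is only an outline under assumptions: it assumes every chapter starts at an <h2>, it uses made-up file names (ch01.xhtml, ch02.xhtml, ...), and it only produces the manifest and spine lines, not a complete .opf or .ncx. The function names are hypothetical. Splitting on a zero-width match needs Python 3.7 or newer.

```python
import re


def split_chapters(html):
    """Split the merged HTML before every <h2>, keeping each heading with its text.

    Anything before the first <h2> (front matter) becomes its own piece.
    """
    pieces = re.split(r"(?=<h2\b)", html)
    return [piece.strip() for piece in pieces if piece.strip()]


def manifest_and_spine(count):
    """Build OPF <item> and <itemref> lines for `count` chapter files."""
    items, refs = [], []
    for n in range(1, count + 1):
        cid = f"ch{n:02d}"  # hypothetical naming scheme: ch01, ch02, ...
        items.append(
            f'<item id="{cid}" href="{cid}.xhtml" '
            'media-type="application/xhtml+xml"/>'
        )
        refs.append(f'<itemref idref="{cid}"/>')
    return items, refs


if __name__ == "__main__":
    book = "<h2>One</h2>\n<p>a</p>\n<h2>Two</h2>\n<p>b</p>"
    chapters = split_chapters(book)
    items, refs = manifest_and_spine(len(chapters))
    print("\n".join(items + refs))
```

Because the page-break spans stay inside the chapter pieces, each chapter file still records which printed pages it covers.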

Does anybody know of a program that can keep the source image (either as images or as searchable PDF files with the recognized text under the image) in sync with the HTML, so that I don't have to read the HTML code and then lose my position in it before I find the corresponding position in the source and can compare them? The same feature as ABBYY's proofreader would be nice: highlighting of the character or word currently marked in the code. ABBYY would be perfect for proofreading if it allowed markup changes and regex search and replace, but I fear it is a waste of time to proofread before the final version, i.e. before quotation marks, markup and so on are corrected.

Anybody with tips? How do you professionals perform proofreading?

Post #12. Last edited by Iznogood; 06-07-2011 at 08:36 AM. Reason: Typo plus missing attachment