Old 10-25-2011, 10:50 AM   #97
Elfwreck
Grand Sorcerer
Sample raw conversion

Quote:
Originally Posted by Hitch
And, just for s&g's, I've asked several clients if I may use some of their pages here, for demonstration purposes (I do not know if I will obtain permission)--these clients had scan & OCR. I'm asking them if I may post 1-2 original pages from a PDF, and the resulting RAW scanned output;

I have samples. I do ebook conversions of public domain work that Gutenberg doesn't have, and a few other things.

This is a page from "Tales of Hoffman - Trial of the Chicago 8 7", which is not in the public domain, but the majority of the text is, because trial transcripts are public domain. I figure that posting one page for educational purposes falls well within fair use, for the thirty words that may not be part of the transcript. (It's possible all of it is transcript.)

The PDF was scanned at 400dpi in Acrobat Pro (which isn't the best, but is tolerable); the Word doc is auto-read in Finereader 7, after removing the page number. For this one, read quality's great; line breaks and the separating asterisks are the big problem.
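For anyone who wants to script the line-break cleanup rather than do it by hand, here's a rough sketch in Python. It is not what I actually use (I fix things in the OCR program); the function name `unwrap_ocr_text` and the rules in it are my own assumptions: blank lines mark paragraph breaks, a line ending in a hyphen continues on the next line, and a line made up only of asterisks is a section separator.

```python
import re

# Assumption: a separator is a line containing only asterisks and spaces.
SEPARATOR = re.compile(r"(\*\s*)+")


def unwrap_ocr_text(raw: str) -> str:
    """Join hard-wrapped OCR lines into paragraphs (hypothetical helper)."""
    paragraphs = []
    current = []  # pieces of the paragraph being assembled

    def flush():
        # Close the paragraph in progress, if there is one.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            # Blank line: paragraph break.
            flush()
        elif SEPARATOR.fullmatch(stripped):
            # Asterisk row: keep it as its own normalized block.
            flush()
            paragraphs.append("* * *")
        elif current and current[-1].endswith("-"):
            # Rejoin a word split at the line end. This also merges real
            # hyphenated compounds, so the result still needs proofreading.
            current[-1] = current[-1][:-1] + stripped
        else:
            current.append(stripped)

    flush()
    return "\n\n".join(paragraphs)


sample = (
    "first line of a para-\n"
    "graph that was\n"
    "wrapped by the scanner\n"
    "\n"
    "***\n"
    "\n"
    "next paragraph"
)
print(unwrap_ocr_text(sample))
```

This only handles the mechanical part. It can't tell a wrapped line from a deliberately short one (speaker labels in a transcript, verse, tables), so those still have to be checked against the scan.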

Second sample is from "Magic and Fetishism," a public domain work available through Archive.org.

This one has more obvious problems: extra punctuation caused by dots on the page, foreign words that are mostly misspelled, and punctuation that is often wrong. And this is a good, clear scan of text that isn't tightly condensed.

Next sample: from Inglis' "Principles of Secondary Education," another PD book. This one's a nightmare for conversion; lots of tiny text in charts & tables.

I don't do most of the corrections in Word; I do them in Finereader, where I can see the text next to the scans, but that's not always an option.
Attached Files
File Type: pdf pg 85-Tales of Hoffman.pdf (56.2 KB, 170 views)
File Type: doc Pg 85 Hoffman RAW.doc (6.0 KB, 167 views)
File Type: pdf pg 64-66 Magic & Fetishism.pdf (1.32 MB, 177 views)
File Type: doc Pg 64-66 Magic & Fetishism.doc (6.9 KB, 173 views)
File Type: pdf INGLIS-2ary Educ-sample.pdf (274.3 KB, 162 views)
File Type: doc Inglis-Principles 2ary Educ-sample.doc (89.9 KB, 135 views)