Old 11-29-2012, 09:05 AM   #35
ath
Addict
Posts: 222
Karma: 110
Join Date: Jun 2006
Location: Malmo, Sweden
Device: iLiad, Sony PRS-505, Kindle Paperwhite & Oasis
Quote:
Originally Posted by neuvivlio View Post
how could i take this roughly scanned book, and convert the text into nice clean, legible text?
No one seems to have said much about process, so let me add a few points.

Transcription. This is basically what OCR does, but you need to ensure that it does the right thing. (Look at recent Kindle versions of Iain M. Banks novels, particularly on pages where conversation between Minds is presented: the printed books use indentation in a way similar to e-mail; many eBooks make this into a mess.) That is, you need to look at the typographical presentation in the original, and decide how you plan to represent it in the resulting text: italics, small caps, quotations, anything.

OCR may mess up end-of-line hyphenation, so decide how you want to handle it: keep it as is, or 'restore' the hyphenated words?
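A rough sketch of the 'restore' option in Python, assuming the OCR output is plain text with hard line breaks. The function name `rejoin_hyphenated` and the `known_words` set are made up for illustration; in practice the set would come from whatever word list you trust. The idea: drop the hyphen only when the joined form is a known word, otherwise keep it, since it may be a real hyphen that just happened to fall at the end of a line.

```python
import re

# A word fragment ending in a hyphen at the end of a line, followed by
# its continuation at the start of the next line.
_EOL_HYPHEN = re.compile(r"(\w+)-\n(\w+)")

def rejoin_hyphenated(text, known_words):
    """Rejoin words split across a line break.

    known_words is a hypothetical set of lowercase words. If the joined
    form is in it, the hyphen is dropped; otherwise the hyphen is kept.
    Either way the line break inside the word is removed.
    """
    def fix(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if joined.lower() in known_words:
            return joined
        return head + "-" + tail
    return _EOL_HYPHEN.sub(fix, text)

sample = "The tran-\nscription kept a well-\nknown error."
print(rejoin_hyphenated(sample, {"transcription"}))
```

Anything the word list does not know stays hyphenated, so those cases still need a human eye during proofreading.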

Proofreading of transcription. Exactly what it sounds like. Proofread in a 'good' typeface, where you can clearly see the differences between '1', 'I', and 'l', and so on. (I like Palatino and related faces.) Also look for invisible things: two spaces in a row, confusion between 'O' and '0', dashes of the wrong length, hard line breaks coinciding with visual line breaks, etc. If you worked with your text in a Word-like environment, check paragraph and character formatting, as OCR-produced formats can be off a few points here and there. Spell-checking tends to go here, but you may need to take the original text into account: old texts don't always use modern spelling.
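Some of the 'invisible' problems can be flagged mechanically before you start reading. Below is a minimal sketch in Python; the check names and patterns are my own guesses at useful heuristics, not a complete list, and every hit still needs to be compared with the original page.

```python
import re

# Hypothetical heuristics; extend or drop to suit the text at hand.
CHECKS = {
    "double space": re.compile(r"  +"),
    "zero inside a word": re.compile(r"(?<=[A-Za-z])0|0(?=[A-Za-z])"),
    "letter O inside a number": re.compile(r"(?<=\d)O|O(?=\d)"),
    "digit 1 inside a word": re.compile(r"(?<=[a-z])1(?=[a-z])"),
    "hyphen used as dash": re.compile(r" - "),
}

def find_suspects(text):
    """Return (line_number, label, matched_text) for each suspicious spot."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            for match in pattern.finditer(line):
                hits.append((number, label, match.group(0)))
    return hits

sample = "He paid 1O crowns.\nTwo  spaces and a w0rd - here."
for number, label, found in find_suspects(sample):
    print(number, label, repr(found))
```

This only points at candidates. It cannot tell a wrong 'I' from a right one, which is why the typeface advice above still matters.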

Layout and formatting. This is where you do your own stuff.

Just don't imagine you won't need to proofread everything again: a book (the result of a publishing process) needs to 'work' in different ways than a text (the result of a transcription process). For example, you may need to add discretionary hyphens and no-break spaces in critical places to get what you want.
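As an illustration of the last point, here is a small Python sketch that inserts the two Unicode characters involved: U+00A0 NO-BREAK SPACE and U+00AD SOFT HYPHEN. The helper names, the list of titles, and the `break_points` mapping are assumptions made for the example; which places count as 'critical' is an editorial decision.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE
SHY = "\u00ad"   # SOFT HYPHEN (discretionary hyphen)

# A title followed by a space and a capitalised name.
_TITLE = re.compile(r"\b(Mr|Mrs|Dr)(\.?) (?=[A-Z])")

def protect_titles(text):
    """Keep a title and the following name on the same line."""
    return _TITLE.sub(lambda m: m.group(1) + m.group(2) + NBSP, text)

def add_soft_hyphens(text, break_points):
    """Insert soft hyphens into selected words.

    break_points is a hypothetical mapping from a word to the same word
    with '-' marking the allowed breaks, e.g. 'typo-graph-ical'.
    """
    for word, marked in break_points.items():
        replacement = marked.replace("-", SHY)
        text = re.sub(r"\b" + re.escape(word) + r"\b",
                      lambda m, r=replacement: r, text)
    return text

print(repr(protect_titles("Dr. Carey met Mr Moreton.")))
print(repr(add_soft_hyphens("a typographical detail",
                            {"typographical": "typo-graph-ical"})))
```

Whether a given reader honours soft hyphens varies, so check the result on the devices you care about.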

If you are really careful, you do copy-editing as well, verifying that your source spells and hyphenates words consistently throughout, and perhaps even changing old-fashioned 'Mr.' and 'Dr.' etc. to a more modern style (without the periods), and so on.
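The hyphenation part of that consistency check is easy to get a first report on. A minimal Python sketch, with `hyphenation_variants` being a name invented here: it groups word forms that differ only by hyphens and shows how often each form occurs, so you can decide which one the book should use.

```python
import re
from collections import Counter, defaultdict

def hyphenation_variants(text):
    """Group words that differ only by hyphens, e.g. 'to-day' and 'today'.

    Returns {unhyphenated_form: {variant: count, ...}} for every word
    that appears in more than one form. Case is ignored.
    """
    words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    counts = Counter(word.lower() for word in words)
    groups = defaultdict(dict)
    for word, count in counts.items():
        groups[word.replace("-", "")][word] = count
    return {key: forms for key, forms in groups.items() if len(forms) > 1}

sample = "To-day he left. Today she stayed; to-day it rained."
print(hyphenation_variants(sample))
```

It does not catch open forms written as two words ('to day'), and it says nothing about which variant is right for the period of the source.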

Of course, it all depends on what you plan. If you don't plan for anyone else to read the result, do what you like. But if you do plan for other readers ...

I recently read a reissue of a novel by Eric Ambler (The Schirmer Inheritance) on the Kindle. I am still dismayed by the number of OCR errors that had been allowed to remain in the text. Some were obvious: 'li' where the original had an 'h', 'rn' where the original had an 'm', and so on. And the number of these increased as the book progressed: the proofreaders probably got tired ... It's not an eBook I will return to.