07-15-2016, 04:36 AM   #26
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Jellby
DP has a list of some words that will not be detected by a spell checker, but are most probably OCR errors (scannos), among them the infamous "arid" (for and) and "modem" (for modern):

http://www.pgdp.net/c/faq/wordcheck-...ite_word_lists
Thanks for this link. You always seem to post it, and I always seem to forget about it. I should try to embed it into my brain.

Quote:
Originally Posted by AlexBell
Thanks, Tex2002an, #22. I'm afraid I haven't kept a record. As I remember many of them were , instead of . and vice versa, and I instead of ! and vice versa. But many of them just shouldn't have been there at all.
Ahh, that is too bad. Does nobody else save all the versions of the file as they work on them?

I tend to mark all of my files with [YYYY.MM.DD] and just save them as I go along. That way, I can later run file comparison (diff) tools on the EPUBs and see exactly what changed between versions.
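
Here is a rough sketch of that kind of comparison in Python, in case it helps. It relies on the fact that an EPUB is a ZIP archive, and it only looks at text files inside it. The function name `diff_epubs` and the list of file extensions are my own assumptions, not part of any tool.

```python
import difflib
import zipfile

# Extensions treated as text; an assumption, adjust to your own books.
TEXT_EXTS = (".xhtml", ".html", ".htm", ".css", ".opf", ".ncx")


def _read_lines(archive, name, names):
    """Return the lines of a file inside the archive, or [] if it is absent."""
    if name not in names:
        return []
    return archive.read(name).decode("utf-8", errors="replace").splitlines()


def diff_epubs(old_path, new_path):
    """Return unified-diff lines for every text file that differs between two EPUBs."""
    out = []
    with zipfile.ZipFile(old_path) as old, zipfile.ZipFile(new_path) as new:
        old_names = set(old.namelist())
        new_names = set(new.namelist())
        for name in sorted(old_names | new_names):
            if not name.lower().endswith(TEXT_EXTS):
                continue
            a = _read_lines(old, name, old_names)
            b = _read_lines(new, name, new_names)
            out.extend(
                difflib.unified_diff(
                    a, b, fromfile=f"old/{name}", tofile=f"new/{name}", lineterm=""
                )
            )
    return out
```

Usage would be something like `print("\n".join(diff_epubs("Book [2016.07.14].epub", "Book [2016.07.15].epub")))`, where both filenames are made up for illustration.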

Quote:
Originally Posted by AlexBell
The pdf originals from which the ePub files I used were made were of quite poor quality - though that's no excuse.
Can you link to the Archive.org versions you used + your completed EPUB?

Side Note: Here are a few common OCR errors I ran into tonight:

oŁ -> of
tbe -> the
lias -> has
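
If you want to hunt for these in bulk, here is a minimal Python sketch that only reports suspects and leaves the fixing to you, since something like "lias" could in principle be a real word in some book. The name `find_scannos` is made up, the word list is just the three examples above, and matching is case-sensitive.

```python
import re

# Scannos from the list above; extend with your own findings.
SCANNOS = {"oŁ": "of", "tbe": "the", "lias": "has"}

# Whole-word match only, so "alias" or "tbemes" are not flagged.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SCANNOS)) + r")\b")


def find_scannos(text):
    """Return (line_number, found_word, suggestion) for each suspect word."""
    hits = []
    for n, line in enumerate(text.splitlines(), start=1):
        for m in _PATTERN.finditer(line):
            hits.append((n, m.group(1), SCANNOS[m.group(1)]))
    return hits
```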

Roman Numeral Problems with the "V" OCRing as "Y":

Chapter XY -> Chapter XV
Chapter Y -> Chapter V
Chapter XYI -> Chapter XVI
CHAPTER XXIY -> CHAPTER XXIV
CHAPTER XXYI -> CHAPTER XXVI
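
The Roman numeral cases can be sketched as a regex pass in Python. This assumes the heading always starts with the word "Chapter" (any capitalization) followed by an uppercase numeral, and the name `fix_chapter_numerals` is my own. A book that really has a "Chapter Y" would be mangled, so check the results.

```python
import re

# "chapter" is matched case-insensitively; the numeral must be uppercase
# and contain at least one Y among otherwise valid Roman numeral letters.
_HEADING = re.compile(r"\b((?i:chapter)\s+)([IVXLCY]*Y[IVXLCY]*)\b")


def fix_chapter_numerals(text):
    """Replace a misread Y with V inside chapter-heading Roman numerals."""
    return _HEADING.sub(
        lambda m: m.group(1) + m.group(2).replace("Y", "V"), text
    )
```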

Punctuation Errors (em dash + hyphen):

—- -> —
-— -> —
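
As a sketch, both dash patterns can be collapsed with one regex in Python. The name `fix_dash_hyphen` is hypothetical, and I am assuming that a hyphen touching an em dash is never intentional in your text.

```python
import re

EM_DASH = "\u2014"

# One or more hyphens glued to either side of an em dash.
_DASH_HYPHEN = re.compile(f"-+{EM_DASH}-*|{EM_DASH}-+")


def fix_dash_hyphen(text):
    """Collapse hyphen + em dash combinations into a single em dash."""
    return _DASH_HYPHEN.sub(EM_DASH, text)
```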

You may also want to look out for hyphens followed by a space. These need to be decided on a case-by-case basis, because many of them are valid. For example: "This is a one- or two-hyphen error." In many other cases, it is a badly recognized soft hyphen (end of line or end of page), a speck of dust, or an actual OCR error.
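
Since these need a human decision, a search that only lists the candidates is probably the safest approach. Here is a minimal Python sketch (the name `find_hyphen_space` is made up) that returns each hit with its line number for manual review:

```python
import re

# A word, a hyphen, one or more spaces, then another word.
_HYPHEN_SPACE = re.compile(r"\w+- +\w+")


def find_hyphen_space(text):
    """Return (line_number, matched_text) for each hyphen followed by a space."""
    return [
        (n, m.group(0))
        for n, line in enumerate(text.splitlines(), start=1)
        for m in _HYPHEN_SPACE.finditer(line)
    ]
```

Note that it flags the valid "one- or" just as readily as a broken "impor- tant", which is the whole point: you decide.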

You may also want to make a pass looking for <sup> or <sub> tags. Sometimes OCR just goes crazy and inserts these into the text.
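
A quick-and-dirty way to list them, sketched in Python with a regex. Regex over HTML is crude and will not cope with nested tags, so treat this as a first pass only. The name `find_sup_sub` is my own invention.

```python
import re

# Matches <sup>...</sup> and <sub>...</sub>, with optional attributes.
_SUP_SUB = re.compile(r"<(sup|sub)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def find_sup_sub(xhtml):
    """Return (tag_name, inner_text) for each <sup>/<sub> element found."""
    return [(m.group(1).lower(), m.group(2)) for m in _SUP_SUB.finditer(xhtml)]
```

Legitimate footnote markers and chemical formulas will show up in the same list, so skim the output rather than deleting everything it finds.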

Last edited by Tex2002ans; 07-15-2016 at 04:48 AM.