Quote:
Originally Posted by Jellby
|
Thanks for this link, you always seem to post it, and I always seem to forget about it. I should try to embed this into my brain.
Quote:
Originally Posted by AlexBell
Thanks, Tex2002an, #22. I'm afraid I haven't kept a record. As I remember many of them were , instead of . and vice versa, and I instead of ! and vice versa. But many of them just shouldn't have been there at all.
|
Ahh, that is too bad. Does nobody else save all the versions of the file as they work on them?
I tend to mark all of my files with [YYYY.MM.DD] and just save them as I go along. That way, I can later run a comparison (diff) tool on the EPUBs to see exactly what changed between versions.
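For anyone who wants to try the comparison idea, here is a rough sketch in Python. It assumes the EPUBs are ordinary zip archives and only looks at the text-like files inside; the function names and the list of file extensions are my own choices, not part of any tool.

```python
import difflib
import zipfile

# Extensions treated as text; an assumption, adjust for your own books.
TEXT_EXTENSIONS = (".xhtml", ".html", ".htm", ".opf", ".ncx", ".css")


def read_epub_texts(path):
    """Return {member name: list of lines} for the text-like files in an EPUB."""
    texts = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(TEXT_EXTENSIONS):
                data = archive.read(name).decode("utf-8", errors="replace")
                texts[name] = data.splitlines()
    return texts


def diff_epubs(old_path, new_path):
    """Return unified-diff lines for every text file that differs."""
    old = read_epub_texts(old_path)
    new = read_epub_texts(new_path)
    lines = []
    for name in sorted(set(old) | set(new)):
        lines.extend(difflib.unified_diff(
            old.get(name, []), new.get(name, []),
            fromfile=f"{old_path}:{name}", tofile=f"{new_path}:{name}",
            lineterm=""))
    return lines
```

Printing the result of `diff_epubs(older_file, newer_file)` line by line should show each correction as a `-` line followed by a `+` line. A dedicated diff tool will be nicer to read; this is only meant to show that dated copies make the comparison possible at all.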
Quote:
Originally Posted by AlexBell
The pdf originals from which the ePub files I used were made were of quite poor quality - though that's no excuse.
|
Can you link to the Archive.org versions you used + your completed EPUB?
Side Note: Here are a few common OCR errors I ran into tonight:
oŁ -> of
tbe -> the
lias -> has
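If you would rather script these than search for them one by one, something like the following Python sketch might do. The table holds only the three errors above, the names are made up for illustration, and it matches whole lowercase words only, so "Tbe" at the start of a sentence would need its own entry.

```python
import re

# Only the three errors listed above; extend with your own finds.
OCR_FIXES = {"oŁ": "of", "tbe": "the", "lias": "has"}

# \b keeps the match to whole words, so "alias" is left alone.
OCR_WORD = re.compile(r"\b(" + "|".join(map(re.escape, OCR_FIXES)) + r")\b")


def fix_common_ocr_words(text):
    """Replace each known OCR misreading with its correction."""
    return OCR_WORD.sub(lambda m: OCR_FIXES[m.group(1)], text)
```

Be careful with "lias": it is also a real (if rare) word, so it is worth checking each hit before accepting the change.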
Roman Numeral Problems with the "V" OCRing as "Y":
Chapter XY -> Chapter XV
Chapter Y -> Chapter V
Chapter XYI -> Chapter XVI
CHAPTER XXIY -> CHAPTER XXIV
CHAPTER XXYI -> CHAPTER XXVI
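The Roman numeral fixes follow a pattern, so a regex can probably catch most of them. This Python sketch assumes the numeral directly follows the word "Chapter" or "CHAPTER" and is written in capitals; the function name is just for illustration.

```python
import re

# "Chapter" or "CHAPTER", whitespace, then a run of Roman numeral
# letters in which a "Y" may have been read in place of a "V".
CHAPTER_NUMERAL = re.compile(r"\b(Chapter|CHAPTER)(\s+)([IVXLCY]+)\b")


def fix_chapter_numerals(text):
    """Turn a misread Y back into V inside chapter numerals."""
    def repl(match):
        word, space, numeral = match.groups()
        return word + space + numeral.replace("Y", "V")
    return CHAPTER_NUMERAL.sub(repl, text)
```

With this, "Chapter XYI" becomes "Chapter XVI", while a phrase such as "Chapter You" is not touched because the numeral has to be a whole word.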
Punctuation Errors (em dash + hyphen):
—- -> —
-— -> —
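Both punctuation cases can be handled in a single pass. In this Python sketch the em dash is written as the escape `\u2014` so it cannot be confused with a hyphen in the editor; the function name is my own.

```python
import re

EM_DASH = "\u2014"

# An em dash with any number of stray hyphens on either side.
DASH_WITH_HYPHENS = re.compile("-*" + EM_DASH + "-*")


def fix_dash_hyphen(text):
    """Collapse an em dash plus adjacent hyphens into a single em dash."""
    return DASH_WITH_HYPHENS.sub(EM_DASH, text)
```

Ordinary hyphenated words are not affected, since the pattern only matches where an em dash is present.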
You may also want to look out for hyphens followed by a space. This needs to be decided on a case-by-case basis, because many of these are valid. For example, "This is a one- or two-hyphen error." is correct as written. In many other cases the hyphen is a badly recognized soft hyphen (at the end of a line or page), a speck of dust, or an actual OCR error.
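Since these need a human decision, a script should only list the candidates rather than change them. A minimal Python sketch, with a hypothetical function name:

```python
import re

# A word ending in a hyphen, one or more spaces, then the next word.
HYPHEN_SPACE = re.compile(r"\w+- +\w+")


def find_hyphen_space(text):
    """Return every 'word- word' candidate for manual review."""
    return [m.group(0) for m in HYPHEN_SPACE.finditer(text)]
```

On the example sentence this reports "one- or" (which is valid), and on a broken word such as "impor- tant" it reports that too, so you still have to read each hit.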
You may also want to make a pass looking for <sup> or <sub> tags. Sometimes OCR just goes crazy and inserts these into the text.
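A quick way to make that pass is to list every tag along with what it wraps, then judge by eye which ones are real (footnote markers, chemical formulas) and which are OCR noise. A Python sketch, assuming the tags are not nested inside each other:

```python
import re

# Opening <sup> or <sub> (attributes allowed), its content, and the
# matching closing tag. Nested tags are not handled.
SUP_SUB = re.compile(r"<(sup|sub)\b[^>]*>(.*?)</\1>",
                     re.IGNORECASE | re.DOTALL)


def find_sup_sub(html):
    """Return (tag name, enclosed text) pairs for manual review."""
    return SUP_SUB.findall(html)
```

A regex is not a real HTML parser, so treat the output as a list of places to look rather than a complete inventory.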