06-10-2021, 10:36 PM   #6
Tex2002ans
Quote:
Originally Posted by Greg Anos
Let me fill you in on the background.
Great, thanks for more background information.

It sounds like the core issue is your very first OCR step:

Your Brother OCR is outputting crappy text, so now you're trying to come up with a whole complicated workflow to try to correct THAT mess.

Think of OCR as the foundation of a building.

If we adjust that very first step + do it properly, every step after will be easier.

* * *

To come up with a better workflow...

First, a few questions:

1. Do you have access to Microsoft Word?

2. Do you still have access to your old copy of Finereader 9?

Notes: If you have Microsoft Word, Toxaris's "EPUB Tools" makes DOCX/RTF OCR cleanup infinitely easier.

If you have Finereader, I can give Finereader-specific instructions.

If no Finereader, then I'd recommend Tesseract (Open Source OCR) instead of your Brother OCR program. (They probably rebranded/based theirs off an outdated version of Tesseract.)

Quote:
Originally Posted by Greg Anos
If you use TXT, goodbye to all you bold, italics, ect. BUT removing the line feed character is a piece of cake for a hex editor.
No, no, no. DO NOT go with TXT.

The underlying formatting (bold/italics/superscript) is just as important as the text itself.

Why? I've written about this in detail back in:

Side Note: Linefeeds are also very easy to remove in RTF, DOCX, TXT, etc. You can use "Advanced Search" or Regular Expressions.

Usually:

\r = Carriage Return
\n = Line Feed
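
As a rough sketch of that kind of linefeed removal (in Python rather than a word processor's "Advanced Search"), something like the following would do it. It assumes paragraphs are separated by a blank line; the function name and the sample text are made up for illustration:

```python
import re

def unwrap_hard_breaks(text):
    """Join lines inside a paragraph; keep blank lines as paragraph breaks."""
    # Normalize Windows (CR+LF) and bare CR line endings to a single LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Replace a lone LF (no LF directly before or after it) with a space.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)

# Hypothetical OCR output with a hard break in the middle of a sentence.
sample = "This line was broken\r\nby the OCR program.\r\n\r\nNew paragraph here."
print(unwrap_hard_breaks(sample))
```

This is the TXT case only. In RTF or DOCX you'd run the equivalent search inside the editor, so the bold/italics survive.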

I also use Regular Expressions like:

Search: </p>\s+<p>([a-z])

to search for paragraph breaks that start with a lowercase letter.
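
Here's a minimal sketch of that same search run from Python. The search pattern is the one above. The replacement (a single space plus the captured letter) is my assumption about what you'd usually want, so review each hit before accepting it. The function names and sample HTML are made up:

```python
import re

# Search pattern from the post: a paragraph break followed by a lowercase letter.
BROKEN_PARAGRAPH = re.compile(r"</p>\s+<p>([a-z])")

def count_suspect_breaks(html):
    """Count paragraph breaks that are probably mid-sentence."""
    return len(BROKEN_PARAGRAPH.findall(html))

def merge_suspect_breaks(html):
    """Assumed fix: glue the two paragraphs together with one space."""
    return BROKEN_PARAGRAPH.sub(r" \1", html)

# Hypothetical OCR output where one paragraph was split in two.
sample = "<p>The sample was taken</p>\n<p>from the river.</p>\n<p>Next paragraph.</p>"
print(count_suspect_breaks(sample))
print(merge_suspect_breaks(sample))
```

A lowercase start is only a hint. Some hits will be legitimate paragraphs, so a blind "Replace All" is risky.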

No need to go crazy with hex editing files in order to locate/eliminate this stuff.

Quote:
Originally Posted by Greg Anos
My OCR output choices are

TXT
RTF
HTML
XML

Which one would you rather use - to glue together 40 separate pages of text, and make them reflowable?
Out of that bad selection, RTF would most likely work best.

If you have access to Finereader 9, DOCX will be better.

Finereader 10 introduced EPUB output (which is what I used for many years, but now I swear by DOCX -> Toxaris's EPUB Tools).

If you don't have Finereader, then like I said above, probably best to use Tesseract. From there, you'd be able to output better/cleaner files.

Quote:
Originally Posted by Greg Anos
I'm still missing 2 characters - a "c" with a cap on it Think the letter V (upside down, with the point of the v at the top) on the "c", and an "a" with a tilde; you know, the squiggle line over the n in Spanish. (it's a Portuguese feature.)
The "upside down v" is called a Circumflex.

The "squiggle" is called a Tilde.

The "two dots" above is called an Umlaut (or Diaeresis).

(I linked to the fantastic Wikipedia articles on them; they give you nicely organized lists of the letters with accents!)

And here are those 3 characters you mentioned, in Unicode:

ã = U+00E3 = LATIN SMALL LETTER A WITH TILDE
ĉ = U+0109 = LATIN SMALL LETTER C WITH CIRCUMFLEX
ö = U+00F6 = LATIN SMALL LETTER O WITH DIAERESIS
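
If you ever need to look up a character yourself, Python's standard `unicodedata` module can print the same information. A small sketch (the `describe` helper is just a name I made up):

```python
import unicodedata

def describe(ch):
    """Return 'char = U+XXXX = OFFICIAL UNICODE NAME' for one character."""
    return f"{ch} = U+{ord(ch):04X} = {unicodedata.name(ch)}"

for ch in "ãĉö":
    print(describe(ch))

# OCR output sometimes stores an accent as a separate combining character.
# NFC normalization folds "a" + COMBINING TILDE into the single character.
decomposed = "a\u0303"
print(unicodedata.normalize("NFC", decomposed) == "\u00e3")
```

Paste any mystery character into the string and it will tell you the code point and official name.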

Quote:
Originally Posted by Greg Anos
There characters are only used for place names, and for scientific journal citations, in journals that are not in English. They are usually in Portuguese or French.I needed the umlauts, because some of the authors are German, with umlauts in their last names.
If you have Finereader, set the Document Language to:

Code:
English; French; German; Portuguese
If you're using Tesseract, do a similar thing.
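
For Tesseract, the rough equivalent is the `-l` option with language codes joined by `+`. Below is a sketch that only builds the command line. It assumes Tesseract and the `eng`, `fra`, `deu`, and `por` language data are installed, and the file names are placeholders:

```python
def build_tesseract_command(image_path, output_base,
                            languages=("eng", "fra", "deu", "por")):
    """Build a Tesseract command line that enables several languages at once."""
    # Tesseract takes multiple languages as one "+"-joined argument to -l.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]

# Placeholder file names; substitute your own scanned page.
cmd = build_tesseract_command("page_001.png", "page_001")
print(" ".join(cmd))

# To actually run it (needs Tesseract on your PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```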

DO NOT tell the OCR program the documents are "only English". When you enable these other languages, it allows their expanded alphabets to be recognized. (As explained in that Fraktur thread I linked to in Post #4.)

One downside of enabling other languages:
  • you may get slightly more OCR errors introduced
    • An "o + two specks of dust" may be mistaken for 'ö'

but the time saved by OCR getting accented characters right will easily outweigh the time spent correcting those few extra errors.
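
One way to spot-check for those false accents (my own suggestion, not a feature of any OCR program) is to list every non-ASCII character in the output along with how often it appears. A character that shows up only once or twice is worth a look. A minimal sketch, with a made-up helper name and sample text:

```python
from collections import Counter
import unicodedata

def non_ascii_report(text):
    """List each non-ASCII character with its count and Unicode name."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [(ch, n, unicodedata.name(ch, "UNKNOWN"))
            for ch, n in counts.most_common()]

# Hypothetical OCR text containing a few accented names.
sample = "Müller and Möller visited São Paulo; Müller returned."
for ch, n, name in non_ascii_report(sample):
    print(f"{ch}  x{n}  {name}")
```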

Quote:
Originally Posted by Greg Anos
This is a long haul, if you have better ways to do it, please let me know. (I've done 8 out of 180, so far.)
You may also be interested in this recent topic posted by anonlivros:

While his tutorial was dealing with how to take pictures to "scan" + clean a book...

I summarized A TON of my "cleanup images and get them OCRed into ebooks" knowledge in there too.

Lots of reading/learning, but I guarantee you'll save way more time in the long-run.

Last edited by Tex2002ans; 06-10-2021 at 11:42 PM.