MobileRead Forums - View Single Post

Tex2002ans · 07-13-2020, 05:18 PM

Quote:

Originally Posted by patrik

Do you still use Scan Tailor if you are going to use Finereader afterwards?

Scan Tailor Advanced is best used as a pre-OCR step.

It's really only needed if you have ugly input that needs serious cleaning.

You mentioned taking pictures with your smartphone, so that would cause issues like:

Rotation
Spine showing
- It can easily split the Left/Right pages, or crop the spine out of the image.
Bent pages (thus bent/wavy lines of text)
- It can dewarp them so the lines become straight.
Uneven Lighting/Color (Yellowed Pages)
- When converting to grayscale/B&W, you could get a "ring" or tons of black speckles.

So Scan Tailor would take you from something like this:

[Image: Page16.jpg]

[Image: Page17.jpg]

to this:

[Image: Attachment 177415]

(Those images were from the book in the "Optimize PDFs" thread.)

Related Side Note: I also gave an example of handling OCR + images in "How to handle images in books while doing OCR of books?".

Quote:

Originally Posted by patrik

BTW, I recently got a new scanner with a fairly good software (which uses Finereader for ocr).

Is it the full Finereader? Or just some instant scan -> PDF/DOCX thing?

If it's the full Finereader, you should be able to open it up and have an Original+OCR split in the Left/Right windows.

See my 2013 posts in "Best way to copy text from a PDF or MOBI?". The split view lets you easily see a magnified version of the exact location in the book, and make sure the text is correct.

That's exactly how I squash most errors... right at the source!

Quote:

Originally Posted by patrik

The best output of it is docx.

That's one thing I've changed within the past few years... now I trust Toxaris's EPUB Tools to clean up Finereader's cruft.

When you export, change Finereader to "Formatted Text" and DOCX. Toxaris's EPUB Tools will then clean up the rest.

From there, you could do further cleaning in DOCX (if that's what you're comfortable with), or get it into EPUB as soon as possible + do your cleaning there (that's what I prefer).

Quote:

Originally Posted by patrik

But I miss the "verify text" step. Have you, or anyone else, found a better, or at least equal way to go through the text to find errors?

There are still multiple rounds of proofing that have to be done. Nothing gets rid of that.

As always, it's best to squash this stuff as close to the source as possible.

1. Clean Input Images = More Accurate OCR

The cleaner the input, the less time wasted fixing errors.

2. Mark/Proofread in Finereader

This is where you make sure "big picture" things are marked—Text, Images, Tables, Headers/Footers.

Then it's helpful to focus on all the "blue highlights" (unsure characters) and fix as many of those as you can.

Also make sure things like bold/italics/superscripts are carried over properly.

3. Export DOCX (or EPUB) out of Finereader

Do further cleanup.

Toxaris's EPUB Tools merges accidentally split paragraphs together, etc.

You may have to restore intentional line breaks that were accidentally merged (for example, in poetry).
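
To make the merging idea concrete, here's a minimal Python sketch of one common heuristic. This is NOT how Toxaris's EPUB Tools works internally (that isn't described here); the function name and the rule itself are assumptions for illustration: join a paragraph to the previous one when the previous one doesn't end in sentence-ending punctuation and the next one starts with a lowercase letter.

```python
import re

# Sentence-ending punctuation, optionally followed by closing quotes/brackets.
_ENDS_SENTENCE = re.compile(r"""[.!?:;…]["'”’)\]]*$""")


def merge_split_paragraphs(paragraphs):
    """Join paragraphs that look like they were split mid-sentence.

    Hypothetical heuristic: the previous paragraph has no sentence-ending
    punctuation AND the current one starts with a lowercase letter.
    """
    merged = []
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue  # skip empty paragraphs
        if (merged
                and not _ENDS_SENTENCE.search(merged[-1])
                and text[0].islower()):
            if merged[-1].endswith("-"):
                # Assume the hyphen came from a word broken across the split.
                # (This can wrongly join a genuine compound, so review these.)
                merged[-1] = merged[-1][:-1] + text
            else:
                merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


paras = [
    "The quick brown fox jumped over",
    "the lazy dog.",
    "A new paragraph starts here.",
    "It was a remark-",
    "able day.",
]
result = merge_split_paragraphs(paras)
```

The same rule would happily glue together two lines of verse that lack end punctuation, which is exactly why poetry needs a manual look afterwards.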

If you're comfortable with Word, you may want to add in some more Styles/formatting here (headings, blockquotes, captions, [...]).

4. Clean the EPUB

This is where you also make sure all the little things are correct:

Headings are <h1>-<h6>
Paragraphs are correct
Indentation is correct
Blockquotes are <blockquote>s
Footnotes are footnotes
Left/Center/Right alignment
[...]
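
As a small illustration of the headings item, here's a hedged Python sketch that promotes paragraphs which are really chapter titles into proper headings. The "Chapter <number>" pattern, the default `<h2>` level, and the function name are all assumptions for the example; your book's headings will likely need a different pattern.

```python
import re

# Hypothetical pattern: a <p> whose entire content is "Chapter <number>".
CHAPTER_P = re.compile(r"<p\b[^>]*>\s*(Chapter\s+\d+)\s*</p>", re.IGNORECASE)


def promote_chapter_headings(xhtml, level=2):
    """Replace <p ...>Chapter N</p> with <hN>Chapter N</hN>, dropping the old attributes."""
    return CHAPTER_P.sub(rf"<h{level}>\1</h{level}>", xhtml)


sample = '<p class="calibre3">Chapter 12</p>\n<p>Chapter 12 was long.</p>'
cleaned = promote_chapter_headings(sample)
```

Only the first paragraph is changed, because the second one contains more than the chapter label. The same search/replace pair can be pasted into a regex Find & Replace dialog instead of being run as a script.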

And with Sigil/Calibre, you have access to more powerful tools/Regex.

For example, one of my favorite tricks is still to search for all hyphenated words in the Spellcheck Lists (I wrote about that all the way back in 2013!).
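
If you want to approximate the hyphenated-word check outside Sigil/Calibre, here's a rough Python sketch. It lists every hyphenated word with its count, plus how often the joined (unhyphenated) form appears in the same text. Comparing against the joined form is an assumption added for illustration (leftover line-break hyphens like "inter-esting" tend to show up that way); the function name is hypothetical.

```python
import re
from collections import Counter

# Letters, optionally joined by single hyphens: "well-known", "to-day", "book".
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def hyphenated_word_report(text):
    """Return (hyphenated_word, count, count_of_joined_form) tuples, sorted by word."""
    counts = Counter(w.lower() for w in WORD.findall(text))
    report = []
    for word, n in sorted(counts.items()):
        if "-" not in word:
            continue
        joined = word.replace("-", "")
        # Counter returns 0 for missing keys without adding them.
        report.append((word, n, counts[joined]))
    return report


text = "The inter-esting book was interesting. A well-known, well-known author."
report = hyphenated_word_report(text)
```

Entries where the last number is non-zero are the ones worth checking first, since the book itself also spells the word without a hyphen.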

And now that "numbers are words", you can use a similar trick to find whole classes of OCR errors (0<->O, 1<->l). (See "Suggestion: Spellcheck Enhancement (Numbers)").
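
Here's a minimal Python sketch of the same idea: treat any run of letters and digits as a "word", then flag the ones that mix letters with digits and contain a commonly confused character. The character set and function name are assumptions for illustration, not what Sigil's spellcheck list does internally.

```python
import re

# Any run of letters and digits counts as one "word".
TOKEN = re.compile(r"[A-Za-z0-9]+")

# Characters that OCR commonly confuses with each other (assumed set).
SUSPECT_CHARS = set("01OoIl")


def suspicious_tokens(text):
    """Return tokens that mix letters and digits and contain a confusable character."""
    hits = []
    for tok in TOKEN.findall(text):
        has_digit = any(c.isdigit() for c in tok)
        has_alpha = any(c.isalpha() for c in tok)
        if has_digit and has_alpha and SUSPECT_CHARS & set(tok):
            hits.append(tok)
    return hits


sample = "In 19l4 the w0rld changed; by 1918, 20 million had died. See p. 4O."
hits = suspicious_tokens(sample)
```

Expect some false positives (ordinals like "1st" get flagged too), so treat the output as a list to review, not a list to auto-replace.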

5. Run through a final Spellcheck/Grammarcheck pass

See my 2018 post in "Does Tool Exist to Spellcheck/Grammarcheck by Category?".

If you spellchecked in Sigil/Calibre, maybe try Word (different dictionaries may point out other misspellings).

If you grammarchecked in Word, maybe try LanguageTool or Antidote. Different tools might catch different errors.

And I definitely run EPUB Tools's Dialogue Check—it's the best damn thing since sliced bread, and it catches all the mismatching quotation marks + parentheses/brackets.
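
To show the kind of thing such a check looks for, here's a simplified Python sketch that flags paragraphs with mismatched curly double quotes or parentheses/brackets. This is only an illustration under stated assumptions: it is not EPUB Tools's actual Dialogue Check, the function name is hypothetical, and it ignores straight quotes and single quotes (so apostrophes don't trip it).

```python
# Opening character -> expected closing character.
PAIRS = {"“": "”", "(": ")", "[": "]"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}


def unbalanced_paragraphs(paragraphs):
    """Return (index, paragraph) for each paragraph whose quotes/brackets don't pair up."""
    flagged = []
    for i, para in enumerate(paragraphs):
        stack = []
        ok = True
        for ch in para:
            if ch in PAIRS:
                stack.append(ch)
            elif ch in CLOSERS:
                # A closer must match the most recent unclosed opener.
                if not stack or stack[-1] != CLOSERS[ch]:
                    ok = False
                    break
                stack.pop()
        if not ok or stack:  # wrong closer, or something left open
            flagged.append((i, para))
    return flagged


paras = [
    "“Hello,” she said.",
    "“Where are you going? he asked.",
    "He left (quietly].",
    "Nothing to see here.",
]
flagged = unbalanced_paragraphs(paras)
```

Multi-paragraph speeches, where each paragraph opens a quote but only the last one closes it, will be flagged as well, so those need a human decision.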

Quote:

Originally Posted by patrik

You have no idea how many notes I have due to your posts.

I'd be interested in learning what sorts of things you marked down in your notes.

I've been trying to put together an "FAQ"-type series of posts for the blog... and I have no idea what sorts of things people found useful over the years.

PM me if you don't want to type about it here. (Wouldn't want to derail this thread.)