MobileRead Forums - View Single Post

Tex2002ans · 07-24-2020, 03:49 PM

Quote:

Originally Posted by Shohreh

Because FineReader did not carry formatting,

What did you export as?

You should see italics/bold showing up in the right half of Finereader:

[Image: Finereader.Left.Right.Halves.png]

Left should display the original document, and the Right half should show all the actual OCRed text.

Did you select Document Layout: "Formatted Text"? In the dropdown, you can also select DOCX:

[Image: Finereader.Formatted.Text.png]

(Personally, I keep everything on "Exact Copy" until I'm ready to export the document. This makes the Left/Right halves match much more closely, making it easier to make corrections.)

Quote:

Originally Posted by Shohreh

I wanted to try other tools, especially since the PDF contained two layers, so it made sense to extract the "text" layer and see how it compared with running the PDF through FineReader.

I'm going to let you know right now: the text layer is almost always a garbled mess.

It's almost always better to re-OCR and work from scratch (see the PDF+OCR topics I previously linked to).
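If you still want to see how far the embedded text layer is from a fresh OCR pass, a rough similarity score is enough to tell. This is only a sketch using Python's standard difflib; the function name and the sample strings are made up for illustration, and you would feed it the text you extracted from the PDF and the text exported from FineReader.

```python
import difflib


def similarity(text_layer: str, ocr_text: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two extracted texts.

    Whitespace is collapsed first so that differing line breaks
    do not count as differences.
    """
    a = " ".join(text_layer.split())
    b = " ".join(ocr_text.split())
    return difflib.SequenceMatcher(None, a, b).ratio()


# Hypothetical samples: a garbled text layer vs. a clean re-OCR.
layer = "Tlie qu1ck brown\nfox jumps ovcr the lazy dog"
ocr = "The quick brown fox jumps over the lazy dog"
print(round(similarity(layer, ocr), 2))
```

A low score on a few sample pages is a good hint that the text layer is not worth salvaging.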

Quote:

Originally Posted by Shohreh

Re-add formatting (bold, italics, etc.)

Yes, exactly, which is why you want the computer doing that.

Finereader does a better job than any other tool at carrying this over (along with superscript/subscript/tables, etc.).

Quote:

Originally Posted by Shohreh

Some hyphenated words weren't corrected by FineReader (but much better than starting from raw text from pdftotext, since FineReader uses a dictionary to fix most of those)

Yes, the soft/hard hyphen is a problem for any tool, but again, Finereader seems to handle these the best.

Still a lot of manual correction needs to be done though, and that's where you use some of the tricks I listed in Post #15.

Spellcheck Lists are a fantastic way to catch/correct these things, along with Regex.
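To give an idea of how a word list and a regex work together on leftover hyphens, here is a minimal sketch in Python. It is not what Sigil or Finereader do internally; the function name, the tiny word list, and the sample sentence are all assumptions for illustration. It joins a hyphenated pair only when the joined form is in the word list, and leaves everything else alone for manual review.

```python
import re

# Two runs of letters joined by a hyphen, optionally followed by
# whitespace (covers "exam-ple" and "format-\nting").
HYPHENATED = re.compile(r"\b([A-Za-z]+)-\s*([A-Za-z]+)\b")


def fix_stray_hyphens(text: str, known_words: set) -> str:
    """Join hyphenated pairs whose joined form is a known word."""

    def repl(match: re.Match) -> str:
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            return joined
        # Unknown when joined: keep the original (e.g. "well-known").
        return match.group(0)

    return HYPHENATED.sub(repl, text)


# Hypothetical word list; in practice, load a real dictionary file.
words = {"example", "formatting", "well", "known"}
print(fix_stray_hyphens("An exam-ple of well-known format- ting.", words))
```

The word list does the real work here: the regex only finds candidates, and anything the list does not recognise stays untouched so you can check it by hand.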

Quote:

Originally Posted by Shohreh

Re-add footnotes

Yep, that one's a pain, but there are methods.

Quote:

Originally Posted by Shohreh

Take pictures of tables and… pictures, and insert them

Finereader should detect all that, and if not, you adjust the recognition boxes.

I explained some of this back in 2014: Post #5 in "Problems converting K2PDF Opt files to EPUB".

Quote:

Originally Posted by Shohreh

Build a ToC

As long as your headings are marked up correctly (<h1> <h2> <h3> ...), you can regenerate the ToC in Sigil.
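To show why the heading markup matters, here is a rough sketch of pulling an outline out of the headings with Python's standard html.parser. This is not Sigil's code, just an illustration; the class name, function name, and sample markup are made up. Anything not tagged as a heading simply never shows up in the outline.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for <h1>, <h2>, and <h3>."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open
        self._parts = []     # text fragments inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            text = "".join(self._parts).strip()
            self.headings.append((self._level, text))
            self._level = None


def outline(xhtml: str) -> list:
    collector = HeadingCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.headings


# Hypothetical chapter markup.
sample = "<h1>Part One</h1><p>Text</p><h2>Chapter <i>1</i></h2><h3>Section</h3>"
for level, title in outline(sample):
    print("  " * (level - 1) + title)
```

A chapter title left as a bold paragraph would be invisible to this, which is the same reason a generated ToC comes out incomplete when headings are not tagged.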