Old 07-24-2020, 03:49 PM   #39
Tex2002ans
Quote:
Originally Posted by Shohreh
Because FineReader did not carry formatting,

What did you export as?

You should see italics/bold showing up in the right half of Finereader:

[Attached image: Finereader.Left.Right.Halves.png]

The Left half should display the original document, and the Right half should show the actual OCRed text.

Did you select Document Layout: "Formatted Text"? In the dropdown, you can also select DOCX:

[Attached image: Finereader.Formatted.Text.png]

(Personally, I keep everything on "Exact Copy" until I'm ready to export the document. This makes the Left/Right halves match much more closely, making it easier to make corrections.)

Quote:
Originally Posted by Shohreh
I wanted to try other tools, especially since the PDF contained two layers, so it made sense to extract the "text" layer and see how it compared with running the PDF through FineReader.

I'm going to let you know right now: the text layer is almost always a garbled mess.

It's almost always better to re-OCR and work from scratch (see the PDF+OCR topics I previously linked to).
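If you want to check this on your own PDF before deciding, here is a rough sketch of one way to compare the two versions. It assumes you have already saved the extracted text layer and the fresh OCR export as plain strings; the function names are made up for illustration, and a word-level diff is only a crude indicator of which version is cleaner.

```python
import difflib


def word_similarity(text_a, text_b):
    """Return a 0..1 similarity ratio between two texts, compared word by word."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()


def differing_words(text_a, text_b):
    """Yield (words_from_a, words_from_b) for every span where the texts disagree."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op != "equal":
            yield words_a[a1:a2], words_b[b1:b2]
```

A low ratio, or a long list of differing spans, suggests the two versions disagree a lot. You still have to look at the differences yourself to see which side is the garbled one.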

Quote:
Originally Posted by Shohreh
Re-add formatting (bold, italics, etc.)

Yes, exactly, which is why you want the computer doing that.

Finereader does a better job than any other tool at carrying this over (along with superscripts, subscripts, tables, etc.).

Quote:
Originally Posted by Shohreh
Some hyphenated words weren't corrected by FineReader (but much better than starting from raw text from pdftotext, since FineReader uses a dictionary to fix most of those)

Yes, telling soft hyphens from hard hyphens is a problem for any tool, but again, Finereader seems to handle these the best.

A lot of manual correction still needs to be done, though, and that's where you use some of the tricks I listed in Post #15.

Spellcheck Lists are a fantastic way to catch/correct these things, along with Regex.
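To give a rough idea of how a word list and a regex can work together on leftover hyphens, here is a minimal sketch. It is not the method from Post #15, just an illustration: `known_words` is a hypothetical set of lowercase dictionary words you supply yourself, and the function only lists candidates for review rather than fixing anything automatically.

```python
import re
from collections import Counter

# Matches two runs of letters joined by a hyphen, e.g. "recog-nition" or "well-known".
HYPHENATED = re.compile(r"\b([A-Za-z]+)-([A-Za-z]+)\b")


def suspicious_hyphens(text, known_words):
    """Count hyphenated words whose joined form is a known word.

    known_words is a hypothetical set of lowercase dictionary words.
    A hit such as "recog-nition" (joined: "recognition") is likely a
    leftover end-of-line hyphen; "well-known" (joined: "wellknown")
    is not reported unless "wellknown" is in the word list.
    """
    hits = Counter()
    for match in HYPHENATED.finditer(text):
        joined = (match.group(1) + match.group(2)).lower()
        if joined in known_words:
            hits[match.group(0)] += 1
    return hits
```

Each hit still needs a human look, since some words are valid both with and without the hyphen (for example "re-cover" and "recover").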

Quote:
Originally Posted by Shohreh
Re-add footnotes

Yep, that one's a pain, but there are methods.

Quote:
Originally Posted by Shohreh
Take pictures of tables and… pictures, and insert them

Finereader should detect all that, and if not, you adjust the recognition boxes.

I explained some of this back in 2014: Post #5 in "Problems converting K2PDF Opt files to EPUB".

Quote:
Originally Posted by Shohreh
Build a ToC

As long as your headings are marked up properly (<h1>, <h2>, <h3>, ...), you can regenerate the ToC from Sigil.
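To show why clean heading tags make the ToC a mechanical job, here is a small sketch that pulls <h1> to <h3> out of a chapter and prints an indented outline. This is not how Sigil does it internally; it is just an illustration using Python's standard html.parser, and the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for <h1>..<h3> in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse runs of whitespace so wrapped headings stay on one line.
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None


def outline(xhtml):
    """Return one indented line per heading, two spaces per nesting level."""
    parser = HeadingCollector()
    parser.feed(xhtml)
    parser.close()
    return ["  " * (level - 1) + text for level, text in parser.headings]
```

If a chapter title was left as a bold <p> instead of a real heading, it never shows up in an outline like this, which is the same reason it would be missing from a generated ToC.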

Last edited by Tex2002ans; 07-24-2020 at 03:58 PM.