MobileRead Forums

feuille · 04-29-2023, 05:07 AM

Quote:

Originally Posted by Quoth

However I think text from PDF images via OCR is a workflow best done before the final version is added to the Library.

Agreed when it comes to creating a new format from a scan. I do the same.
In my use case, however, the goal is to add a text layer to a PDF whose layout must stay unchanged, for example to enable full-text search (FTS) and text extraction.
In my experience, with a good scan and a correctly configured OCR engine (Tesseract), recognition errors are relatively few and hardly affect FTS. If you copy a passage for a quotation, you can easily check it against the original layout. Also, before writing the text layer, I plan to show the OCR result in a text editor so that it can be proofread.
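
To make the idea concrete, here is a rough sketch of how a single page could be run through the Tesseract command line to get a PDF with an invisible text layer over the page image. It is only an illustration under some assumptions: the function names and file names are made up, the page is assumed to already exist as an image file (Tesseract reads images, not PDFs), and the language and page segmentation mode are placeholders to adjust for the scan at hand.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, output_base, lang="eng", psm=3):
    """Return the argv for a Tesseract run that writes <output_base>.pdf.

    Hypothetical helper: 'image' is a page image exported beforehand,
    'lang' and 'psm' are placeholder settings, not recommended values.
    """
    return [
        "tesseract",
        str(image),        # input page image
        str(output_base),  # output path without extension
        "-l", lang,        # recognition language
        "--psm", str(psm), # page segmentation mode
        "pdf",             # output config: image plus invisible text layer
    ]


def ocr_page_to_pdf(image, output_base, lang="eng", psm=3):
    """Run Tesseract on one page image and return the path of the new PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_cmd(image, output_base, lang, psm), check=True)
    return Path(f"{output_base}.pdf")


if __name__ == "__main__":
    # Only prints the command; call ocr_page_to_pdf() to actually run it.
    print(" ".join(build_tesseract_cmd("page-001.png", "page-001")))
```

Replacing the final "pdf" argument with "txt" makes Tesseract write plain text instead, which is one way to get something to proofread in an editor before the text layer is written. Note that this route builds a new PDF around the page image; to add a text layer to an existing PDF without touching it otherwise, a wrapper such as OCRmyPDF (which uses Tesseract) may be the better fit.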