Old 04-29-2023, 04:07 AM   #7
feuille
Connoisseur
Posts: 62
Karma: 666
Join Date: May 2020
Location: Germany
Device: android smartphone + tablet
Quote:
Originally Posted by Quoth
However I think text from PDF images via OCR is a workflow best done before the final version is added to the Library.
Agreed when it comes to creating a new format from a scan. I do it that way, too.
In my use case, however, it is about adding a text layer to a PDF whose layout should not be changed, for example to enable full-text search (FTS) and text extraction.
In my experience, with a good scan and a correctly configured OCR engine (Tesseract), recognition errors are relatively few and hardly affect the FTS. If you copy a piece of text for a quote, you can easily check it against the original layout. Also, before writing the text layer, I plan to offer the OCR result in a text editor so that it can be proofread.
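To make the workflow a bit more concrete, here is a minimal sketch of the OCR step for one page. It is only an illustration, not my actual implementation. It assumes the Tesseract command-line tool is installed and that the PDF page has already been rendered to an image (Tesseract reads images, not PDFs). The function names, the file names, and the language code "deu" are hypothetical. The `pdf` and `txt` output configs make Tesseract write a searchable PDF (page image plus invisible text layer) and a plain-text file that could be opened in an editor for proofreading.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image: Path, out_base: Path, lang: str = "deu") -> list[str]:
    """Build the Tesseract call that writes <out_base>.pdf and <out_base>.txt.

    "deu" is an assumed default; use whichever language data is installed.
    """
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf", "txt"]


def ocr_page(image: Path, out_base: Path, lang: str = "deu") -> Path:
    """Run Tesseract on one page image and return the path of the searchable PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(build_tesseract_cmd(image, out_base, lang), check=True)
    # Tesseract appends the extension to the output base name itself.
    return Path(str(out_base) + ".pdf")
```

Calling `ocr_page(Path("page-001.png"), Path("page-001"))` would leave `page-001.pdf` for searching and `page-001.txt` for proofreading next to the image. Note that this produces a new PDF per page rather than modifying the original file, so merging the pages back and feeding corrected text into the layer are separate steps not shown here.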