08-29-2024, 12:26 PM   #1
Shohreh
Addict
Posts: 207
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Question [SOLVED] [OCR] Extract text layer, fix errors, re-import?

Hello,

I noticed some typos in the text layer that an OCR tool added to a "bitmap" PDF, i.e. a PDF whose pages are actually scanned images.

I first tried opening the EPUB generated by ABBYY FineReader. LibreOffice couldn't open it at all. Sigil could, after showing an error message, but as far as I can tell it lacks the French dictionary needed to spellcheck the text.

As an alternative, pdftotext or mutool (convert) can extract the text layer from such a PDF, but can they put it back once I have fixed the typos?

Thank you.

--
Edit: An easy solution is to convert the PDF to EPUB using ABBYY FineReader, and then run the HTML files inside the EPUB through a spellchecker.
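
For anyone who wants to script that last step, here is a rough sketch in Python (standard library only). It relies on the fact that an EPUB is a ZIP archive containing HTML/XHTML files. The function names (unknown_words, check_epub) and the word-list file mentioned below are my own assumptions for illustration, not anything FineReader or Sigil provides, and a plain word-list lookup is a much cruder check than a real spellchecker.

```python
import re
import zipfile
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML/XHTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def unknown_words(html, dictionary):
    """Return the sorted set of words in `html` that are not in `dictionary`.

    `dictionary` is a set of lowercase words (hypothetical word list).
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Runs of letters only, so accented characters stay part of the word.
    words = re.findall(r"[^\W\d_]+", text)
    return sorted({w for w in words if w.lower() not in dictionary})


def check_epub(epub_path, dictionary):
    """Map each HTML file inside the EPUB to its list of unknown words."""
    report = {}
    with zipfile.ZipFile(epub_path) as epub:
        for name in epub.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                html = epub.read(name).decode("utf-8", errors="replace")
                bad = unknown_words(html, dictionary)
                if bad:
                    report[name] = bad
    return report
```

You would load a French word list into a set (for example one lowercase word per line from a hypothetical fr.txt) and call check_epub("book.epub", words). Elisions such as "l'" are split at the apostrophe, so the word list needs to contain those short forms or they will be reported as unknown.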

Last edited by Shohreh; 08-30-2024 at 03:28 AM.