01-20-2018, 06:16 PM   #1
orebmur
Veteran Linux user
OCRmyPDF adds an OCR text layer to scanned PDF files

Just stumbled over this wonderful tool:

github.com/jbarlow83/OCRmyPDF

* Generates a searchable PDF/A file from a regular PDF
* Places OCR text accurately below the image to ease copy / paste
* Keeps the exact resolution of the original embedded images
* When possible, inserts OCR information as a "lossless" operation without rendering vector information
* Keeps file size about the same
* If requested, deskews and/or cleans the image before performing OCR
* Validates input and output files
* Provides debug mode to enable easy verification of the OCR results
* Processes pages in parallel when more than one CPU core is available
* Uses the Tesseract OCR engine
* Supports more than 100 languages recognized by Tesseract
* Battle-tested on thousands of PDFs, with a test suite and continuous integration

For those on Linux, Debian ships an official package.

So far I have used it to post-process two PDFs of my own making, one in Spanish and one in English, and I am very happy with the results.
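
In case it helps anyone trying the same thing, here is a rough sketch of how a run like that could be scripted from Python. It only assembles the command line and hands it to the `ocrmypdf` program. The helper names `build_ocrmypdf_command` and `run_ocr` and the file names are made up for illustration; the flags (`-l`, `--deskew`, `--clean`, `--jobs`) are ones I know from the command-line tool, but check `ocrmypdf --help` on your version before relying on them.

```python
import subprocess
from typing import List, Optional


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: List[str],
    deskew: bool = False,
    clean: bool = False,
    jobs: Optional[int] = None,
) -> List[str]:
    """Assemble an ocrmypdf command line.

    Hypothetical helper for illustration, not part of OCRmyPDF itself.
    """
    if not languages:
        raise ValueError("at least one Tesseract language code is required")

    # Several languages are joined with '+', e.g. "spa+eng".
    # The matching Tesseract language packs must be installed.
    cmd = ["ocrmypdf", "-l", "+".join(languages)]

    if deskew:
        cmd.append("--deskew")  # straighten crooked scans before OCR
    if clean:
        cmd.append("--clean")  # clean pages before OCR (needs unpaper installed)
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]  # cap the number of parallel workers

    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocr(input_pdf: str, output_pdf: str, languages: List[str], **options) -> None:
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    cmd = build_ocrmypdf_command(input_pdf, output_pdf, languages, **options)
    subprocess.run(cmd, check=True)
```

For a mixed Spanish/English scan, something like `run_ocr("scan.pdf", "scan_ocr.pdf", ["spa", "eng"], deskew=True)` should be equivalent to typing `ocrmypdf -l spa+eng --deskew scan.pdf scan_ocr.pdf` in a terminal.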