OCRmyPDF adds an OCR text layer to scanned PDF files

orebmur · 01-20-2018, 06:16 PM

Just stumbled over this wonderful tool:

github.com/jbarlow83/OCRmyPDF

* Generates a searchable PDF/A file from a regular PDF
* Places OCR text accurately below the image to ease copy / paste
* Keeps the exact resolution of the original embedded images
* When possible, inserts OCR information as a "lossless" operation without rendering vector information
* Keeps file size about the same
* If requested deskews and/or cleans the image before performing OCR
* Validates input and output files
* Provides debug mode to enable easy verification of the OCR results
* Processes pages in parallel when more than one CPU core is available
* Uses Tesseract OCR engine
* Supports more than 100 languages recognized by Tesseract
* Battle-tested on thousands of PDFs, a test suite and continuous integration

For those using Linux, there is an official package in Debian.

So far I have used it to post-process both a Spanish-language and an English-language PDF of my own making, and I am very happy with the results.
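For anyone who wants to script this, here is a rough sketch of how a run like the one described above might be driven from Python. It only assembles and launches the `ocrmypdf` command line; the helper names `build_ocrmypdf_command` and `run_ocr` and the file names `scan.pdf` / `scan_ocr.pdf` are made up for illustration, and it assumes the `ocrmypdf` executable plus the Tesseract language data for `spa` and `eng` are already installed.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng",),
                           deskew=False, clean=False, jobs=None):
    """Assemble an ocrmypdf command line as a list of arguments.

    Hypothetical helper, not part of OCRmyPDF itself.
    """
    # Tesseract language codes are joined with "+", e.g. "spa+eng".
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if deskew:
        # Straighten crooked pages before OCR.
        cmd.append("--deskew")
    if clean:
        # Clean pages before OCR (this option relies on unpaper being installed).
        cmd.append("--clean")
    if jobs is not None:
        # Limit how many pages are processed in parallel.
        cmd += ["--jobs", str(jobs)]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocr(input_pdf, output_pdf, **options):
    """Run ocrmypdf if it is on PATH; raise if it is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf executable not found on PATH")
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, **options),
                   check=True)


if __name__ == "__main__":
    # Only print the command here; call run_ocr(...) to actually execute it.
    print(" ".join(build_ocrmypdf_command(
        "scan.pdf", "scan_ocr.pdf", languages=("spa", "eng"), deskew=True)))
```

Keeping the command construction separate from the `subprocess` call makes it easy to check the arguments before anything touches a real file.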

