OCR'd PDF to EPUB/TXT/etc. not copying text over (text under image).

Tenome · 10-24-2022, 08:15 AM

I made a searchable OCR'd PDF in ABBYY with the "save text under image" setting (this is what OCR software usually defaults to, so that it displays the original scan in case the OCR made a mistake). Whenever I try to convert the PDF in Calibre, though, it ignores the included OCR'd text and just spits out the original images. How can I resolve this? I'm able to copy and paste the OCR'd text, so I know it's not a problem with the PDF. Calibre just isn't seeing the text for some reason.

retiredbiker · 10-24-2022, 10:17 AM

I have found some PDFs that have this problem. Calibre uses the pdftohtml tool to pull the text out of a PDF, and for some reason that can fail. If you take the PDF out of Calibre and run pdftohtml on it from the command line, you get nothing, but run pdftotext on the same file and you usually do get the text. I've never seen an answer on why some text that is definitely there does not respond to pdftohtml. Another example of the evil behaviour of PDFs!
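
For anyone who wants to script the pdftotext route, here is a minimal sketch in Python. It assumes Poppler's pdftotext is installed and on your PATH; the function names and the file names are made up for illustration, and it is not what Calibre does internally.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Return the argument list for Poppler's pdftotext CLI (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text_layer(pdf_path, txt_path=None):
    """Write the PDF's hidden text layer to a .txt file and return its path."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH (it ships with Poppler)")
    pdf_path = Path(pdf_path)
    # Default to the same name as the PDF with a .txt extension
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling `extract_text_layer("scan.pdf")` should leave a `scan.txt` next to the PDF if the text layer is readable. You could then add that TXT file to Calibre and convert it instead of the PDF, though you lose the page images and most formatting that way.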



