10-24-2022, 08:15 AM | #1 |
Enthusiast
Posts: 49
Karma: 26
Join Date: Jan 2022
Device: none
|
OCR'd PDF to EPUB/TXT/etc. not copying text over (text under image).
I made a searchable, OCR'd PDF in ABBYY with the "save text under image" setting (this is what OCR software usually defaults to, so that the original scan is displayed in case the OCR made a mistake). Whenever I try to convert the PDF in Calibre, though, it ignores the embedded OCR text and just outputs the original page images. How can I resolve this? I can copy and paste the OCR'd text from the PDF, so I know the text layer itself is there. Calibre just isn't seeing it for some reason.
Last edited by Tenome; 10-24-2022 at 08:39 AM. |
10-24-2022, 10:17 AM | #2 |
Evangelist
Posts: 420
Karma: 2737916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
|
I have found some PDFs that have this problem. Calibre uses the pdftohtml tool to pull the text out of a PDF, and for some reason that can fail. If you take the PDF out of Calibre and run pdftohtml on it from the command line, you get nothing, but if you run pdftotext on the same file you usually do get the text. I've never seen an explanation of why text that is definitely there does not respond to pdftohtml. Another example of the evil behaviour of PDFs!
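For anyone who wants to script the pdftotext check described above, here is a rough sketch in Python. It assumes the poppler command-line tools are installed and on your PATH; the helper names (`pdftotext_cmd`, `run_tool`, `has_text`) are made up for illustration and are not part of Calibre or poppler. It only tells you whether pdftotext can see a text layer and, if so, saves it to a .txt file next to the PDF.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def pdftotext_cmd(pdf_path):
    # "-layout" keeps the rough page layout; "-" as the output file
    # asks pdftotext to write the text to stdout instead of a file.
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def run_tool(cmd):
    """Return the tool's stdout as text, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None  # tool is not installed / not on PATH
    result = subprocess.run(cmd, capture_output=True, check=False)
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")


def has_text(output):
    # True only if the tool returned something other than whitespace.
    return bool(output and output.strip())


if __name__ == "__main__" and len(sys.argv) > 1:
    pdf = Path(sys.argv[1])
    text = run_tool(pdftotext_cmd(pdf))
    if has_text(text):
        out = pdf.with_suffix(".txt")
        out.write_text(text, encoding="utf-8")
        print(f"pdftotext found text; saved to {out}")
    else:
        print("pdftotext returned no text (or is not installed)")
```

If it does produce a .txt file, one possible workaround is to add that file to Calibre and convert from TXT instead of from the PDF, though you lose the page images and most of the formatting that way.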