MobileRead Forums - View Single Post - OCR'd PDF to EPUB/TXT/etc. not copying text over (text under image).

retiredbiker · 10-24-2022, 10:17 AM

I have found some PDFs that have this problem. Calibre uses the pdftohtml tool to pull the text out of a PDF, and for some reason that can fail. Take the PDF out of Calibre and run pdftohtml on it from the command line and you get nothing, but run pdftotext on the same file and you usually do get the text. I've never seen an explanation of why text that is definitely there does not come out with pdftohtml. Another example of the evil behaviour of PDFs!
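If you want to check a particular file for this, here is a rough Python sketch that runs both tools on the same PDF and counts how much visible text each one returns. It assumes the poppler command-line tools (pdftohtml and pdftotext) are installed and on your PATH. The function names, the `run` parameter, and the crude tag-stripping are my own for illustration, and the flags are plain poppler options, not necessarily the ones Calibre passes internally.

```python
import re
import subprocess


def visible_chars(text):
    """Count non-whitespace characters after crudely removing HTML tags."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return len(re.sub(r"\s+", "", without_tags))


def build_commands(pdf_path):
    """Command lines that make each tool write its result to stdout."""
    return {
        # -stdout: print HTML to stdout, -i: skip images, -noframes: single document
        "pdftohtml": ["pdftohtml", "-stdout", "-i", "-noframes", pdf_path],
        # -layout: keep the physical layout; "-" as output file means stdout
        "pdftotext": ["pdftotext", "-layout", pdf_path, "-"],
    }


def compare_extractors(pdf_path, run=subprocess.run):
    """Return {tool name: visible character count} for one PDF.

    `run` is injectable (hypothetical convenience) so the function can be
    exercised without the real tools installed.
    """
    counts = {}
    for tool, cmd in build_commands(pdf_path).items():
        proc = run(cmd, capture_output=True, text=True, check=False)
        # Treat a failed run the same as "no text came out".
        output = proc.stdout if proc.returncode == 0 else ""
        counts[tool] = visible_chars(output)
    return counts
```

Calling `compare_extractors("book.pdf")` on an affected file should show a count at or near zero for pdftohtml and a much larger one for pdftotext. The tag stripping is only approximate (page titles and similar still get counted), so treat the numbers as a rough signal. If pdftotext does get the text, converting its TXT output in Calibre instead of the PDF is one workaround.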
