Can't extract text in image for MOBI/AZW3, despite using OCR, in Calibre for Kindle

ck18ss@brocku.ca · 08-15-2022, 12:48 PM

Hi,

I am trying to get Calibre to use the text in a PDF when converting it to MOBI/AZW3 for my Kindle Paperwhite, but it is treating each page of the PDF as a ".png" image. I first ran OCR on a PDF whose text I could not select. Once the text was selectable, I converted the PDF to a MOBI file. It opens on my Kindle, but the text is very small because it is showing the page images instead of the text.

How should I go about this? Please let me know if I need to clarify things further. Thanks for the assistance.

retiredbiker · 08-15-2022, 05:34 PM

Getting a pdf into a fully readable flowing text book is blood, sweat and tears. There is no magic; it all depends on what is in the pdf, which can be anything. Read this sticky post for a lot of the troubles: https://www.mobileread.com/forums/sh...d.php?t=118605.

It sounds like yours had just images of the text, with no initial text layer at all. Like a lot of Internet Archive pdfs.

There are tools out there like "ocrmypdf" that sound great but rarely give good results. You don't mention what you used. They work on the whole file and leave you with headers, footers, page numbers, and usually each line as a paragraph. May be OK for making a pdf searchable, but not for conversion.
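To make the "each line as a paragraph" problem concrete, here is a minimal Python sketch of the kind of cleanup that whole-file OCR output tends to need. It is not part of any tool named above; the function name `reflow` and the rules it applies (blank lines mark paragraph breaks, a line of only digits is a page number, a trailing hyphen means a split word) are assumptions chosen for illustration, and real scans will need more care.

```python
import re


def reflow(page_text: str) -> str:
    """Join hard-wrapped OCR lines into paragraphs.

    Assumptions (illustrative only): blank lines separate paragraphs,
    a line made of 1-4 digits is a stray page number, and a line ending
    in "-" is a word split across two lines.
    """
    paragraphs = []
    current = ""
    for raw in page_text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"\d{1,4}", line):
            continue  # drop bare page numbers
        if not line:
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Rejoin a hyphenated word. This also merges genuine
            # compounds such as "well-" + "known", so proofread.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)
```

Running headers and footers are harder to detect automatically, which is one reason a page-by-page manual pass gives cleaner results.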

I've done many of these conversions. I use the tool "pdfimages" to get the pictures out of the pdf, and then, if necessary, I use ImageMagick to process the images into something my OCR is happy with. Dealing with the existing pdf is usually no fun at all.
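As a rough sketch of that extraction step, the Python below builds the command lines for pdfimages (from Poppler) and ImageMagick and runs them in sequence. The helper names, the `raw`/`clean` file prefixes, and the particular cleanup options (greyscale, deskew, normalize) are assumptions for illustration, not the exact settings used above. It also assumes the ImageMagick 6 `convert` command; on ImageMagick 7 the command is `magick`.

```python
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf: str, out_prefix: str) -> list[str]:
    # pdfimages writes files named <out_prefix>-000.png, <out_prefix>-001.png, ...
    return ["pdfimages", "-png", pdf, out_prefix]


def cleanup_cmd(src: str, dst: str) -> list[str]:
    # Hypothetical cleanup recipe: greyscale, straighten, stretch contrast.
    return ["convert", src, "-colorspace", "Gray",
            "-deskew", "40%", "-normalize", dst]


def extract_and_clean(pdf: str, workdir: str) -> list[Path]:
    """Pull page images out of the PDF, then write a cleaned copy of each."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_cmd(pdf, str(out / "raw")), check=True)
    cleaned = []
    for raw in sorted(out.glob("raw-*.png")):
        dst = out / raw.name.replace("raw-", "clean-", 1)
        subprocess.run(cleanup_cmd(str(raw), str(dst)), check=True)
        cleaned.append(dst)
    return cleaned
```

Calling `extract_and_clean("book.pdf", "pages")` requires both tools on the PATH. Which cleanup options help depends entirely on the scan, so treat the recipe as a starting point.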

Then I use the OCRFeeder front end to Tesseract to do the OCR. I do it one page at a time, and place the text into a LibreOffice document as I go. I can avoid headers and other artefacts that will make a mess of it. OCRFeeder is great at recognising paragraphs, end-of-line hyphens, and so on. So I do the basic formatting as I go along, making scene breaks and chapters styled as I like. USE STYLES FOR ALL FORMATTING, not the toolbar, if you want a good conversion.
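OCRFeeder is a graphical front end, so there is nothing to script there. For anyone who wants the same page-at-a-time idea from the command line, the sketch below calls Tesseract directly on a single image. This is a substitute technique, not the workflow described above; the function names are hypothetical, and it assumes the `tesseract` binary and the English language data are installed.

```python
import subprocess


def tesseract_cmd(image: str, lang: str = "eng") -> list[str]:
    # Using "stdout" as the output base makes Tesseract print the
    # recognised text instead of writing a .txt file.
    return ["tesseract", image, "stdout", "-l", lang]


def ocr_page(image: str, lang: str = "eng") -> str:
    """OCR one page image and return its text for manual review."""
    result = subprocess.run(tesseract_cmd(image, lang), check=True,
                            capture_output=True, text=True)
    return result.stdout
```

Doing one page per call keeps the review step manual: you can read each page's text, skip headers and other artefacts, and paste only what belongs in the book.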

I proofread the result for scannos, usually a chapter at a time, in LibreOffice. Then save one last time as docx, and convert to epub. Then I edit the epub to fix the inevitable problems that will still be there, make a correct ToC, and so on.

Then I will proofread again on a Kobo, or convert to an azw3 (never mobi, ancient useless things) and use a Kindle. Either way, I correct the master copy as I go along. I expect to spend 25 to 40 hours on a smallish novel, depending on the quality of the images and the OCR error rate.
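The two format conversions can be scripted with Calibre's command-line tool, ebook-convert, which picks the output format from the file extension. The sketch below only plans the docx to epub to azw3 chain; using ebook-convert for the docx step is an assumption (the post does not say which converter is used), and the function names are hypothetical.

```python
from pathlib import Path


def convert_cmd(src: str, dst: str) -> list[str]:
    # ebook-convert infers input and output formats from the extensions.
    return ["ebook-convert", src, dst]


def conversion_chain(docx: str) -> list[list[str]]:
    """Return the commands for docx -> epub, then epub -> azw3."""
    epub = str(Path(docx).with_suffix(".epub"))
    azw3 = str(Path(docx).with_suffix(".azw3"))
    return [convert_cmd(docx, epub), convert_cmd(epub, azw3)]
```

Each command can be passed to `subprocess.run(cmd, check=True)`. In practice the two steps are not run back to back, since the epub gets edited by hand before the azw3 is made.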

That is for a book you might share with somebody and be proud of. If you just want quick and dirty, go ahead and use something like ocrmypdf. Calibre uses "pdftohtml" to find the text during conversion, and sometimes it doesn't work. That sounds like your case. Outside Calibre, try using "pdftotext" and you will probably get a text file. It will probably convert, but it sure won't be pretty.
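A quick way to try the pdftotext route, and to check whether the OCR'd PDF really carries a usable text layer, is sketched below in Python. The function names and the 200-character threshold are arbitrary assumptions for illustration; the only external piece is Poppler's pdftotext, where "-" as the output file means standard output.

```python
import subprocess


def pdftotext_cmd(pdf: str, txt: str = "-") -> list[str]:
    # With "-" as the output name, pdftotext prints the text to stdout.
    return ["pdftotext", pdf, txt]


def looks_like_text(extracted: str, min_chars: int = 200) -> bool:
    """Heuristic: enough letters and digits to count as a real text layer."""
    return sum(ch.isalnum() for ch in extracted) >= min_chars


def pdf_has_text_layer(pdf: str) -> bool:
    result = subprocess.run(pdftotext_cmd(pdf), check=True,
                            capture_output=True, text=True)
    return looks_like_text(result.stdout)
```

If `pdf_has_text_layer("book.pdf")` comes back true, running `pdftotext book.pdf book.txt` should give a text file that Calibre can convert, rough formatting and all.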



