Old 08-15-2022, 01:48 PM   #1
ck18ss@brocku.ca
Junior Member
ck18ss@brocku.ca began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Aug 2022
Device: kindle
Can't extract text in image for MOBI/AZW3, despite using OCR, in Calibre for Kindle

Hi,

I am trying to get Calibre to use the actual text when converting a PDF to MOBI/AZW3 for my Kindle Paperwhite, but it is treating each page of the PDF as a ".png" image. The original PDF had no selectable text, so I first ran OCR on it. Once I could select the text, I converted the PDF to a MOBI file. It opens on my Kindle, but the text is very small because the Kindle is displaying the page images instead of the text.

How should I go about this? Please let me know if I need to clarify things further. Thanks for the assistance.
Old 08-15-2022, 06:34 PM   #2
retiredbiker
Evangelist
retiredbiker ought to be getting tired of karma fortunes by now.
 
Posts: 421
Karma: 2737916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
Getting a pdf into a fully readable flowing text book is blood, sweat and tears. There is no magic; it all depends on what is in the pdf, which can be anything. Read this sticky post for a lot of the troubles: https://www.mobileread.com/forums/sh...d.php?t=118605.

It sounds like yours had just images of the text, with no initial text layer at all. Like a lot of Internet Archive pdfs.

There are tools out there like "ocrmypdf" that sound great but rarely give good results. You don't mention what you used. They work on the whole file and leave you with headers, footers, page numbers, and usually each line as a paragraph. May be OK for making a pdf searchable, but not for conversion.

I've done many of these conversions. I use the tool "pdfimages" to get the pictures out of the pdf, and then, if necessary, I use ImageMagick to process the images into something my OCR is happy with. Dealing with the existing pdf is usually no fun at all.
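
For anyone who wants to script that first step, here is a rough sketch in Python. It only wraps the two command-line tools named above. The ImageMagick options (grayscale, deskew, normalize) are my assumption of a reasonable starting point, not a recipe from this thread, and the function names and the "page"/"clean-" file prefixes are made up for illustration. On ImageMagick 7 the command may be "magick" rather than "convert".

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path: str, out_prefix: str) -> list[str]:
    """Build a pdfimages call that writes each embedded image as a PNG file."""
    # -png makes pdfimages (poppler-utils) write PNG instead of PPM/PBM.
    return ["pdfimages", "-png", pdf_path, out_prefix]


def cleanup_command(src: str, dst: str) -> list[str]:
    """Build an ImageMagick call: grayscale, straighten, stretch contrast."""
    # These options are an assumed starting point; tune them per scan.
    return ["convert", src, "-colorspace", "Gray",
            "-deskew", "40%", "-normalize", dst]


def extract_and_clean(pdf_path: str, work_dir: str) -> list[Path]:
    """Extract the page images, then write a cleaned copy of each one."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, str(work / "page")), check=True)

    cleaned = []
    # pdfimages names its output <prefix>-000.png, <prefix>-001.png, ...
    for image in sorted(work.glob("page-*.png")):
        target = image.with_name("clean-" + image.name)
        subprocess.run(cleanup_command(str(image), str(target)), check=True)
        cleaned.append(target)
    return cleaned
```

Calling extract_and_clean("book.pdf", "pages") would need both tools installed and on the PATH. Check a few of the cleaned images by eye before feeding them to the OCR.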

Then I use the OCRFeeder front end to Tesseract to do the OCR. I do it one page at a time, and place the text into a LibreOffice document as I go. I can avoid headers and other artefacts that will make a mess of it. OCRFeeder is great at recognising paragraphs, end-of-line hyphens, and so on. So I do the basic formatting as I go along, making scene breaks and chapters styled as I like. USE STYLES FOR ALL FORMATTING, not the toolbar, if you want a good conversion.
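
OCRFeeder handles paragraphs and end-of-line hyphens for you. If you only have raw line-by-line OCR text, the same idea can be sketched in a few lines of Python. This is a simplified illustration, not what OCRFeeder actually does internally: it assumes blank lines separate paragraphs, treats a line containing only digits as a page number, and joins every end-of-line hyphen, which will wrongly merge genuine compounds such as "well-" / "known". The function name is hypothetical.

```python
import re


def reflow(ocr_text: str) -> list[str]:
    """Turn one-line-per-row OCR output into a list of paragraphs."""
    paragraphs = []
    current = ""
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the paragraph being built, if any.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if re.fullmatch(r"\d+", line):
            # Assumption: a line of digits only is a stray page number.
            continue
        if current.endswith("-"):
            # Rejoin a word that was hyphenated across the line break.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs


sample = (
    "It was a dark and stor-\n"
    "my night; the rain\n"
    "fell in torrents.\n"
    "\n"
    "12\n"
    "\n"
    "Chapter two began\n"
    "quietly.\n"
)
for paragraph in reflow(sample):
    print(paragraph)
```

You still have to proofread the result; this only removes the most mechanical part of the cleanup.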

I proofread the result for scannos, usually a chapter at a time, in LibreOffice. Then save one last time as docx, and convert to epub. Then I edit the epub to fix the inevitable problems that will still be there, make a correct ToC, and so on.

Then I will proofread again on a Kobo, or convert to an azw3 (never mobi, ancient useless things) and use a Kindle. Either way, I correct the master copy as I go along. I expect to spend 25 to 40 hours on a smallish novel, depending on the quality of the images and the OCR error rate.

That is for a book you might share with somebody and be proud of. If you just want quick and dirty, go ahead and use something like ocrmypdf. Calibre uses "pdftohtml" to find the text during conversion, and sometimes it doesn't work. That sounds like your case. Outside Calibre, try using "pdftotext" and you will probably get a text file. It will probably convert, but it sure won't be pretty.
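
A quick way to see whether a PDF has a usable text layer before blaming Calibre is to run pdftotext and look at how much real text comes out. A minimal Python sketch follows. The 200-character threshold and the function names are arbitrary assumptions for illustration; a short PDF with a perfectly good text layer could fall under the threshold.

```python
import subprocess


def pdftotext_command(pdf_path: str, txt_path: str) -> list[str]:
    """Build a plain pdftotext call (poppler-utils) with default options."""
    return ["pdftotext", pdf_path, txt_path]


def has_text_layer(extracted: str, min_chars: int = 200) -> bool:
    """Heuristic: count letters and digits, ignoring whitespace and form feeds."""
    letters = sum(ch.isalnum() for ch in extracted)
    return letters >= min_chars


def extract_text(pdf_path: str, txt_path: str) -> str:
    """Run pdftotext and return whatever text it wrote."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

If has_text_layer(extract_text("book.pdf", "book.txt")) comes back False, the PDF is most likely images only and needs OCR first. If it comes back True, the resulting text file is something you can try converting.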
Tags
calibre, ocr

All times are GMT -4.