PDF with OCR to MOBI

noisy · 06-24-2013, 01:59 AM

There are a lot of PDF documents that contain scans of very old documents. Some of them also contain an OCR layer, like this one: http://polona.pl/archive_prod?uid=1095122&cid=1095117

I tried converting it to MOBI in calibre, but the resulting MOBI file contained only the scans, without any of the text that is available in the OCR layer.

Is there any way to pull the OCR text out of this PDF and convert only that text to MOBI?

Sabardeyn · 06-24-2013, 10:45 AM

Someone more knowledgeable than myself needs to comment on this issue as I don't really use calibre for conversions.

However, I don't think you're going to find a simple solution, or one that can be handled in calibre alone with the right settings. I think you're going to have to strip out the text, feed it into ebook creation software and, with extra mark-up and effort, create a new ebook based on the old PDF material. This will be a complex project if you want to do it correctly rather than quickly, particularly if the result is meant for long-term use and data retention.

But I would keep the PDFs and the resulting ebook as well. Just in case.
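For anyone who wants to try the "strip out the text, then build a new ebook" route, here is a rough sketch of how it could be scripted. It assumes two command-line tools are installed and on the PATH: pdftotext (from Poppler) to pull the OCR text layer out of the PDF, and ebook-convert (which ships with calibre) to turn the resulting text file into MOBI. The function names build_commands and pdf_text_to_mobi are made up for this example, and nobody in this thread has confirmed that this works on the Polona file. The raw OCR text will almost certainly still need manual cleanup.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path, work_dir="."):
    """Build the two command lines: text extraction, then conversion.

    Hypothetical helper for illustration; file names are derived
    from the name of the input PDF.
    """
    pdf = Path(pdf_path)
    txt = Path(work_dir) / (pdf.stem + ".txt")
    mobi = Path(work_dir) / (pdf.stem + ".mobi")
    # pdftotext writes the PDF's text layer (here, the OCR text) to a file.
    extract = ["pdftotext", "-enc", "UTF-8", str(pdf), str(txt)]
    # ebook-convert picks input and output formats from the file extensions.
    convert = ["ebook-convert", str(txt), str(mobi)]
    return extract, convert


def pdf_text_to_mobi(pdf_path, work_dir="."):
    """Run both steps and return the path of the MOBI file."""
    for tool in ("pdftotext", "ebook-convert"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    extract, convert = build_commands(pdf_path, work_dir)
    subprocess.run(extract, check=True)  # the original PDF is left untouched
    subprocess.run(convert, check=True)
    return convert[-1]
```

Calling pdf_text_to_mobi("scan.pdf") would leave scan.txt and scan.mobi next to the original. Keeping the intermediate .txt file is deliberate: that is where OCR errors can be fixed by hand before running the conversion again.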

BetterRed · 06-24-2013, 06:14 PM

Have a look at this thread https://www.mobileread.com/forums/sho...d.php?t=212056

BR
