|06-24-2013, 01:59 AM||#1|
Join Date: Oct 2011
Device: kindle 3
PDF with OCR to MOBI
Many PDFs consist of scans of very old documents. Some of them also include an OCR text layer, like this document: http://polona.pl/archive_prod?uid=1095122&cid=1095117
I tried converting it to MOBI in Calibre, but the resulting MOBI file contains only the scanned images, without any of the text available in the OCR layer.
Is there any way to pull the OCR text out of this PDF and convert only that text to MOBI?
|06-24-2013, 10:45 AM||#2|
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
Someone more knowledgeable than myself needs to comment on this issue as I don't really use calibre for conversions.
However, I don't think you're going to find a simple solution, or one that can be handled just in calibre with the right settings. I think you're going to have to strip out the text, load it into ebook creation software and, with extra mark-up and effort, create a new ebook based on the old PDF material. This will be a complex project if you want to do it correctly rather than quickly, particularly if it is meant for long-term use and data retention.
But I would keep the PDFs and the resulting ebook as well. Just in case.
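For what it's worth, the "strip out the text, then rebuild" route could be roughed out like this. It is only a sketch under assumptions: that the OCR layer is stored as ordinary (hidden) text in the PDF, that Poppler's `pdftotext` and calibre's `ebook-convert` are both installed and on the PATH, and that plain text with no real mark-up is good enough for a first pass. The function names (`build_commands`, `tidy_ocr_text`, `pdf_ocr_to_mobi`) are made up for illustration, and the cleanup rules are guesses that will need tuning per document.

```python
import re
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the two command lines: text extraction, then conversion."""
    pdf = Path(pdf_path)
    txt = Path(out_dir) / (pdf.stem + ".txt")
    mobi = Path(out_dir) / (pdf.stem + ".mobi")
    # pdftotext reads the text layer only; the page images are ignored.
    extract = ["pdftotext", "-enc", "UTF-8", str(pdf), str(txt)]
    # ebook-convert picks input/output formats from the file extensions.
    convert = ["ebook-convert", str(txt), str(mobi), "--input-encoding", "utf-8"]
    return extract, convert


def tidy_ocr_text(text):
    """Light cleanup of raw OCR output (assumed rules, adjust as needed)."""
    # pdftotext separates pages with form feeds; turn them into blank lines.
    text = text.replace("\f", "\n\n")
    # Rejoin words split by a hyphen at a line end. This can also merge
    # genuinely hyphenated compounds, so check the result by eye.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of blank lines down to a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


def pdf_ocr_to_mobi(pdf_path, out_dir="."):
    """Extract the OCR text, tidy it, and convert it to MOBI."""
    for tool in ("pdftotext", "ebook-convert"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    extract, convert = build_commands(pdf_path, out_dir)
    subprocess.run(extract, check=True)
    txt = Path(extract[4])  # the .txt path built above
    cleaned = tidy_ocr_text(txt.read_text(encoding="utf-8"))
    txt.write_text(cleaned, encoding="utf-8")
    subprocess.run(convert, check=True)
    return Path(convert[2])  # the .mobi path built above
```

Calling `pdf_ocr_to_mobi("scan.pdf")` (a hypothetical file name) would leave `scan.txt` and `scan.mobi` next to each other. The intermediate text file is kept on purpose so OCR mistakes can be corrected by hand and the conversion re-run, and the original PDF is never modified.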
|06-24-2013, 06:14 PM||#3|
Join Date: Mar 2012
Location: Sydney Australia
Have a look at this thread https://www.mobileread.com/forums/sho...d.php?t=212056