05-01-2013, 01:42 PM | #1 |
Addict
Posts: 261
Karma: 110864
Join Date: Mar 2013
Location: Bordeaux, France
Device: Kobo Glo, Aura HD, kindle paperwhite
|
no text extraction for pdf with images and OCR
Hi,
I tried to convert PDF files that contain the recognised text inside the original page images of the book: http://www.freidok.uni-freiburg.de/v...f_der_Zahl.pdf
As you can see, the PDF is made of pictures (page images), but you can select the text inside them, and even do a "copy all" and paste it into a text editor.
If I convert the PDF to EPUB with --no-images, there is absolutely no text in the EPUB. If I convert with images, the EPUB contains only the (reduced) images.
Is there a way to get the text of such a PDF without the images of the pages?
(calibre version 0.9.28; Win XP SP3; Adobe PDF 10.1.4)
Thanks for your help,
François |
05-01-2013, 07:54 PM | #2 |
null operator (he/him)
Posts: 20,532
Karma: 26944418
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Some PDF readers (e.g. Tracker's PDF XChange and the one that's built into Firefox) include lightweight OCR - that's what's happening when you do a select all, copy and paste into a text editor.
I would paste the text into my WP program and do some tidying, e.g. removing the page numbers and prettifying the front page, then save it in a format that Calibre can handle. In Open Office Writer that would be ePub via the OO add-in; in MS Office it would be HTML Filtered, which you would then convert to ePub.
You could also OCR the PDF, free tools here ==>> http://www.makeuseof.com/tag/3-free-...ble-documents/
However, I just scanned the PDF with Nuance's Omnipage; it was only marginally better than cut & paste from Tracker PDF XChange and would still need some tidying up.
BR |
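For anyone who would rather script the tidying than do it in a word processor, here is a minimal sketch in Python. It is only an illustration: the function name `tidy_ocr_text` and the rules it applies (drop lines that contain nothing but a page number, rejoin words hyphenated across a line break, merge wrapped lines into paragraphs) are my own assumptions, not something any of the tools mentioned above does.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Clean up text pasted from a scanned PDF (hypothetical helper)."""
    lines = [line.strip() for line in raw.splitlines()]
    # Drop lines that hold only a page number, e.g. "12" or "- 12 -".
    # Assumption: a bare 1-4 digit line is never real content.
    lines = [ln for ln in lines
             if not re.fullmatch(r"[-–\s]*\d{1,4}[-–\s]*", ln)]
    text = "\n".join(lines)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single line break is just wrapping: turn it into a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of blank lines into one paragraph break.
    text = re.sub(r"\n{2,}", "\n\n", text)
    # Collapse repeated spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "Die Grundlagen der Arith-\nmetik sind\nwichtig.\n\n12\n\nZweiter Absatz."
    print(tidy_ocr_text(sample))
```

Note that the hyphen rule will also merge genuine hyphenated compounds that happen to fall at a line end, and a year standing alone on a line would be dropped as a page number, so the result still needs proofreading.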
05-02-2013, 12:09 PM | #3 | |
Addict
Posts: 261
Karma: 110864
Join Date: Mar 2013
Location: Bordeaux, France
Device: Kobo Glo, Aura HD, kindle paperwhite
|
Quote:
... So, as far as you know, Calibre currently has no specific option for converting a batch of PDF files of this kind (using the heuristics options to avoid doing all the tidying up by hand)? Thanks again for any hint. François |
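To make the question concrete, this is roughly what such a batch run would look like, as a sketch in Python that only builds the command lines for calibre's `ebook-convert` tool. The helper name `build_convert_commands` and the output folder `epub` are assumptions for illustration; `--enable-heuristics` and `--no-images` are existing `ebook-convert` options. As described in the first post, this did not produce any text for this kind of PDF, so the sketch shows the form of the batch, not a fix.

```python
from pathlib import Path


def build_convert_commands(pdf_names, out_dir="epub"):
    """Build one ebook-convert command per PDF (hypothetical helper).

    Nothing is executed here; each command is returned as a list of
    arguments that could be handed to subprocess.run().
    """
    commands = []
    for name in sorted(pdf_names):
        src = Path(name)
        if src.suffix.lower() != ".pdf":
            continue  # skip anything that is not a PDF
        dst = Path(out_dir) / (src.stem + ".epub")
        commands.append([
            "ebook-convert", str(src), str(dst),
            "--enable-heuristics",  # calibre's heuristic processing
            "--no-images",          # PDF input option: skip page images
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_convert_commands(["b.pdf", "a.PDF", "notes.txt"]):
        print(" ".join(cmd))
```

Each returned list could then be run with `subprocess.run(cmd, check=True)` once calibre's command-line tools are on the PATH.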
|
05-02-2013, 09:44 PM | #4 | |
null operator (he/him)
Posts: 20,532
Karma: 26944418
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
I'm not aware that the 'text' can be inserted 'behind' the image within the PDF itself. But I am not as up to speed with these issues as I once was - so perhaps that is what's happening.
I ran your PDF through the MobiCreator PDF converter - I've put the output into an attached zip - it's interesting, even if not very useful - but it does at least contain the text.
I'm told that the Google on-line OCR PDF scanner is as good as or better than some of the free PDF OCR scanners, but I don't know if it does bulk scanning. Given that you're looking at 'old documents', it is possible that Google have already OCR'd them - maybe the University would know that.
I also rescanned the PDF with Omnipage; after some tweaks to the settings I was able to remove page numbers and improve some other things, but the output was still in need of 'tidying up'. I doubt there is a solution to that. Project Gutenberg uses volunteer proofreaders to do the tidying up; I'm not sure what Google does.
BR |
|
05-03-2013, 05:23 PM | #5 |
Addict
Posts: 261
Karma: 110864
Join Date: Mar 2013
Location: Bordeaux, France
Device: Kobo Glo, Aura HD, kindle paperwhite
|
Hi BetterRed,
Thank you very much for your work and the file you sent. Indeed the result is very interesting, keeping both the images and the text! I am also impressed by the way you could get rid of some redundant pieces of information. I guess I really should get Omnipage or MobiCreator!! I think my Calibre conversion never got that good! (no offense David) I'll do some more testing and keep you updated. Thanks again for pointing me to other converters. François |
05-08-2013, 12:46 AM | #6 |
null operator (he/him)
Posts: 20,532
Karma: 26944418
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Hi François
Yesterday I installed the Nitro PDF PRO program for one of my projects. I still had your PDF lying around, so I had Nitro PDF convert it to a Word .doc file, which I saved as Filtered HTML and then converted to ePub with Calibre.
The results are in the attached zip - it's arguably the best yet.
You would still need to do some tidying up in Word or OOo Writer before converting, or you could use Sigil to do that on the ePub.
The conversion to Word was done in two phases: an image processing phase, which took 10-15 minutes, and a document creation phase, which took 2-3 minutes. That was slower than Omnipage, but IMO the result is considerably better.
BR
Last edited by BetterRed; 05-08-2013 at 12:50 AM. Reason: typo |
05-09-2013, 03:51 AM | #7 |
Addict
Posts: 261
Karma: 110864
Join Date: Mar 2013
Location: Bordeaux, France
Device: Kobo Glo, Aura HD, kindle paperwhite
|
Hi BetterRed,
Thank you so much for your time and for trying Nitro: it does look pro. It seems to be more faithful to the original layout than Omnipage. Nevertheless, I find the first files you sent me more accurate at the text recognition level... and neither is arguably better than the copy-paste of the original PDF's OCR. Do you intend to publish an extensive comparison of PDF conversion tools? I would be an interested audience. Thank you again for sharing your results. François |