no text extraction for pdf with images and OCR

fxp33 · 05-01-2013, 01:42 PM

Hi,

I tried to convert pdf files containing the recognised text inside the original images of the book: http://www.freidok.uni-freiburg.de/v...f_der_Zahl.pdf

As you can see, the pdf is made of pictures (page images), but you can select the text inside, and even do a "copy all" and paste it into a text editor.

If I convert the pdf to epub with --no-images, there is absolutely no text inside the epub.
If I convert with images, only images (reduced) are in the epub.

Is there a way to get the text of such a pdf without the images of the pages?

(calibre version 0.9.28; win XP sp3; adobe pdf 10.1.4)

Thanks for your help

François

BetterRed · 05-01-2013, 07:54 PM

Some pdf readers (eg Tracker's PDF XChange and the one that's built into Firefox) include lightweight OCR - that's what's happening when you do a select all, copy and paste into a text editor.

I would paste the text into my WP program and do some tidying, e.g. removing the page numbers and prettifying the front page, then save it in a format that Calibre can handle. In Open Office Writer that would be ePub via the OO add-in; in MS Office that would be Filtered HTML, which you would then convert to ePub.
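A minimal sketch of the "removing the page numbers" part of that tidying, for anyone who would rather script it than do it by hand. It assumes the page numbers in the pasted text sit alone on their own line, optionally wrapped in hyphens; the function name `strip_page_numbers` is just an illustration, not part of any tool mentioned here.

```python
import re

# Assumption: a page number is a line holding only 1-4 digits,
# optionally wrapped in hyphens (e.g. "12" or "- 12 -").
PAGE_NUMBER = re.compile(r"^\s*-?\s*\d{1,4}\s*-?\s*$")


def strip_page_numbers(text: str) -> str:
    """Drop lines that look like bare page numbers; keep everything else."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "Erster Absatz.\n12\nZweiter Absatz im Jahr 1884.\n- 13 -\nEnde."
    print(strip_page_numbers(sample))
```

Numbers inside a sentence (like a year) are left alone because the pattern only matches whole lines. Real OCR output will probably need a looser or stricter pattern, so check a sample first.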

You could OCR the PDF, free tools here ==>> http://www.makeuseof.com/tag/3-free-...ble-documents/

However, I just scanned the PDF with Nuance's Omnipage; it was only marginally better than cut & paste from Tracker PDF XChange and would still need some tidying up.

BR

fxp33 · 05-02-2013, 12:09 PM

Quote:

Originally Posted by BetterRed

However I just scanned the PDF with Nuance's Omnipage, it was only marginally better than cut & paste from Tracker PDF XChange and would still need some tidying up.

Thank you for testing it. I really thought the pdf had already been OCRed with Omnipage or another program, to allow copying the text "behind" the image.

... So, as far as you know, there is currently no specific Calibre parameter for converting a batch of pdf files of this kind (using the heuristics options to avoid doing all the tidying up by hand)?

Thanks again for any hint.

François

BetterRed · 05-02-2013, 09:44 PM

Quote:

Originally Posted by fxp33

Thank you for testing it. I really thought the pdf had already been OCRed with Omnipage or another program, to allow copying the text "behind" the image.

... So, as far as you know, there is currently no specific Calibre parameter for converting a batch of pdf files of this kind (using the heuristics options to avoid doing all the tidying up by hand)?

Thanks again for any hint.

François

@François - I should have written - I think some pdf readers... include lightweight OCR...

I'm not aware that the 'text' can be inserted 'behind' the image within the PDF itself. But I am not as up to speed with these issues as I once was - so perhaps that is what's happening.

I ran your PDF through the MobiCreator PDF converter - I've put the output into an attached zip - it's interesting, even if not very useful - but it does at least contain the text.

I'm told that the Google on-line OCR PDF scanner is as good as or better than some of the free PDF OCR scanners, but I don't know if it does bulk scanning. Given that you're looking at 'old documents', it is possible that Google have already OCR'd them - maybe the University would know that.

I also rescanned the PDF with Omnipage. After some tweaks to the settings I was able to remove page numbers and improve some other things, but the output was still in need of 'tidying up'. I doubt there is a solution to that: Project Gutenberg uses volunteer proofreaders to do the tidying up, and I'm not sure what Google does.

BR

fxp33 · 05-03-2013, 05:23 PM

Hi BetterRed,

Thank you very much for your work and the file you sent.
Indeed the result is very interesting, keeping the images and the text!

I am also impressed by the way you could get rid of some redundant pieces of information.

I guess I really should get Omnipage or Mobicreator!! I think my Calibre conversion never got that good! (no offense David)

I'll do some more testing and keep you updated.

Thanks again for pointing me to other converters.

François

BetterRed · 05-08-2013, 12:46 AM

Hi François

Yesterday I installed the Nitro PDF PRO program for one of my projects.

I still had your PDF lying around, so I had Nitro PDF convert it to a Word .doc file, which I saved as Filtered HTML and then converted to ePub with Calibre.

The results are in the attached zip - it's arguably the best yet.

You would still need to do some tidying up in Word or OOo Writer before converting, or you could use Sigil to do that on the ePub.

The conversion to Word was done in two phases: an image processing phase, which took 10-15 minutes, and a document creation phase, which took 2-3 minutes. That was slower than Omnipage, but IMO the result is considerably better.
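For the last step of that chain (Filtered HTML to ePub) on a whole folder of files, here is a rough sketch. It assumes Calibre's `ebook-convert` command-line tool is on the PATH and uses its `--enable-heuristics` option; the helper names `build_command` and `convert_folder` and the folder layout are made up for illustration. Heuristics can reduce the manual tidying, but I would not expect them to replace it.

```python
import subprocess
from pathlib import Path


def build_command(source: Path, out_dir: Path) -> list[str]:
    """Build the ebook-convert call for one HTML file (hypothetical helper)."""
    target = out_dir / (source.stem + ".epub")
    return ["ebook-convert", str(source), str(target), "--enable-heuristics"]


def convert_folder(in_dir: Path, out_dir: Path) -> None:
    """Convert every .html file in in_dir to an ePub in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for source in sorted(in_dir.glob("*.html")):
        # check=True stops the batch at the first failed conversion
        subprocess.run(build_command(source, out_dir), check=True)


# Usage (folder names are placeholders):
# convert_folder(Path("filtered_html"), Path("epub_out"))
```

Keeping the command construction separate from the loop makes it easy to print the commands first and inspect them before running anything.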

BR

fxp33 · 05-09-2013, 03:51 AM

Hi BetterRed,

Thank you so much for your time and for trying Nitro: it does look pro.

It seems to be more faithful to the original layout than Omnipage.

Nevertheless, I find the first files you sent me more accurate at the text recognition level.

... and neither is arguably better than the copy-paste of the original pdf's OCR text.

Do you intend to publish an extensive comparison of pdf conversion tools? I would be an interested audience.

Thank you again for sharing your results.

François

wubuer · 12-15-2015, 07:22 AM

Quote:

Originally Posted by BetterRed

@François - I should have written - I think some pdf readers... include lightweight OCR...

I'm not aware that the 'text' can be inserted 'behind' the image within the PDF itself. But I am not as up to speed with these issues as I once was - so perhaps that is what's happening.

I ran your PDF through the MobiCreator PDF converter - I've put the output into an attached zip - it's interesting, even if not very useful - but it does at least contain the text.

I'm told that the Google on-line OCR PDF scanner is as good as or better than some of the free PDF OCR scanners, but I don't know if it does bulk scanning. Given that you're looking at 'old documents', it is possible that Google have already OCR'd them - maybe the University would know that.

I also rescanned the PDF with Omnipage. After some tweaks to the settings I was able to remove page numbers and improve some other things, but the output was still in need of 'tidying up'. I doubt there is a solution to that: Project Gutenberg uses volunteer proofreaders to do the tidying up, and I'm not sure what Google does.

BR

Thanks for the information, it's useful.
