MobileRead Forums > E-Book Software > Calibre > Conversion

05-01-2013, 01:42 PM   #1
fxp33
Addict
Posts: 261
Karma: 110864
Join Date: Mar 2013
Location: Bordeaux, France
Device: Kobo Glo, Aura HD, kindle paperwhite
No text extraction for PDF with images and OCR

Hi,

I tried to convert PDF files that contain the recognised text embedded in the original page images of the book: http://www.freidok.uni-freiburg.de/v...f_der_Zahl.pdf

As you can see, the PDF is made of page images, but you can select the text in them, and even do a "copy all" and paste it into a text editor.

If I convert the PDF to EPUB with --no-images, the EPUB contains no text at all.
If I convert with images, the EPUB contains only the (reduced) images.

Is there a way to get the text of such a PDF without the images of the pages?

(calibre version 0.9.28; Win XP SP3; Adobe PDF 10.1.4)

Thanks for your help

François
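
For anyone who wants to check this kind of PDF from a script, here is a minimal sketch. It assumes Poppler's `pdftotext` command-line tool is installed and on the PATH; it is not part of Calibre. The function names and the 200-character threshold are made up for illustration. It pulls out whatever text the PDF exposes and makes a rough guess at whether that is real text or just noise.

```python
import shutil
import subprocess


def extract_text_layer(pdf_path: str) -> str:
    """Return the text that pdftotext can read from the PDF (may be empty)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(text: str, min_chars: int = 200) -> bool:
    """Rough heuristic (an assumption): enough non-space characters means real text."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars
```

If `has_text_layer(extract_text_layer("book.pdf"))` is true (the file name is hypothetical), the extracted text can be saved to a .txt file and tidied before conversion.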
05-01-2013, 07:54 PM   #2
BetterRed
null operator (he/him)
Posts: 20,477
Karma: 26944418
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Some PDF readers (e.g. Tracker's PDF-XChange and the one that's built into Firefox) include lightweight OCR - that's what's happening when you do a select all, copy and paste into a text editor.

I would paste the text into my word processor and do some tidying (e.g. removing the page numbers and prettifying the front page), then save it in a format that Calibre can handle. In OpenOffice Writer that would be EPUB via the OO add-in; in MS Office that would be Filtered HTML, which you would then convert to EPUB.

You could also OCR the PDF yourself; there are free tools here: http://www.makeuseof.com/tag/3-free-...ble-documents/

However, I just scanned the PDF with Nuance's OmniPage. The result was only marginally better than cut and paste from Tracker's PDF-XChange, and it would still need some tidying up.

BR
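
The tidying step described above (removing page numbers and so on) can be partly scripted. This is only a rough sketch under stated assumptions: the function name is hypothetical, a "page number" is taken to be a line holding nothing but 1 to 4 digits (optionally wrapped in dashes), and every line-end hyphen is treated as a soft hyphen. Both assumptions will misfire sometimes, e.g. on a year standing alone on a line or on a genuinely hyphenated compound, so the result still needs proofreading.

```python
import re


def tidy_extracted_text(raw: str) -> str:
    """Remove bare page-number lines and rejoin words split across line breaks."""
    # Form feeds mark page breaks in many text dumps; treat them as newlines.
    lines = raw.replace("\f", "\n").splitlines()
    # Assumption: a line that is only digits (optionally "- 12 -") is a page number.
    page_number = re.compile(r"\s*[-–]?\s*\d{1,4}\s*[-–]?\s*")
    kept = [line for line in lines if not page_number.fullmatch(line)]
    text = "\n".join(kept)
    # Assumption: a hyphen at a line end is a soft hyphen ("Be-\ngriff" -> "Begriff").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse the runs of blank lines left behind by removed page numbers.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The cleaned text can then be saved as a .txt or HTML file and converted with Calibre as usual.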
05-02-2013, 12:09 PM   #3
fxp33
Quote:
Originally Posted by BetterRed View Post
However, I just scanned the PDF with Nuance's OmniPage. The result was only marginally better than cut and paste from Tracker's PDF-XChange, and it would still need some tidying up.
Thank you for testing it. I really thought the PDF had already been OCRed with OmniPage or another program, so that the text "behind" the image could be copied.

... So, as far as you know, Calibre currently has no specific option for converting a batch of PDF files of this kind (using the heuristics options to avoid doing all the tidying up by hand)?

Thanks again for any hint.

François
05-02-2013, 09:44 PM   #4
BetterRed
@François - I should have written: "I think some PDF readers... include lightweight OCR..."

I'm not aware that the 'text' can be inserted 'behind' the image within the PDF itself. But I am not as up to speed with these issues as I once was - so perhaps that is what's happening.

I ran your PDF through the MobiCreator PDF converter and put the output into the attached zip. It's interesting, even if not very useful, but it does at least contain the text.

I'm told that the Google online OCR PDF scanner is as good as or better than some of the free PDF OCR scanners, but I don't know if it does bulk scanning. Given that you're looking at 'old documents', it is possible that Google have already OCR'd them - maybe the University would know that.

I also rescanned the PDF with OmniPage. After some tweaks to the settings I was able to remove page numbers and improve some other things, but the output was still in need of 'tidying up'. I doubt there is a solution to that: Project Gutenberg uses volunteer proofreaders to do the tidying up, and I'm not sure what Google does.

BR
Attached Files
File Type: zip Husserl_Ueber_den_Begriff_der_Zahl.zip (8.98 MB, 391 views)
05-03-2013, 05:23 PM   #5
fxp33
Hi BetterRed,

Thank you very much for your work and the file you sent.
Indeed, the result is very interesting: it keeps both the images and the text!

I am also impressed by how you managed to get rid of some redundant pieces of information.

I guess I really should get OmniPage or MobiCreator! I think my Calibre conversions never got that good (no offense, David).

I'll do some more testing and keep you updated.

Thanks again for pointing me to other converters.

François
05-08-2013, 12:46 AM   #6
BetterRed
Hi François

Yesterday I installed the Nitro PDF PRO program for one of my projects.

I still had your PDF lying around, so I had Nitro PDF convert it to a Word .doc file, which I saved as Filtered HTML and then converted to EPUB with Calibre.

The results are in the attached zip - it's arguably the best yet.

You would still need to do some tidying up in Word or OOo Writer before converting, or you could use Sigil to do that on the EPUB.

The conversion to Word was done in two phases: an image processing phase that took 10-15 minutes, and a document creation phase that took 2-3 minutes. That was slower than OmniPage, but IMO the result is considerably better.

BR
Attached Files
File Type: zip Desktop.zip (503.3 KB, 388 views)

Last edited by BetterRed; 05-08-2013 at 12:50 AM. Reason: typo
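
For a whole batch of files, the last step of this workflow (Filtered HTML to EPUB with Calibre) could be scripted roughly as below. This is a sketch, not a tested recipe: it assumes Calibre's `ebook-convert` command-line tool is on the PATH and that the HTML files saved from Word sit together in one folder. The function names and folder layout are hypothetical. `--enable-heuristics` is the `ebook-convert` switch for Calibre's heuristic processing, which may reduce, but will not replace, the manual tidying.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source: Path, out_dir: Path) -> list[str]:
    """Build the ebook-convert call that turns one HTML file into an EPUB."""
    target = out_dir / (source.stem + ".epub")
    return ["ebook-convert", str(source), str(target), "--enable-heuristics"]


def convert_all(src_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every .htm/.html file in src_dir and return the EPUB paths."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert (Calibre) was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for source in sorted(src_dir.glob("*.htm*")):
        command = build_convert_command(source, out_dir)
        # check=True stops the batch at the first failed conversion.
        subprocess.run(command, check=True)
        converted.append(Path(command[2]))
    return converted
```

Calling `convert_all(Path("html_out"), Path("epubs"))` (both folder names are placeholders) would write one EPUB per HTML file.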
05-09-2013, 03:51 AM   #7
fxp33
Hi BetterRed,

Thank you so much for your time and for trying Nitro: it does look pro.

It seems to be more faithful to the original layout than OmniPage.

Nevertheless, I find the first files you sent me more accurate in terms of text recognition.

... and arguably neither is better than a copy-paste of the original PDF's OCR text.

Do you intend to publish an extensive comparison of PDF conversion tools? I would be an interested reader.

Thank you again for sharing your results.

François
12-15-2015, 07:22 AM   #8
wubuer
Junior Member
Posts: 1
Karma: 10
Join Date: Dec 2015
Device: none
Thanks for the information; it's useful.

All times are GMT -4.