Old 05-02-2013, 09:44 PM   #4
BetterRed
null operator (he/him)
Posts: 20,532
Karma: 26944418
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by fxp33
Thank you for testing it. I really thought the pdf was already OCRed with omnipage or another programm, to allow the copy of text "behind" the image.

... So, for you, there is, at the moment, no specific parameter for Calibre to convert a bunch of pdf files of this kind (with the heuristics arguments to avoid doing by hand all the tidying up)?

Thanks again for any hint.

François
@François - what I should have written is: I think some PDF readers... include lightweight OCR...

I'm not aware that the 'text' can be inserted 'behind' the image within the PDF itself. But I am not as up to speed with these issues as I once was, so perhaps that is what's happening.

I ran your PDF through the MobiCreator PDF converter and put the output into the attached zip. It's interesting, even if not very useful, but it does at least contain the text.

I'm told that the Google on-line OCR PDF scanner is as good as or better than some of the free PDF OCR scanners, but I don't know if it does bulk scanning. Given that you're looking at 'old documents', it is possible that Google have already OCR'd them - maybe the University would know.

I also rescanned the PDF with Omnipage. After some tweaks to the settings I was able to remove page numbers and improve some other things, but the output was still in need of 'tidying up'. I doubt there is a solution to that. Project Gutenberg uses volunteer proof readers to do the tidying up; I'm not sure what Google does.
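
If you still want to try Calibre's heuristics over a whole folder of PDFs, something like the sketch below might be a starting point. It assumes Calibre's ebook-convert command line tool is installed and on your PATH, and it only passes the general --enable-heuristics switch. The function names, the folder names and the EPUB output format are my own placeholders, not anything from your setup. Bear in mind that Calibre does not do OCR, so this can only help with PDFs that already contain a text layer, and I would not expect it to remove the need for proof reading.

```python
from pathlib import Path
import shutil
import subprocess


def build_commands(pdf_dir, out_dir, out_ext="epub"):
    """Return one ebook-convert command (as an argument list) per PDF in pdf_dir."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Output file keeps the PDF's base name, with the new extension.
        target = Path(out_dir) / (pdf.stem + "." + out_ext)
        commands.append(
            ["ebook-convert", str(pdf), str(target), "--enable-heuristics"]
        )
    return commands


def convert_all(pdf_dir, out_dir):
    """Run every conversion in turn; needs Calibre's ebook-convert on PATH."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(pdf_dir, out_dir):
        # check=True stops the batch at the first failed conversion.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical folder names, change them to suit.
    convert_all("scanned_pdfs", "converted")
```

Building the command list separately from running it means you can print the commands and try one by hand before letting the whole batch loose.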

BR
Attached Files
File Type: zip Husserl_Ueber_den_Begriff_der_Zahl.zip (8.98 MB, 394 views)