Quote:
Originally Posted by willus
As I suspected, the images are stored in JPEG 2000 format (you can see this when you use the k2pdfopt -i option), which taxes most PDF readers significantly more than JPEG or PNG. Moreover, they are 600 dpi--very high res. That is probably why your reader does not like displaying the file--not because of the hidden text. The default k2pdfopt output is PNG ("Flate"), which is much faster to display, but, as you noted, balloons the file size considerably depending on your chosen resolution and color depth. You might try leaving OCR selected (-ocr m) rather than disabling it. I'll bet it will still work fine and you'll then be able to search the document.
There is not a trivial way to simply remove hidden text from a PDF and leave everything else exactly the way it is. I could maybe make it easier to use the method I showed you with a single command-line option to try to intelligently choose the parameters, but in terms of leaving all of the bitmaps in exactly their original format (highly compressed JPEG 2000), I don't have a way to do that.
Thank you! It is not the perfect one-button solution for all my problems, but now I understand what is happening!
I learned about JPEG 2000 just 2 minutes ago when downloading a set of scanned images from archive.org and failing to make scantailor work on them. Talk about synchronicity!
Way better support than I've ever had from any company! You're awesome!
Just out of curiosity, do you have a guess as to whether KOReader would do a better job with this kind of PDF than the standard Nickel software on my Kobo TouchC? And how did you find out the resolution of the images in the PDF? Is there an option for that in k2pdfopt? I couldn't find it. And are the JPX and JBIG2 shown in brackets in the -i output the file formats of the images, then?