Minimising the size of a page-scan PDF
My university library, to which I have access as an alumnus, has a superb collection of the classic works of Egyptology, many of which were published in the 19th and early 20th centuries and are long out of copyright. I have, therefore, been gradually borrowing and copying them for my personal collection. Because Egyptology books tend to have things like hieroglyphs in them, as well as lots of drawings and other illustrations, I'm creating page-scanned PDFs to read on my iPad. My process is as follows:
1. Put the book on the floor in good light.
2. Photograph each page with my DSLR.
3. Process the raw images in Adobe Lightroom to boost contrast, trim margins, etc.
4. Export all the page images as JPEGs.
5. Zip all the images up and rename the ZIP file to have a ".CBZ" extension.
6. Import the CBZ file into Calibre.
7. Do a conversion to PDF in Calibre.
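Step 5 can be scripted rather than done by hand. The following is a minimal sketch using only Python's standard library; the function name `build_cbz` and the folder and file names in the usage note are hypothetical, and it assumes the exported pages sort correctly by file name (e.g. `page_001.jpg`, `page_002.jpg`).

```python
import zipfile
from pathlib import Path


def build_cbz(image_dir, cbz_path):
    """Pack every JPEG in image_dir into a CBZ (a ZIP with a different extension)."""
    # Sort by file name so the pages keep their order in the archive.
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )
    # JPEGs are already compressed, so store them without recompressing.
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return [page.name for page in pages]
```

Calling something like `build_cbz("exported_pages", "my_book.cbz")` would then produce a file ready for step 6.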
With this method I can create a beautiful page-scan PDF of a 200-page book in about three hours, and it looks superb on my iPad.
But...
... It's huge!
A high-quality JPEG image of a single page is typically around 600KB, so a 200-page book ends up as a PDF file about 120MB in size (the size of the PDF is basically just the sum of the sizes of all the page images). This is a really, really nice PDF that I can zoom in on quite a lot on my iPad (handy for images) with it still looking good.
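The arithmetic can be written out as a quick sketch. The function name is made up for illustration, it uses decimal units (1 MB = 1000 KB) to match the rough figures here, and it ignores whatever small overhead the PDF container adds.

```python
def estimate_pdf_size_mb(page_count, kb_per_page):
    """Estimate PDF size as the sum of the page images, ignoring PDF overhead."""
    # Decimal units (1 MB = 1000 KB), matching the rough figures in the post.
    return page_count * kb_per_page / 1000


print(estimate_pdf_size_mb(200, 600))  # 120.0
print(estimate_pdf_size_mb(200, 60))   # 12.0
```

So a PDF a tenth of the size would need page images averaging about 60KB each.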
A 120MB PDF isn't particularly a problem on my 128GB iPad, but I notice that most equivalent page-scan PDFs I download from "archive.org" are only 10-20MB in size, i.e. roughly a tenth of the size of mine. They don't look quite as good as mine, but they're pretty good!
Does anyone know how they do this? If I reduce the size and/or quality of my page images from 600KB to 60KB, the result looks appalling. How could I get PDFs a tenth the size of the ones I'm currently creating which still look reasonable?
Any advice would be gratefully received!
I think I actually may know a part of the reason myself. Because I'm photographing the pages with an excellent camera, rather than scanning them, my page images are superb quality. If I zoom in I can see the individual fibres in the surface of the paper. All that "information" in the image is producing big JPEGs. The "archive.org" images just have a "flat" paper background which presumably results in small images, because the text is pretty much the only information on the page. Anyone know how I can remove that fine detail from my page images without making the words blurry too?
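To illustrate the idea, here is a rough sketch of one possible approach, a "levels" clip that pushes near-white paper to pure white and near-black ink to pure black. Everything in it is an assumption for illustration: the page is synthetic, the thresholds of 60 and 200 are arbitrary, and `zlib` stands in for a real image codec, so it only shows the general effect of texture on compressed size, not what "archive.org" actually does.

```python
import random
import zlib


def clip_levels(pixels, black=60, white=200):
    """Map greys <= black to 0 and >= white to 255, stretching values in between."""
    out = bytearray()
    for value in pixels:
        if value <= black:
            out.append(0)
        elif value >= white:
            out.append(255)
        else:
            out.append(round((value - black) * 255 / (white - black)))
    return bytes(out)


def synthetic_page(width=200, height=200, seed=1):
    """Fake greyscale page: noisy light 'paper' with some dark 'ink' rows."""
    rng = random.Random(seed)
    pixels = []
    for y in range(height):
        ink_row = y % 20 < 4
        for x in range(width):
            if ink_row and 20 <= x < 180:
                pixels.append(rng.randint(10, 40))    # dark ink
            else:
                pixels.append(rng.randint(215, 245))  # textured paper
    return bytes(pixels)


page = synthetic_page()
flat = clip_levels(page)
# Compare compressed sizes before and after flattening the background.
print(len(zlib.compress(page)), len(zlib.compress(flat)))
```

Because the clip remaps brightness per pixel rather than averaging neighbouring pixels, it should not blur the edges of the text the way a smoothing filter would. Real pages would need the thresholds tuned by eye.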
Last edited by HarryT; 08-04-2016 at 05:20 PM.