08-04-2016, 05:17 PM | #1 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Minimising the size of a page-scan PDF
My university library, to which I have access as an alumnus, has a superb collection of the classic works of Egyptology, many of which were published in the 19th and early 20th centuries and are long out of copyright. I have, therefore, been gradually borrowing and copying them for my personal collection. Because Egyptology books tend to have things like hieroglyphs in them, as well as lots of drawings and other illustrations, I'm creating page-scanned PDFs to read on my iPad. My process is as follows:
1. Put the book on the floor in good light.
2. Photograph each page with my DSLR.
3. Process the raw images in Adobe Lightroom to boost contrast, trim margins, etc.
4. Export all the page images as JPEGs.
5. Zip all the images up and rename the ZIP file to have a ".CBZ" extension.
6. Import the CBZ file into Calibre.
7. Do a conversion to PDF in Calibre.
With this method I can create a beautiful page-scan PDF of a 200-page book in about 3 hours, and it looks superb on my iPad. But... it's huge! A high-quality JPEG image of a single page is typically around 600 KB, so a 200-page book ends up as a PDF file about 120 MB in size (the size of the PDF is basically just the sum of the sizes of all the page images). This is a really, really nice PDF that I can zoom in on quite a lot on my iPad (handy for images) with it still looking good.
A 200 MB PDF isn't particularly a problem on my 128 GB iPad, but I notice that most equivalent page-scan PDFs I download from "archive.org" are only 10-20 MB in size, i.e. 5-10% of the size of mine. They don't look quite as good as mine, but they're pretty good! Does anyone know how they do this? If I reduce the size and/or quality of my page images from 600 KB to 60 KB the result looks appalling. How could I get PDFs a tenth the size of the ones I'm currently creating which still look reasonable? Any advice would be gratefully received!
I think I may actually know part of the reason myself. Because I'm photographing the pages with an excellent camera, rather than scanning them, my page images are superb quality. If I zoom in I can see the individual fibres in the surface of the paper. All that "information" in the image is producing big JPEGs. The "archive.org" images just have a "flat" paper background, which presumably results in small images, because the text is pretty much the only information on the page. Does anyone know how I can remove that fine detail from my page images without making the words blurry too?
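Step 5 of the process above (zipping the page images and giving the archive a ".CBZ" name) can be scripted. This is only a minimal sketch in Python, assuming the exported JPEGs sit in one folder and sort into page order by filename; the function name `make_cbz` and the folder and file names are made up for illustration. The images are stored without ZIP compression, because JPEG data is already compressed.

```python
import zipfile
from pathlib import Path


def make_cbz(image_dir, cbz_path):
    """Pack every .jpg in image_dir into a CBZ (a ZIP archive with a .cbz name).

    Returns the total size in bytes of the images that were added.
    """
    # Sort by filename so the pages appear in reading order
    # (assumes zero-padded names such as 001.jpg, 002.jpg, ...).
    pages = sorted(Path(image_dir).glob("*.jpg"))
    total_bytes = 0
    # ZIP_STORED = no compression; the JPEGs are already compressed.
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_STORED) as cbz:
        for page in pages:
            cbz.write(page, arcname=page.name)
            total_bytes += page.stat().st_size
    return total_bytes
```

A call such as `make_cbz("exported_pages", "book.cbz")` returns the summed size of the page images, which should be very close to the size of the archive itself and gives a quick estimate of how big the resulting PDF is likely to be.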
Last edited by HarryT; 08-04-2016 at 05:20 PM. |
08-06-2016, 08:15 AM | #2 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
How do you reduce the size and quality of the images? I'd try first going to grayscale (unless the page has colours), and then adjusting the levels (not just the contrast) to have a white background with no texture and black text. Then reduce the pixel size to the minimum you'll be satisfied with, and save as JPG with the highest compression level you are satisfied with. Reducing the number of grey levels to something like 16 before saving could also help.
And do you need to convert it to PDF? Can't you just read the CBZ (CBR or CB7 could be somewhat smaller)? If you post a couple of sample pages I could have a go and give you some specific settings... |
08-06-2016, 08:28 AM | #3 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Hi Jellby,
The reason I prefer PDF is that it's much more portable than CBZ. Please find attached a couple of sample pages. These are at the minimum resolution I consider to be readable. I need to have the hieroglyphs sharp and clear. Any suggestions for significantly reducing the size of the page image would be very gratefully received. |
08-06-2016, 09:40 AM | #4 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Let's see if this helps. I did this with GIMP, which may behave differently from Photoshop.
015:
Changed to greyscale mode.
Adjusted levels: input levels from 70 (black) to 230 (white).
Saved as JPG with 85% quality (my default). Result: 015 -_(2).jpg, 40% size.
Saved as JPG with 30% quality. Result: 015 -_(3).jpg, 19% size.
Changed to indexed color mode, 16 maximum levels. Saved as PNG, maximum compression. Result: 015 -_(4).png, 18% size.
225: Same process, but adjusted the levels between 85 and 215.
225 -_(2).jpg: 38%.
225 -_(3).jpg: 19%.
225 -_(4).png: 16%.
I think the 30% quality JPGs are too aggressive and blurry, but the PNGs look quite acceptable. |
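For anyone who wants to script the levels adjustment and the reduction to 16 grey levels rather than do them by hand, here is a rough sketch in plain Python. It assumes a simple linear stretch between the two input levels (gamma 1) and evenly spaced output greys; GIMP's own Levels tool and its indexed-colour conversion may differ in detail. The function names are invented for this example, and the image is represented as rows of 0-255 integers rather than loaded from a file.

```python
def adjust_levels(value, black, white):
    """Map an 8-bit grey value so that `black` -> 0 and `white` -> 255."""
    if value <= black:
        return 0
    if value >= white:
        return 255
    # Linear stretch of the range between the two input levels.
    return int((value - black) * 255 / (white - black) + 0.5)


def quantize(value, levels=16):
    """Snap an 8-bit grey value to the nearest of `levels` evenly spaced greys."""
    step = 255 / (levels - 1)
    return int(round(value / step) * step + 0.5)


def clean_page(pixels, black=70, white=230, levels=16):
    """Apply both steps to a greyscale image given as rows of 0-255 integers."""
    return [[quantize(adjust_levels(v, black, white), levels) for v in row]
            for row in pixels]
```

With the defaults (70 and 230, as used for page 015), everything at or below 70 becomes pure black and everything at or above 230 becomes pure white, so most of the paper texture collapses into a single value before the image is saved.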
08-06-2016, 09:46 AM | #5 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Thank you very much indeed for your excellent suggestions. I'll study the results and see what I think is acceptable. Really appreciate the help - thanks!
|
08-06-2016, 10:27 AM | #6 |
Wizard
Posts: 1,613
Karma: 6718479
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Surface Go & Kindle 3 - Retired: DellV8p, Clie UX50, ...
|
I think Jellby is on the right track. ZIP compression will do very little if any actual compression on a JPEG. The trick will be to reduce the size of the JPEGs before creating the CBZ/ZIP.
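The point about ZIP doing very little for JPEGs can be checked with a small experiment. This is only an illustrative sketch: it uses Python's `zlib` module (DEFLATE, the method ZIP archives normally use) on two made-up byte strings, with seeded pseudo-random bytes standing in for already-compressed JPEG data and a run of identical bytes standing in for a flat white background. No real image files are involved.

```python
import random
import zlib


def deflate_ratio(data):
    """Return compressed size divided by original size (DEFLATE at level 9)."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Stand-in for JPEG data: already-compressed data looks close to random noise.
noise_like = bytes(rng.getrandbits(8) for _ in range(100_000))
# Stand-in for an uncompressed, perfectly white page background.
flat_white = bytes([255]) * 100_000

print(f"noise-like data: {deflate_ratio(noise_like):.3f}")
print(f"flat white data: {deflate_ratio(flat_white):.3f}")
```

The first ratio should come out at roughly 1 (no saving) and the second far below 1, which is why the savings have to come from the images themselves rather than from the archive.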
If you should choose to use Photoshop, you should NEVER EVER use the Ps option to "Save as..." to create JPEGs for any CBZ, eBook, or web use. You should only use its "Save for Web and Devices..." option. "Save as..." embeds a whole plethora of Ps specific ancillary data (guides, ...) in the JPEG thus bloating the size. "Save for Web and Devices..." will not do this and offers additional options to strip even more metadata. As a result it will produce substantially smaller JPEGs when using the same quality settings. |
08-06-2016, 10:48 AM | #7 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Thanks, dwig. I use Adobe Lightroom, not Photoshop. LR doesn't add the junk that PS does.
|
08-08-2016, 09:35 PM | #8 |
Wizard
Posts: 2,986
Karma: 18343081
Join Date: Oct 2010
Location: Sudbury, ON, Canada
Device: PRS-505, PB 902, PRS-T1, PB 623, PB 840, PB 633
|
I wrote myself a program for whitening the background of grayscale images. Given some threshold, if a pixel is lighter than the threshold, and all of the eight surrounding pixels are lighter than the threshold, the central pixel is set to be white. It works surprisingly well for such a simple idea. The solid white background will then compress much better than a noisy gray background. It also greatly improves the contrast on an e-ink device.
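The rule described above is short enough to sketch in a few lines. This is not the original program, just a minimal Python version of the same idea, assuming the image is held as rows of 0-255 grey values and that neighbours falling outside the image are simply ignored for edge pixels (the post does not say how edges are handled). The name `whiten_background` is made up.

```python
def whiten_background(pixels, threshold):
    """Return a copy of a greyscale image with background pixels set to white.

    `pixels` is a list of rows of 0-255 integers. A pixel becomes 255 only if
    it and all of its neighbours are lighter than `threshold`.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = [row[:] for row in pixels]  # read from the original, write to the copy
    for y in range(height):
        for x in range(width):
            # Look at the 3x3 block centred on (x, y), clipped at the image edges
            # (assumption: neighbours outside the image are simply ignored).
            block_is_light = all(
                pixels[ny][nx] > threshold
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            if block_is_light:
                result[y][x] = 255
    return result
```

Light pixels that touch a dark one are left alone, so the soft grey edges of the letters survive while the open paper around them is flattened to pure white.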
|
08-09-2016, 09:01 AM | #9 |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
I have noticed that highly compressed PDFs I've looked at over the years tend to (1) use 1-bit color layers and/or (2) use JPEG-2000 / JPX compression streams. For instance, if I scan a black and white document that I've marked with a red pen on the copiers where I work, at high compression settings, the copier scanning algorithm creates two layers: a black/white 1-bit layer for the black and white text plus a separate red/transparent 1-bit layer to overlay my red markups. It seems like a pretty sophisticated algorithm. I'm sure if you enhance your contrast ratio via some of the suggestions already made, you should be able to compress to fewer shades of gray--maybe even 1 bit (just black and white), but I'm not sure you quite have the resolution for that.
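To give a feel for why 1-bit pages are so small, here is a rough Python sketch of the thresholding step only. It assumes a fixed threshold of 128 and a greyscale image held as rows of 0-255 integers, and it packs eight pixels into each byte, so the raw data is one eighth the size of 8-bit greyscale before any compression is applied. It does not attempt the layer separation or the compression schemes mentioned above, and the function names are invented for this example.

```python
def to_one_bit(pixels, threshold=128):
    """Threshold a greyscale image (rows of 0-255 ints) to 1 = ink, 0 = paper."""
    return [[1 if v < threshold else 0 for v in row] for row in pixels]


def pack_row(bits):
    """Pack a row of 0/1 values into bytes, 8 pixels per byte, first pixel in the top bit."""
    packed = bytearray()
    for start in range(0, len(bits), 8):
        chunk = bits[start:start + 8]
        chunk = chunk + [0] * (8 - len(chunk))  # pad the last byte with paper
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)
```

Whether a hard threshold keeps fine hieroglyph detail readable depends on the resolution of the page image, which is the doubt raised at the end of the post.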
|
08-09-2016, 09:11 AM | #10 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
The original is a camera RAW file with a resolution of 6000x4000 pixels and a 14-bit pixel depth (i.e. 16,384 intensity levels). See the attached full-resolution sample: you can clearly see the fibres of the paper surface.
|
08-09-2016, 09:29 AM | #11 | |
The Grand Mouse 高貴的老鼠
Posts: 71,506
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
Reducing to greyscale may make a JPEG a bit smaller, but if the page is essentially all text (no greyscale images), reducing to 4-bit greyscale and saving as PNG might work even better. Could you attach full-size page images, not ones already reduced to a smaller pixel size? I'd be interested in having a play this evening. |
|
08-15-2016, 03:12 PM | #12 |
Guru
Posts: 860
Karma: 4380
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
|
Hi HarryT
Maybe using another tool can help here. Using the 2 pages from your post I created a PDF using the method you describe: created a CBZ and converted it to PDF with Calibre. The resulting PDF was 560 KB in size.
Using Adobe Acrobat Pro 11 and applying OCR (French) with the "ClearScan" option, the result was a PDF 256 KB in size. From Wikipedia: "Adobe ClearScan technology creates and embeds custom Type1-CID fonts to match the visual appearance of a scanned document after optical character recognition. ClearScan uses these newly created custom fonts instead of system fonts or Type1-MM".
A similar result was obtained using FineReader Pro 11 and saving as image-only PDF: the file was 266 KB in size. Saving with the same option but activating the setting "use mixed raster content" gives an even smaller file size of 108 KB, but I do not advise using this option as the result is not good.
The final size of a PDF has a lot to do with the complexity, color and other details of the original pages, so to give you an idea I would need access to at least 30 to 50 pages or, if possible, all the JPGs of a full book.
Notes:
1 - other more professional (and much more expensive) pieces of software can get you even smaller file sizes;
2 - all 3 example files created are attached.
Best regards,
Last edited by DDHarriman; 08-16-2016 at 02:53 PM. |
|