Cropping Pages Permanently with Acrobat Pro - Page 2

willus · 11-08-2015, 12:11 AM

Quote:

Originally Posted by PHC

I bring it up because it re-encodes images and loses the TOC. Both of those are undesirable and eliminated by cpdf. And cpdf is much easier to use.

The method I posted does not re-encode the images in the PDF.

[Edit 9 Nov 15: My claim here is not correct. Ghostscript with "pdfwrite" output does re-encode images, so if the source PDF encodings are lossy, the output images can change--continue reading in thread.]

PHC · 11-08-2015, 01:31 AM

Quote:

Originally Posted by willus

The method I posted does not re-encode the images in the PDF.

OK, I just did a quick test. I extracted 10 pages from a scanned OCRed PDF using Acrobat. I then used your exact parameters, which are just the default ones you probably blindly copied from another post. Though I do that initially when I want to try something, I will then go and read the documentation and learn what other parameters I need to pay attention to. First off, the input file was 902kB, while the output file was 725kB. Second, I got an error:

Code:

GPL Ghostscript 9.15: Missing glyph CID=0, glyph=0067 in the font HiddenHorzOCR . The output PDF may fail with some viewers.

I then opened both files in Acrobat and maximized them and did a simple A-B comparison using the keyboard to switch rapidly back and forth multiple times. I looked at various pages and chose the better looking file without knowing which was which. Anything that was monochrome (black on white) was indistinguishable but grayscale images in the gs file definitely showed noticeable mosquito noise around lines and edges. So smaller file, missing glyph, noise, all indicators of lossiness. I didn't try it on text or vector graphics. It would no doubt be closer but probably not identical. In any case, using default settings for gs for every case is a mistake. You need to tweak the settings for each case. I've done countless hours of testing with many different settings and made notes on the results. Have you?

BTW, if I simply copy the PDF with cpdf to a new file, it is identical.

willus · 11-08-2015, 10:35 AM

Quote:

Originally Posted by PHC

OK, I just did a quick test. I extracted 10 pages from a scanned OCRed PDF using Acrobat...

Please post your source and converted files, the exact command you used, the version of Ghostscript you used, and the OS you did the conversion on. This method was sent to me in an e-mail and I verified that it worked. So "copied" yes, "blindly copied", no. Why would I change something that works? Attached are my examples. Command used (on Windows 10 PC):

Code:

C:\>gs
GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
GS>quit

C:\>gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=pooh_ghostscript_out.pdf pooh_src.pdf

Resulting PDF information in each file is below. Notice that the bitmap of the page has identical resolution, depth, and encoding method.

Code:

C:\>k2pdfopt -i pooh_src.pdf
k2pdfopt v2.33a (w/MuPDF,DjVuLibre,OCR) (c) 2015, GPLv3, http://willus.com
    Compiled Oct  3 2015 with Gnu C (Mingw64) v5.2.0 for Win64 on x64.

FILE:           pooh_src.pdf
PDF VERSION:    1.3
TITLE:          pooh.pdf
CREATED:        D:20100328132840
LAST MODIFIED:  D:20151108072341-08'00
PDF PRODUCER:   K2pdfopt v2.33a
FILE SIZE:      346.4 kB (354,746 bytes)
PAGES:          1

       Page       Ref           Details
Mediaboxes (1):
        1       (2 0 R):        [ 0 0 455.5 579.4 ] (6.33 x 8.05 in)

Fonts (2):
        1       (2 0 R):        Type1 'Helvetica' (0 0 R)
        1       (2 0 R):        Type1 'Helvetica' (0 0 R)

Images (1):
        1       (2 0 R):        [ Flate ] 949x1207 4bpc DevRGB (5 0 R)


C:\>k2pdfopt -i pooh_ghostscript_out.pdf
k2pdfopt v2.33a (w/MuPDF,DjVuLibre,OCR) (c) 2015, GPLv3, http://willus.com
    Compiled Oct  3 2015 with Gnu C (Mingw64) v5.2.0 for Win64 on x64.

FILE:           pooh_ghostscript_out.pdf
PDF VERSION:    1.4
CREATED:        D:20151108072511-08'00'
LAST MODIFIED:  D:20151108072511-08'00'
PDF PRODUCER:   GPL Ghostscript 9.02
FILE SIZE:      426.0 kB (436,225 bytes)
PAGES:          1

       Page       Ref           Details
Mediaboxes (1):
        1       (4 0 R):        [ 0 0 455.5 579.4 ] (6.33 x 8.05 in)

Fonts (1):
        1       (4 0 R):        Type1 'Helvetica' (9 0 R)

Images (1):
        1       (4 0 R):        [ Flate ] 949x1207 4bpc DevRGB (8 0 R)

PS. I have never argued that cpdf cannot make identical copies of a PDF or that ghostscript is better at it. I originally posted because the OP wanted to remove cropped content, and the method I posted removes cropped-out text (but not cropped-out images). If cpdf can remove cropped-out areas (without help from Acrobat), then please post how, otherwise I consider it irrelevant to why I posted.

PHC · 11-08-2015, 08:48 PM

Quote:

Originally Posted by willus

Resulting PDF information in each file is below. Notice that the bitmap of the page has identical resolution, depth, and encoding method.

The data doesn't mean anything. An image and its copy can have identical specs but be very different in quality. The proof is in the seeing. Here's a simple example:

I used imagemagick to deliberately reduce the quality of an image. These are the specs of both images:

Code:

General
Complete name                            : 150729145544-03-trump-quotes-super-169.jpg
Format                                   : JPEG
File size                                : 81.1 KiB

Image
Format                                   : JPEG
Width                                    : 1 100 pixels
Height                                   : 619 pixels
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Compression mode                         : Lossy
Stream size                              : 81.1 KiB (100%)

General
Complete name                            : 150729145544-03-trump-quotes-super-169_q=1.jpg
Format                                   : JPEG
File size                                : 6.61 KiB

Image
Format                                   : JPEG
Width                                    : 1 100 pixels
Height                                   : 619 pixels
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Compression mode                         : Lossy
Stream size                              : 6.61 KiB (100%)

The only difference is in the name and size of the file.
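For anyone who wants a number rather than an eyeball comparison, here is a minimal sketch of the same point. The pixel values are made up, and the rounding step is only a crude stand-in for a lossy codec, not what ImageMagick or JPEG actually does. It shows two "images" with matching width, height and bit depth whose pixel data still differ, measured with PSNR.

```python
import math

def psnr(original, degraded, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(original) != len(degraded):
        raise ValueError("images must have the same number of samples")
    mse = sum((a - b) ** 2 for a, b in zip(original, degraded)) / len(original)
    if mse == 0:
        return math.inf  # identical pixel data
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 4x2 8-bit grayscale images: same width, height and bit depth.
width, height = 4, 2
original = [10, 50, 90, 130, 170, 210, 250, 30]
# Crude stand-in for lossy compression: round each sample to a multiple of 32.
degraded = [min(255, round(v / 32) * 32) for v in original]

same_specs = len(original) == len(degraded) == width * height
print("specs match:", same_specs)
print("PSNR original vs itself:", psnr(original, original))
print("PSNR original vs degraded:", psnr(original, degraded))
```

An infinite PSNR means the pixels are identical; any finite value means they changed, whatever the reported specs say.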

Seeing:

I have Ghostscript 9.15 on OS X 10.8.5. It shouldn't really matter which version of gs or what operating system. gs has been around for decades and doesn't change that much.

First I used your settings:

Code:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$o" "$i"

That gave inferior results so I had to try different parameters and settings based on prior experience. This is the combination that gave identical results:

Code:

gs -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dNOPAUSE -dBATCH -sOutputFile="$o" "$i"

You can see increased mosquito noise in the top image for the default gs settings that you used. The bottom one, done using my settings, is identical to the middle one, the original.

The files you uploaded looked identical but that doesn't mean that your default settings will work in all cases. Here is just one example that shows that it doesn't.
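To make it easier to rerun both variants on the same file, the two command lines above can be assembled from one helper. This is only a sketch: the function names and the file names are hypothetical, it uses only the flags already quoted in this thread, and it assumes the executable is called `gs` and is on the PATH (Windows installs often use a different name). It makes no claim about which variant preserves images; that is what the side-by-side comparison is for.

```python
import subprocess

def build_gs_args(src, dst, pdfsettings=None, compat_level=None, quiet=False):
    """Assemble a pdfwrite command line from the flags quoted in this thread."""
    args = ["gs", "-sDEVICE=pdfwrite"]
    if compat_level is not None:
        args.append(f"-dCompatibilityLevel={compat_level}")
    if pdfsettings is not None:
        args.append(f"-dPDFSETTINGS=/{pdfsettings}")
    args.append("-dNOPAUSE")
    if quiet:
        args.append("-dQUIET")
    args += ["-dBATCH", f"-sOutputFile={dst}", src]
    return args

def run_gs(src, dst, **options):
    """Run Ghostscript; assumes a `gs` executable is on PATH."""
    subprocess.run(build_gs_args(src, dst, **options), check=True)

# Hypothetical file names; the first call mirrors the default command,
# the second mirrors the /prepress command.
default_args = build_gs_args("in.pdf", "out_default.pdf", compat_level="1.4", quiet=True)
prepress_args = build_gs_args("in.pdf", "out_prepress.pdf", pdfsettings="prepress")
print(" ".join(default_args))
print(" ".join(prepress_args))
```

Calling `run_gs("in.pdf", "out_prepress.pdf", pdfsettings="prepress")` would then produce the file to compare against the original.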

willus · 11-08-2015, 09:57 PM

Quote:

Originally Posted by PHC

The data doesn't mean anything. An image and its copy can have identical specs but be very different in quality. ...

You are correct. What I discovered doing more experimentation is that if the source PDF has lossless bitmap encodings (e.g. LZW/Flate, as I had in my example), then the ghostscript output with "pdfwrite" is a lossless replica. But if the source PDF has bitmaps with lossy encodings, this is not the case--the ghostscript output will be different. Since k2pdfopt defaults to writing LZW/Flate bitmaps, this method has always worked fine on k2pdfopt output, but it is not as effective for PDFs with lossy encodings. (Also, ghostscript does not support writing PDFs with JBIG2/JPX encodings, which you had in your source PDF, due to patent issues, so it had to change these encodings to something else entirely.)
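As a rough way to apply that rule of thumb before converting, one could sort the image filters a PDF reports into "lossless" and "possibly lossy" groups. The sketch below is an assumption-laden helper, not anything k2pdfopt or Ghostscript provides: the function name is made up, the filter names are the standard PDF stream filter names (k2pdfopt prints the shorter "Flate"), and JPX and JBIG2 are listed as only possibly lossy because both formats also have lossless modes. A "safe" result reflects the observation above, not a guarantee about what Ghostscript will do with a given file.

```python
# Standard PDF stream filter names, grouped by whether a decode and
# re-encode cycle with the same filter can change pixel values.
ALWAYS_LOSSLESS = {"FlateDecode", "LZWDecode", "RunLengthDecode", "CCITTFaxDecode"}
POSSIBLY_LOSSY = {"DCTDecode", "JPXDecode", "JBIG2Decode"}

def reencode_risk(filters):
    """Return 'safe', 'at risk' or 'unknown' for one image's filter chain."""
    if any(f in POSSIBLY_LOSSY for f in filters):
        return "at risk"
    if all(f in ALWAYS_LOSSLESS for f in filters):
        return "safe"  # also covers an empty chain, i.e. uncompressed data
    return "unknown"

print(reencode_risk(["FlateDecode"]))   # the case in my pooh example
print(reencode_risk(["JPXDecode"]))     # the kind of encoding in PHC's file
print(reencode_risk(["SomeOtherFilter"]))
```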

DaleDe · 11-09-2015, 11:21 AM

It would seem that Ghostscript is saving the image rather than copying the binary data into a file. Saving a lossless image will essentially create the same image but a lossy image will certainly change if you create it instead of copying the binary. There are programs that know how to copy the binary. Also copying a binary does not require a license for the format.

Dale

willus · 11-09-2015, 02:44 PM

Quote:

Originally Posted by DaleDe

It would seem that Ghostscript is saving the image rather than copying the binary data into a file. Saving a lossless image will essentially create the same image but a lossy image will certainly change if you create it instead of copying the binary. There are programs that know how to copy the binary. Also copying a binary does not require a license for the format.

Agreed. I got fooled because ghostscript actually tries to maintain the source PDF's encoding method for each image, and because of that, I thought it was just replicating the original image's object data stream from the source file, but apparently it is not. As you said, if the image is lossless to begin with, then there is no impact to re-encoding.
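Here is a toy illustration of that difference, under stated assumptions: zlib stands in for Flate, and a simple quantizer stands in for a lossy codec (real JPEG or JPX encoders are far more involved). The sample data and step sizes are arbitrary.

```python
import zlib

def flate_round_trip(pixels: bytes) -> bytes:
    """Decode and re-encode cycle with a lossless codec (zlib/Flate)."""
    return zlib.decompress(zlib.compress(pixels))

def lossy_round_trip(pixels: bytes, step: int) -> bytes:
    """Stand-in for a lossy codec: quantize every sample to a multiple of `step`."""
    return bytes(min(255, round(p / step) * step) for p in pixels)

source = bytes(range(0, 256, 5))        # hypothetical grayscale samples
stored = lossy_round_trip(source, 16)   # pixels as a lossy source PDF would hold them

# Re-encoding with a lossless codec reproduces the stored pixels exactly.
print("lossless re-encode unchanged:", flate_round_trip(stored) == stored)
# Re-encoding with a lossy codec at different settings alters them again.
print("lossy re-encode unchanged:", lossy_round_trip(stored, 24) == stored)
```

Copying the binary stream avoids the question entirely, since nothing is decoded in the first place.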

Ora · 11-21-2015, 05:44 PM

Quote:

Originally Posted by PHC

I have no idea. I'd have to try it myself. If the file does not contain any sensitive information, you could upload it to a free cloud service and post the link.

My apologies for answering late, the subscription to the thread seems to be lagging for me.

Anyway, here is the file I've been talking about:
http://www.4shared.com/office/IwRiih...tanje_o_s.html

PHC · 11-21-2015, 06:10 PM

Quote:

Originally Posted by willus

You are correct. What I discovered doing more experimentation is that if the source PDF has lossless bitmap encodings (e.g. LZW/Flate, as I had in my example), then the ghostscript output with "pdfwrite" is a lossless replica. But if the source PDF has bitmaps with lossy encodings, this is not the case--the ghostscript output will be different. Since k2pdfopt defaults to writing LZW/Flate bitmaps, this method has always worked fine on k2pdfopt output, but it is not as effective for PDFs with lossy encodings. (Also, ghostscript does not support writing PDFs with JBIG2/JPX encodings, which you had in your source PDF, due to patent issues, so it had to change these encodings to something else entirely.)

That is very interesting. So it treats even different LOSSY encodings differently apparently. Where is it stated that this particular encoding has a patent issue? Is there documentation that states which codecs are supported?

PHC · 11-21-2015, 06:15 PM

Quote:

Originally Posted by DaleDe

It would seem that Ghostscript is saving the image rather than copying the binary data into a file. Saving a lossless image will essentially create the same image but a lossy image will certainly change if you create it instead of copying the binary. There are programs that know how to copy the binary. Also copying a binary does not require a license for the format.

Dale

AFAIK, cpdf and pdftk do give you a copy if the format is supported. But on this file, even Acrobat outputs a slightly inferior image on extracted pages, as do the other two, so it must be the codec. If a PDF has TIFF images, the image quality of extracted pages is identical.

willus · 11-22-2015, 08:39 AM

Quote:

Originally Posted by PHC

That is very interesting. So it treats even different LOSSY encodings differently apparently. Where is it stated that this particular encoding has a patent issue? Is there documentation that states which codecs are supported?

I found it on the ghostscript bug tracking system.
