Go Back   MobileRead Forums > E-Book Formats > PDF

Old 11-08-2015, 12:11 AM   #16
willus
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by PHC View Post
I bring it up because it re-encodes images and loses the TOC. Both of those are undesirable and eliminated by cpdf. And cpdf is much easier to use.
The method I posted does not re-encode the images in the PDF.

[Edit 9 Nov 15: My claim here is not correct. Ghostscript with "pdfwrite" output does re-encode images, so if the source PDF encodings are lossy, the output images can change--continue reading in thread.]

Last edited by willus; 11-09-2015 at 08:11 AM.
Old 11-08-2015, 01:31 AM   #17
PHC
Member
Posts: 21
Karma: 15000
Join Date: Feb 2014
Device: iPhone, iPad, Macbook Pro, Mac Pro
Quote:
Originally Posted by willus View Post
The method I posted does not re-encode the images in the PDF.
OK, I just did a quick test. I extracted 10 pages from a scanned OCRed PDF using Acrobat. I then used your exact parameters, which are just the default ones you probably blindly copied from another post. Though I do that initially when I want to try something, I will then go and read the documentation and learn what other parameters I need to pay attention to. First off, the input file was 902kB, while the output file was 725kB. Second, I got an error:
Code:
GPL Ghostscript 9.15: Missing glyph CID=0, glyph=0067 in the font HiddenHorzOCR . The output PDF may fail with some viewers.
I then opened both files in Acrobat and maximized them and did a simple A-B comparison using the keyboard to switch rapidly back and forth multiple times. I looked at various pages and chose the better looking file without knowing which was which. Anything that was monochrome (black on white) was indistinguishable but grayscale images in the gs file definitely showed noticeable mosquito noise around lines and edges. So smaller file, missing glyph, noise, all indicators of lossiness. I didn't try it on text or vector graphics. It would no doubt be closer but probably not identical. In any case, using default settings for gs for every case is a mistake. You need to tweak the settings for each case. I've done countless hours of testing with many different settings and made notes on the results. Have you?

BTW, if I simply copy the PDF with cpdf to a new file, it is identical.
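A claim that a copy is "identical" can be checked at the byte level. The sketch below is only an illustration, and the function name same_bytes and the file names in the usage note are hypothetical. It relies on nothing beyond the standard cmp utility.

```shell
# same_bytes A B: succeed only if the two files are byte-for-byte identical.
# cmp -s prints nothing and reports the result through its exit status.
same_bytes() {
    cmp -s -- "$1" "$2"
}
```

Usage would look like `same_bytes input.pdf copy.pdf && echo identical`. This is the strictest possible test: a tool that rewrites metadata or object numbering will fail it even when every image stream is untouched, so a mismatch by itself does not show that image data changed.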

Last edited by PHC; 11-08-2015 at 09:04 AM.
Old 11-08-2015, 10:35 AM   #18
willus
Fuzzball, the purple cat
Quote:
Originally Posted by PHC View Post
OK, I just did a quick test. I extracted 10 pages from a scanned OCRed PDF using Acrobat...
Please post your source and converted files, the exact command you used, the version of Ghostscript you used, and the OS you did the conversion on. This method was sent to me in an e-mail and I verified that it worked. So "copied" yes, "blindly copied", no. Why would I change something that works? Attached are my examples. Command used (on Windows 10 PC):

Code:
C:\>gs
GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
GS>quit

C:\>gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=pooh_ghostscript_out.pdf pooh_src.pdf
Resulting PDF information in each file is below. Notice that the bitmap of the page has identical resolution, depth, and encoding method.

Code:
C:\>k2pdfopt -i pooh_src.pdf
k2pdfopt v2.33a (w/MuPDF,DjVuLibre,OCR) (c) 2015, GPLv3, http://willus.com
    Compiled Oct  3 2015 with Gnu C (Mingw64) v5.2.0 for Win64 on x64.

FILE:           pooh_src.pdf
PDF VERSION:    1.3
TITLE:          pooh.pdf
CREATED:        D:20100328132840
LAST MODIFIED:  D:20151108072341-08'00
PDF PRODUCER:   K2pdfopt v2.33a
FILE SIZE:      346.4 kB (354,746 bytes)
PAGES:          1

       Page       Ref           Details
Mediaboxes (1):
        1       (2 0 R):        [ 0 0 455.5 579.4 ] (6.33 x 8.05 in)

Fonts (2):
        1       (2 0 R):        Type1 'Helvetica' (0 0 R)
        1       (2 0 R):        Type1 'Helvetica' (0 0 R)

Images (1):
        1       (2 0 R):        [ Flate ] 949x1207 4bpc DevRGB (5 0 R)


C:\>k2pdfopt -i pooh_ghostscript_out.pdf
k2pdfopt v2.33a (w/MuPDF,DjVuLibre,OCR) (c) 2015, GPLv3, http://willus.com
    Compiled Oct  3 2015 with Gnu C (Mingw64) v5.2.0 for Win64 on x64.

FILE:           pooh_ghostscript_out.pdf
PDF VERSION:    1.4
CREATED:        D:20151108072511-08'00'
LAST MODIFIED:  D:20151108072511-08'00'
PDF PRODUCER:   GPL Ghostscript 9.02
FILE SIZE:      426.0 kB (436,225 bytes)
PAGES:          1

       Page       Ref           Details
Mediaboxes (1):
        1       (4 0 R):        [ 0 0 455.5 579.4 ] (6.33 x 8.05 in)

Fonts (1):
        1       (4 0 R):        Type1 'Helvetica' (9 0 R)

Images (1):
        1       (4 0 R):        [ Flate ] 949x1207 4bpc DevRGB (8 0 R)
PS. I have never argued that cpdf cannot make identical copies of a PDF or that ghostscript is better at it. I originally posted because the OP wanted to remove cropped content, and the method I posted removes cropped-out text (but not cropped-out images). If cpdf can remove cropped-out areas (without help from Acrobat), then please post how, otherwise I consider it irrelevant to why I posted.
Attached Files
File Type: pdf pooh_src.pdf (346.4 KB, 483 views)
File Type: pdf pooh_ghostscript_out.pdf (426.0 KB, 433 views)

Last edited by willus; 11-08-2015 at 10:42 AM.
Old 11-08-2015, 08:48 PM   #19
PHC
Member
Thumbs down

Code:
Resulting PDF information in each file is below.  Notice that the bitmap of the page has identical resolution, depth, and encoding method.
The data doesn't mean anything. An image and its copy can have identical specs but be very different in quality. The proof is in the seeing. Here's a simple example:

I used imagemagick to deliberately reduce the quality of an image. These are the specs of both images:

Code:
General
Complete name                            : 150729145544-03-trump-quotes-super-169.jpg
Format                                   : JPEG
File size                                : 81.1 KiB

Image
Format                                   : JPEG
Width                                    : 1 100 pixels
Height                                   : 619 pixels
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Compression mode                         : Lossy
Stream size                              : 81.1 KiB (100%)

General
Complete name                            : 150729145544-03-trump-quotes-super-169_q=1.jpg
Format                                   : JPEG
File size                                : 6.61 KiB

Image
Format                                   : JPEG
Width                                    : 1 100 pixels
Height                                   : 619 pixels
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Compression mode                         : Lossy
Stream size                              : 6.61 KiB (100%)
The only difference is in the name and size of the file.

Seeing:



I have Ghostscript 9.15 on OS X 10.8.5. It shouldn't really matter which version of gs or what operating system. gs has been around for decades and doesn't change that much.

First I used your settings:
Code:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$o" "$i"
That gave inferior results so I had to try different parameters and settings based on prior experience. This is the combination that gave identical results:
Code:
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dNOPAUSE -dBATCH -sOutputFile="$o" "$i"


You can see increased mosquito noise in the top image for the default gs settings that you used. The bottom one, done using my settings, is identical to the middle one, the original.

The files you uploaded looked identical but that doesn't mean that your default settings will work in all cases. Here is just one example that shows that it doesn't.
Attached Files
File Type: pdf Input Acrobat crop gs.pdf (708.5 KB, 442 views)
File Type: pdf Input Acrobat crop.pdf (881.0 KB, 579 views)
File Type: pdf Input Acrobat crop gs [prepress].pdf (838.7 KB, 430 views)

Last edited by PHC; 11-08-2015 at 08:53 PM.
Old 11-08-2015, 09:57 PM   #20
willus
Fuzzball, the purple cat
Quote:
Originally Posted by PHC View Post
The data doesn't mean anything. An image and its copy can have identical specs but be very different in quality. ...
You are correct. What I discovered doing more experimentation is that if the source PDF has lossless bitmap encodings (e.g. LZW/Flate, as I had in my example), then the ghostscript output with "pdfwrite" is a lossless replica. But if the source PDF has bitmaps with lossy encodings, this is not the case--the ghostscript output will be different. Since k2pdfopt defaults to writing LZW/Flate bitmaps, this method has always worked fine on k2pdfopt output, but it is not as effective for PDFs with lossy encodings. (Also, ghostscript does not support writing PDFs with JBIG2/JPX encodings, which you had in your source PDF, due to patent issues, so it had to change these encodings to something else entirely.)
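The lossless half of this observation can be illustrated without Ghostscript. The sketch below is an analogy, not the pdfwrite code path: it uses gzip, which applies the same DEFLATE algorithm as PDF's Flate filter but in a different container, and the function name flate_roundtrip_ok is hypothetical. It compresses a file at two different levels and checks that both decode back to the original bytes, which is why re-encoding a lossless stream cannot change the pixels even when the compressed bytes differ.

```shell
# flate_roundtrip_ok FILE: compress FILE with DEFLATE at two levels and
# confirm that both results decode to exactly the original bytes.
flate_roundtrip_ok() {
    src=$1
    tmp=$(mktemp -d) || return 1
    gzip -1 -c -- "$src" > "$tmp/fast.gz" &&
    gzip -9 -c -- "$src" > "$tmp/best.gz" &&
    gzip -dc -- "$tmp/fast.gz" | cmp -s -- - "$src" &&
    gzip -dc -- "$tmp/best.gz" | cmp -s -- - "$src"
    status=$?
    rm -rf -- "$tmp"   # clean up the scratch directory either way
    return "$status"
}
```

A lossy codec such as JPEG has no equivalent guarantee: decoding and encoding again quantizes the data a second time, so the output can differ from the source even when the reported resolution, depth, and filter are the same.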

Last edited by willus; 11-08-2015 at 10:37 PM. Reason: Changed response after more testing.
Old 11-09-2015, 11:21 AM   #21
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
It would seem that Ghostscript is saving the image rather than copying the binary data into a file. Saving a lossless image will essentially create the same image but a lossy image will certainly change if you create it instead of copying the binary. There are programs that know how to copy the binary. Also copying a binary does not require a license for the format.

Dale

Last edited by DaleDe; 11-09-2015 at 11:28 AM.
Old 11-09-2015, 02:44 PM   #22
willus
Fuzzball, the purple cat
Quote:
Originally Posted by DaleDe View Post
It would seem that Ghostscript is saving the image rather than copying the binary data into a file. Saving a lossless image will essentially create the same image but a lossy image will certainly change if you create it instead of copying the binary. There are programs that know how to copy the binary. Also copying a binary does not require a license for the format.
Agreed. I got fooled because ghostscript actually tries to maintain the source PDF's encoding method for each image, and because of that, I thought it was just replicating the original image's object data stream from the source file, but apparently it is not. As you said, if the image is lossless to begin with, then there is no impact to re-encoding.
Old 11-21-2015, 05:44 PM   #23
Ora
Junior Member
Posts: 3
Karma: 14228
Join Date: Oct 2015
Device: none
Quote:
Originally Posted by PHC View Post
I have no idea. I'd have to try it myself. If the file does not contain any sensitive information, you could upload it to a free cloud service and post the link.
My apologies for answering late, the subscription to the thread seems to be lagging for me. Anyway, here is the file I've been talking about:
http://www.4shared.com/office/IwRiih...tanje_o_s.html
Old 11-21-2015, 06:10 PM   #24
PHC
Member
Quote:
Originally Posted by willus View Post
You are correct. What I discovered doing more experimentation is that if the source PDF has lossless bitmap encodings (e.g. LZW/Flate, as I had in my example), then the ghostscript output with "pdfwrite" is a lossless replica. But if the source PDF has bitmaps with lossy encodings, this is not the case--the ghostscript output will be different. Since k2pdfopt defaults to writing LZW/Flate bitmaps, this method has always worked fine on k2pdfopt output, but it is not as effective for PDFs with lossy encodings. (Also, ghostscript does not support writing PDFs with JBIG2/JPX encodings, which you had in your source PDF, due to patent issues, so it had to change these encodings to something else entirely.)
That is very interesting. So it treats even different LOSSY encodings differently apparently. Where is it stated that this particular encoding has a patent issue? Is there documentation that states which codecs are supported?
Old 11-21-2015, 06:15 PM   #25
PHC
Member
Quote:
Originally Posted by DaleDe View Post
It would seem that Ghostscript is saving the image rather than copying the binary data into a file. Saving a lossless image will essentially create the same image but a lossy image will certainly change if you create it instead of copying the binary. There are programs that know how to copy the binary. Also copying a binary does not require a license for the format.

Dale
AFAIK, cpdf and pdftk do give you a copy if the format is supported. But on this file, even Acrobat outputs a slightly inferior image on extracted pages, as do the other two, so it must be the codec. If a PDF has tiff images, the image quality of extracted pages is identical.
PHC is offline   Reply With Quote
Old 11-22-2015, 08:39 AM   #26
willus
Fuzzball, the purple cat
Quote:
Originally Posted by PHC View Post
That is very interesting. So it treats even different LOSSY encodings differently apparently. Where is it stated that this particular encoding has a patent issue? Is there documentation that states which codecs are supported?
I found it on the ghostscript bug tracking system.
All times are GMT -4.