Cropping PDFs


harryo
08-14-2006, 08:34 AM
I was thinking that it would be nice if iRex allowed us to add some cropping information to the manifest file.

It would be great if, on a per document basis, we could add something like (hope this comes through OK) ...

<croptop>20</croptop><cropbottom>20</cropbottom>
<cropleft>50</cropleft><cropright>50</cropright>

The PDF renderer could generate a larger page that is then cropped by the specified number of pixels before displaying.

This would allow the user to reduce the amount of white space around the pages by as much as they prefer.

I'm suggesting this for the case where one only has the PDF (I have a lot of reference books I've purchased in PDF format that I don't have the source for).

ali
08-14-2006, 08:44 AM
I found that all PDF's I tried so far could be cropped manually by editing the MediaBox.

For example, an A4 PDF will contain something like
/MediaBox [ 0 0 595 842 ]
in one or several places. I just adjusted the four values - to crop 50 points from each border, I'd change that to
/MediaBox [ 50 50 545 792 ]

Both acroread and xpdf will honor that. That way, you can even crop page numbers / chapter headings and such away.

Under Linux, I'd do it using sed:
sed 's/0 0 595 842/50 50 545 792/g' <book.pdf >cropped.pdf

Under Windows, you'd have to open the pdf in notepad or something (don't know if that harms binary streams in the pdf)
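
For anyone who wants the arithmetic spelled out, here is a minimal Ruby sketch of the MediaBox adjustment described above. The helper name crop_media_box is made up for illustration, and it assumes the usual [llx lly urx ury] order with the origin at the bottom-left of the page.

```ruby
# Compute a cropped MediaBox from per-side margins, all in PDF points.
# Assumes the box is [llx lly urx ury]: lower-left corner first, then
# upper-right, with the origin at the bottom-left of the page.
# crop_media_box is a hypothetical helper name, not part of any library.
def crop_media_box(box, left:, bottom:, right:, top:)
  llx, lly, urx, ury = box
  cropped = [llx + left, lly + bottom, urx - right, ury - top]
  if cropped[0] >= cropped[2] || cropped[1] >= cropped[3]
    raise ArgumentError, "margins leave no page area"
  end
  cropped
end

a4 = [0, 0, 595, 842]
box = crop_media_box(a4, left: 50, bottom: 50, right: 50, top: 50)
puts "/MediaBox [ #{box.join(' ')} ]"
```

For the A4 example this prints the same /MediaBox [ 50 50 545 792 ] line as in the post.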

yokos
08-14-2006, 08:46 AM
I was thinking that it would be nice if iRex allowed us to add some cropping information to the manifest file.

It would be great if, on a per document basis, we could add something like (hope this comes through OK) ...

<croptop>20</croptop><cropbottom>20</cropbottom>
<cropleft>50</cropleft><cropright>50</cropright>

Nice idea! Yepp, not all iLiad users have access to commercial software like Acrobat to crop pdf files.

Cropping of PDF files was discussed earlier in this forum, but I don't know what the result was [whether an OS program exists].

Uh, just change the mediabox tag. This is so easy? Cool!

harryo
08-14-2006, 08:51 AM
I found that all PDF's I tried so far could be cropped manually by editing the MediaBox.

For example, an A4 PDF will contain something like
/MediaBox [ 0 0 595 842 ]
in one or several places. I just adjusted the four values - to crop 50 points from each border, I'd change that to
/MediaBox [ 50 50 545 792 ]

Both acroread and xpdf will honor that. That way, you can even crop page numbers / chapter headings and such away.


Fantastic! I'll try whipping up a Ruby script to do this automatically for me.

Thanks for the information.

yokos
08-14-2006, 09:09 AM
Well, I tried this; it didn't work for me. Adobe Reader can't read this file anymore. Perhaps I did it the wrong way.

ali
08-14-2006, 09:13 AM
Well, I tried this; it didn't work for me. Adobe Reader can't read this file anymore. Perhaps I did it the wrong way.

What did you do, and how?

harryo
08-14-2006, 12:00 PM
Well, I had a certain amount of success with the MediaBox hacking approach.

However, for some of my PDFs, while this change seems to have had the desired effect when looking at them in Acrobat, after transfer to the Iliad, there seems to be extra white space on the left and right, somewhat negating the desired result.

I've not worked out yet how the ones that work well differ from those that exhibit the extra white space, but I'll keep working at it.

At least I've been able to get rid of extraneous white space at top and bottom in every document, which does make a difference.

ElaHuguet
08-14-2006, 12:04 PM
The iLiad fits the pdf into its screen respecting the document's proportions, so maybe that's why you're still seeing white side margins.

edit: ie. cropping the side margins on a 15cm long pdf will make absolutely no difference.
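
A small Ruby sketch of that fitting rule, to show why cropping only the sides can change nothing. The 768x1024 screen size is an assumption for illustration, and fit_scale is a made-up helper name.

```ruby
# Scale factor at which a page fits inside the screen while keeping its
# proportions: whichever axis runs out of room first sets the zoom.
# fit_scale is a hypothetical helper name.
def fit_scale(page_w, page_h, screen_w, screen_h)
  [screen_w.to_f / page_w, screen_h.to_f / page_h].min
end

# Assumed portrait screen size in pixels (not stated in this thread).
SCREEN_W = 768
SCREEN_H = 1024

full       = fit_scale(595, 842, SCREEN_W, SCREEN_H) # uncropped A4, in points
sides_only = fit_scale(495, 842, SCREEN_W, SCREEN_H) # 50 pt off left and right
all_sides  = fit_scale(495, 742, SCREEN_W, SCREEN_H) # 50 pt off every edge

puts "sides only changes the zoom: #{sides_only != full}"
puts "all sides changes the zoom:  #{all_sides != full}"
```

With these numbers the page is limited by its height, so trimming the sides alone leaves the zoom untouched; only once the top and bottom are trimmed as well does the page get bigger on screen.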

harryo
08-14-2006, 12:07 PM
What did you do, and how?

I don't know how ali did it, but I used this piece of Ruby code

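# Arguments: LEFT RIGHT TOP BOTTOM (in PDF points), then the path to the PDF.
cropLeft   = ARGV[0].to_i
cropRight  = ARGV[1].to_i
cropTop    = ARGV[2].to_i
cropBottom = ARGV[3].to_i

originalPath = ARGV[4]
croppedPath  = originalPath.sub(/\.pdf$/, " (cropped).pdf")

# Read and write in binary mode so streams and newlines are left untouched.
File.open(originalPath, "rb") do |input|
  pdf = input.read

  # MediaBox is [llx lly urx ury] with the origin at the bottom-left corner,
  # so the second value moves with the bottom crop and the fourth with the top crop.
  pdf.gsub!(/\/MediaBox\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]/) do
    n1 = $1.to_i + cropLeft
    n2 = $2.to_i + cropBottom
    n3 = $3.to_i - cropRight
    n4 = $4.to_i - cropTop

    "/MediaBox [#{n1} #{n2} #{n3} #{n4}]"
  end

  File.open(croppedPath, "wb") { |output| output.write pdf }
end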


I don't have a reference for the /MediaBox parameters, so I did it by trial and error. In reality, I didn't take much care, just ran the program and fiddled the four values until I got what looked good for each document.

I just wanted to get something working. I'll clean it up later, once I have the details worked out a bit better.

ath
08-14-2006, 12:50 PM
I found that all PDF's I tried so far could be cropped manually by editing the MediaBox.

PDF files conforming to older versions of the format can be done that way ... as long as you do not add or delete any characters! -- that will screw up the object offset table, and it's not certain that recovery will work. (If you have some other Box nearby, containing the same data, you may be able to remove it to get the extra character space: if I remember, all Boxes default to MediaBox if they aren't specified.)

Of course, if your editor changes the original newlines behind your back, it will mess things up even more. PDF files should be treated as binary files.
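
A rough Ruby sketch of the "do not add or delete any characters" rule: each /MediaBox entry is rewritten and then padded with spaces back to its original byte length. The method name is made up, and it only sees boxes that are stored as plain text in the file. Whether a given viewer is happy with the result is something to check per file.

```ruby
# Rewrite every /MediaBox entry without changing the file's byte length,
# so the byte offsets recorded in the xref table stay valid.
# replace_media_box_in_place is a hypothetical helper name.
def replace_media_box_in_place(pdf, new_box)
  pdf.gsub(/\/MediaBox\s*\[[^\]]*\]/) do |old|
    fresh = "/MediaBox [#{new_box.join(' ')}"
    room = old.bytesize - fresh.bytesize - 1 # one byte is kept for the closing "]"
    raise ArgumentError, "new box is longer than #{old.inspect}" if room < 0
    fresh + (" " * room) + "]"               # pad with spaces inside the brackets
  end
end

original = "<< /Type /Page /MediaBox [ 0 0 595 842 ] >>".b
patched  = replace_media_box_in_place(original, [50, 50, 545, 792])

puts patched
puts "same length: #{patched.bytesize == original.bytesize}"
```

The example only works because the original entry has spare spaces inside the brackets; when there is no slack, the sketch raises instead of silently shifting everything after it.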

ali
08-14-2006, 01:09 PM
Of course, if your editor changes the original newlines behind your back, it will mess things up even more. PDF files should be treated as binary files.

That explains the "bad xref table" errors I was getting. Xpdf complains "Error (0): PDF file is damaged - attempting to reconstruct xref table..." but displays pages correctly afterwards.

"Save a copy" in acroread doesn't fix it... Hmmmmmmmmmmmmmmmm Have to think about this.

arivero
08-14-2006, 01:51 PM
There is also a pdfcrop utility in the TeX packages, but it generates bigger files.

k2r
08-14-2006, 02:02 PM
I don't know if anybody mentioned it, but on Mac OS X you can just open your PDF in Preview.app, select an area on a page, check that the content area on the other pages is complete, and have Preview crop all pages using the same stencil.
Then use "save as..." to save a cropped copy.
5 mouseclicks.

k2r

harryo
08-14-2006, 05:53 PM
The iLiad fits the pdf into its screen respecting the document's proportions, so maybe that's why you're still seeing white side margins.

edit: ie. cropping the side margins on a 15cm long pdf will make absolutely no difference.

Ah, yes. That could explain it. It may be the ones where I didn't need to crop top and bottom much that are showing more white space on the left and right than I'd expected.

harryo
08-14-2006, 05:58 PM
There is also a pdfcrop utility in the TeX packages, but it generates bigger files.

That sounds like the perfect way to do it.

I'm not overly worried about the files being larger, unless we're talking an order of magnitude. The 100 or so MB in the Iliad is enough to hold a lot of reference material. My largest book is only about 8MB and that's over 800 pages worth.

I'll give that a go tonight.

harryo
08-14-2006, 06:01 PM
I don't know if anybody mentioned it, but on Mac OS X you can just open your PDF in Preview.app, select an area on a page, check that the content area on the other pages is complete, and have Preview crop all pages using the same stencil.
Then use "save as..." to save a cropped copy.
5 mouseclicks.

k2r

Just one more reason to love OS X. I only bought my first Mac about 18 months ago, but have become a serious fan. So nice to have both a beautiful interface and *nix underneath.

I'll definitely try this method. It would be a lot nicer than the trial and error I've been doing with my Ruby script up until now.

k2r
08-14-2006, 06:31 PM
Just an additional idea: If the resulting pdf is too big you could define a custom colorsync filter that grayscales and compresses the images and changes the resolution. The force is strong with preview/colorsync.

arivero
08-15-2006, 08:12 AM
Just one more reason to love OS X. I only bought my first Mac about 18 months ago, but have become a serious fan. So nice to have both a beautiful interface and *nix underneath.

I'll definitely try this method. It would be a lot nicer than the trial and error I've been doing with my Ruby script up until now.

Please do a "diff" afterwards so we all know which lines are modified by this method. And compare sizes too. :D

kusmi
08-15-2006, 09:01 AM
Mhh, it seems Mac OS X alters the complete PDF; it adds something binary as well (besides the CropBox directive) to the cropped PDF...

scotty1024
08-15-2006, 07:57 PM
I don't know if anybody mentioned it, but on Mac OS X you can just open your PDF in Preview.app, select an area on a page, check that the content area on the other pages is complete, and have Preview crop all pages using the same stencil.

Mac OS X Preview will also let you set and manage bookmarks in the PDF file.

I've been fooling around with this trying to see if I can get a spark of interest out of the iLiad. :)

kusmi
08-16-2006, 05:56 AM
What is also nice: Sometimes I have PDFs, where the text is in gray or light color, then you can use the ColorSync filters in Macosx to generate a profile to increase the contrast of the text.

ali
08-18-2006, 08:41 AM
For those who are still stuck with cropping by editing the MediaBox: I found that pdftk (http://www.accesspdf.com/pdftk/) repairs pdf files where the "xref table" is broken due to tampering with the MediaBox. Just do a "pdftk broken.pdf output good.pdf".

yokos
09-01-2006, 11:12 AM
For those who are still stuck with cropping by editing the MediaBox: I found that pdftk (http://www.accesspdf.com/pdftk/) repairs pdf files where the "xref table" is broken due to tampering with the MediaBox. Just do a "pdftk broken.pdf output good.pdf".
Nice to know.

[OT] »Brick (expensive kind)« gallows humour in its best form. Best wishes, hope you get out of this trouble fast!

arivero
09-12-2006, 08:36 AM
Well, I tried this; it didn't work for me. Adobe Reader can't read this file anymore. Perhaps I did it the wrong way.

For instance, it doesn't work with sed, because sed wipes away the null characters or something like that. It is a pity because

cat prueba.pdf | sed 's/MediaBox \[.*\]/MediaBox \[0 0 300 500\]/g' > prueba2b.pdf

sounds elegant. I wonder if there is some utility to transform binary pdf files to text and back.

ali
09-12-2006, 12:28 PM
For instance, it doesn't work with sed, because sed wipes away the null characters or something like that. It is a pity because

cat prueba.pdf | sed 's/MediaBox \[.*\]/MediaBox \[0 0 300 500\]/g' > prueba2b.pdf

sounds elegant. I wonder if there is some utility to transform binary pdf files to text and back.

Weird. The sed on my system has no problems with \0. perl might help:
perl -e 'while(<STDIN>){s/XXX/YYY/g;print;};' is equivalent to sed s/XXX/YYY/g

Are you sure that sed is your problem? Are you aware that the regexp you posted is erroneous? It matches from the first "MediaBox [" to the last "]" in a line, which might be the whole file. (sed always chooses the longest possible match of a regexp.) This is better:
sed 's/MediaBox *\[[^]]*\]/MediaBox [0 0 300 500]/g'

Finally all crashes of acroread after tampering with MediaBoxes could be traced back to a broken xref table, which can be reconstructed using pdftk (see earlier post).
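
The difference between the two patterns is easy to see on a one-line sample. This is a Ruby sketch rather than sed, and the sample line is made up, but the matching behaviour within a single line is the same.

```ruby
# A made-up line with two bracketed boxes on it.
line = "<< /MediaBox [0 0 595 842] /CropBox [0 0 595 842] >>"

greedy  = /MediaBox \[.*\]/      # ".*" runs on to the LAST "]" on the line
bounded = /MediaBox *\[[^\]]*\]/ # "[^\]]*" stops at the FIRST "]"

puts line[greedy]
puts line[bounded]
```

The greedy pattern swallows the /CropBox entry as well, so a substitution with it would wipe out everything up to the last bracket on the line; the bounded one matches only the MediaBox entry itself.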

yokos
10-27-2006, 07:10 AM
It may be worth your while to have a look here (http://www.accesspdf.com/article.php?story=20041130152129869). There you can download a pdftk plug-in for Vim [an "advanced text editor"].

Vim users can also install my plug-in for easily editing PDF code. When you open a PDF in Vim, the plug-in calls pdftk to uncompress the page streams, so they are editable. When you save the PDF, the plug-in uses pdftk to repair and re-compress the PDF.
This is quite handy!

You can find Vim versions for Windows, Linux, & Mac.

vstefanyuk
01-09-2008, 09:43 AM
PDFCropper is an application designed to prepare normal-sized (A4, B4, C4, letter, etc.) PDFs for reading on relatively small devices (Sony Reader PRS500/PRS505, iRex iLiad, etc.).

The problem is that PDF is not reflowable by nature. Yes, there is a reflow mode in Acrobat Reader, but first, Acrobat Reader is not available for most e-book readers (especially e-ink devices),
and second, even with the reflow function, reading complex content (technical books, magazines, etc.) is not comfortable. Badly formatted PDFs and wide white margins make the situation even worse.

The only way to solve this problem (at least in my experience) is to cut the original pages into smaller pages and remove the white space.
This is exactly what the program does. Compared with similar software, PDFCropper is much more flexible, which makes it possible to prepare better-quality books in a very short time.

PDFCropper can produce text and image pdf's.

PDFCropper web-site currently is under construction. But it is already available for downloading:

PDFCropper v1.1 RC1 links:

http://rapidshare.com/files/87491638/PDFCropper1_1RC1Setup.exe.html
or
http://www.filefactory.com/file/a07bc5/
or
http://www.megaupload.com/?d=JI7CM0N4

The trial version of PDFCropper is fully functional; the only difference is that output pages include a watermark (which, by the way, displays the registration code you need to obtain a license).

You can also download and evaluate an example of a prepared (cropped) pdf:

This is original one - http://java.sun.com/docs/books/jls/download/langspec-3.0.pdf

These are links to cropped one:

http://rapidshare.com/files/87301097/The_Java_Language_Specification.pdf.html
or
http://www.filefactory.com/file/252400/
or
http://www.megaupload.com/?d=8KW32DJT

And these are links to the one cropped as image pages (only some pages are included, due to the large file size):

http://rapidshare.com/files/87486857/The_Java_Language_Specification__images__partly_.pdf.html
or
http://www.filefactory.com/file/d02ead/
or
http://www.megaupload.com/?d=MJCYT8KH

Also, anyone interested in the software can send me an example pdf, and I will send back the resulting pdf prepared with the application.

License price per one computer is 20 euro. Payment can be made via PayPal or directly to bank account.

You can ask any questions about using or installing software and details about purchasing the license via e-mail:

vstefanyuk@gmail.com

P.S. The application is implemented in Java. You need a Java environment, version 1.5 or higher, installed.
Ghostscript also has to be installed. If it is not, the application will offer to download it.

Change log:
v1.1 RC1 - fixed the bug with adding a table of contents (the corresponding menu item on the Sony PRS 500/505 now displays the table of contents correctly).
added a separate option for adding clickable table of contents pages.
added display of the table of contents in the left pane.
improved auto-framing functionality - added the possibility to create frames with a size proportional to the device size (paper size setting in the Crop Settings dialog).
improved cropping functionality - added the possibility to remove empty frames from the resulting pdf ("Skip empty frames" option in the Crop dialog).
improved display of input errors in dialogs; erroneous fields are highlighted in red.
fixed a bug with deleting the last inaccessible project reference from the recent projects list.
added checking of pdf restrictions when opening a pdf; if there are not enough rights to manipulate the pdf, the program will ask for the pdf user password.
example pdf's are updated.
simplified trial mode - removed pages shuffling!

canicula
01-31-2008, 05:52 PM
I don't know if anybody mentioned it, but on Mac OS X you can just open your PDF in Preview.app, select an area on a page, check that the content area on the other pages is complete, and have Preview crop all pages using the same stencil.
Then use "save as..." to save a cropped copy.
5 mouseclicks.

Has this changed in the leopard version of Preview? I can't see a way to crop multiple pages at the same time.

Ian

canicula
01-31-2008, 05:59 PM
Has this changed in the leopard version of Preview? I can't see a way to crop multiple pages at the same time.

Ignore that. It does work in Leopard. Use the select tool on one page, then select the other pages in the sidebar thumbnail views, then crop. That does them all. Cool.

Ian

PieterH
01-31-2008, 10:39 PM
I've been reading this site for some time and I've found a solution for expanding pdf's by "tiling". This allows one page to be expanded to two or more pages making a zoomed in version of a pdf.

This can be done with MindCad Tiler:
http://mac.softpedia.com/progDownload/MindCad-Tiler-Download-9560.html
http://www.mindcad.com/tiler.html
which is only for mac (sry)
or
Adobe Acrobat Professional

-in Tiler open your pdf, set page setup to print in landscape mode (on side) and save as pdf. You will need to repeat this for every page...

-in Adobe, select print; in the options screen that opens set page scaling to "tile all pages" then set zoom to 180% (fits one page onto two pages nicely) and then in output options select "save as file" pdf with pdf formatting

Both of these options produce neatly cut pages but the text recognition is damaged in some, not all, programs. Sometimes text can be cropped a bit but the remaining text looks fine.

I haven't tried this on a ebook reader yet as I am still undecided but this pdf size problem shouldn't be too bad of a turn off if you are into reading articles where pdf formatting is important.

amgoforth
03-26-2009, 12:39 AM
I don't know if anybody mentioned it, but on Mac OS X you can just open your PDF in Preview.app, select an area on a page, check that the content area on the other pages is complete, and have Preview crop all pages using the same stencil.
Then use "save as..." to save a cropped copy.
5 mouseclicks.

k2r

Thanks for the info. Although I am a long time mac user I didn't know that. I am trying it now, but it is taking so long that the file will probably be way too large. I am working with Google and Internet Archive pdfs.

sdax
11-13-2009, 04:01 AM
Try my 'pdfhelpers' scripts. This is my first work in Python, but it does everything I need to do with PDFs.

pdfsort - groups pdfs by formats (input and output paths are in pdfhelpers.conf)
pdfcrop <directory_with_pdfs_to_crop> - crops white margins

Depends on Python 2.6.x, PyPDF 1.12, GPL Ghostscript 8.xx, pdftk 1.41

See pdfhelpers.conf for options.

sdax
11-20-2009, 07:22 AM
New version. With pdf2up.jar module (based on iText http://www.lowagie.com/iText/) - for booklet collecting same formats.

Depends on Python 2.6.x, PyPDF 1.12, GPL Ghostscript 8.xx, pdftk 1.41, Java RE

See pdfhelpers.conf for options.

badbob001
11-20-2009, 11:41 AM
This would be a whole lot easier and automatic if the PDF viewer has an option to zoom to fit visible, thus ignoring the white spaces. Are there any PDF viewers with this option? A tolerance slider would be nice so it can ignore the page number in the corner.

The irex delta chip, when rendering the screen, should know exactly where the white spaces are... they should consider adding some sort of auto-zoom mode for the rendering pipeline.
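
To make the "zoom to fit visible with a tolerance" idea concrete, here is a minimal Ruby sketch that works on an already rendered grayscale page. The function name, the 0-255 pixel convention and the threshold are all assumptions for illustration; it is not how any existing viewer does it.

```ruby
# page: array of rows, each an array of grayscale values (0 = black, 255 = white).
# A row or column counts as content only if it holds more than `tolerance`
# dark pixels, so small marks such as a page number can be ignored.
# content_box is a hypothetical helper name.
def content_box(page, dark_below: 128, tolerance: 0)
  dark = page.map { |row| row.map { |v| v < dark_below } }
  row_counts = dark.map { |row| row.count(true) }
  col_counts = dark.transpose.map { |col| col.count(true) }
  rows = row_counts.each_index.select { |i| row_counts[i] > tolerance }
  cols = col_counts.each_index.select { |i| col_counts[i] > tolerance }
  return nil if rows.empty? || cols.empty?
  { left: cols.first, top: rows.first, right: cols.last, bottom: rows.last }
end

# 8x8 white page with a block of "text" and a one-pixel "page number" in the corner.
page = Array.new(8) { Array.new(8, 255) }
(2..4).each { |y| (1..5).each { |x| page[y][x] = 0 } }
page[7][7] = 0

p content_box(page)               # box stretches out to the page number
p content_box(page, tolerance: 1) # box hugs the text block only
```

A viewer could then zoom so that this box, rather than the full MediaBox, fills the screen.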