"Cleansing" of PDF files


psychomike
02-17-2010, 03:14 PM
hello all,

before i sit down and create the tool myself, i was wondering if anyone has discovered a batch-input tool (command line would be nice, folder/directory scan, if not...) that will auto-magically strip out any AcroForm/Javascript embedded in a PDF ?

I would like to do this to correct the HUGE security hole that exists, so i do not accidentally pass on a file with a bug/virus embedded in a PDF. (it can be as simple as a 'tracker' or as sophisticated as a 'phone home w/ user email', etc).

I am currently using okular (i use a KDE desktop) set with heavy restrictions to open the file initially, and if an AcroForm is present, i then re-open the file with PDFedit to delete the form (where there are no restrictions set on the PDF file). this is rather tedious, to say the least.

and being the lazy person i am, i was hoping that others will have solved this problem for me !

and if anyone knows a good tool to remove the restrictions (batch would be nice...) on PDF files, that would also make things easier than PDF --> lrf --> PDF !

thank you, in advance, for any help/pointers,

-michael

frabjous
02-17-2010, 03:37 PM
The command line tool pdftk (http://www.accesspdf.com/pdftk/) should be able to remove the forms via "flatten".
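For the batch part, a minimal Python sketch of running "flatten" over a folder might look like the following. It assumes pdftk is on the PATH; the function names and folder layout are made up for illustration, and flattening only deals with the form fields, not necessarily any document-level javascript.

```python
import subprocess
from pathlib import Path


def flatten_cmd(src, dst):
    """Build the pdftk command line that flattens the form fields of src into dst."""
    return ["pdftk", str(src), "output", str(dst), "flatten"]


def flatten_folder(in_dir, out_dir):
    """Flatten every *.pdf in in_dir, writing same-named files into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        # check=True aborts the batch on the first file pdftk cannot process
        subprocess.run(flatten_cmd(pdf, out / pdf.name), check=True)
```

Calling flatten_folder("incoming", "flattened") would then leave the originals untouched and put the flattened copies in a separate directory.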

Not sure if it can be used to remove javascript. Hmm.... personally I'd probably try one of the following two things, but only because I don't know any better: either use pdflatex with the pdfpages package (http://ctan.org/pkg/pdfpages) to include the pdf in a "new one", which I'm pretty sure (not positive) would strip its javascript in the process, or use the ghostscript (http://pages.cs.wisc.edu/~ghost/) commands pdf2ps and ps2pdf to convert from pdf to ps and back again, which I think would have the effect of removing the javascript (and preserve a lot more than converting to lrf would!). Both could certainly be scripted from the command line for batch processing, e.g. in a bash script, easily enough.
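A rough sketch of the ghostscript round trip as a batch job, in Python rather than bash. It assumes the pdf2ps and ps2pdf wrapper scripts that ship with ghostscript are on the PATH; the function names are hypothetical, and whether the result is really script-free still needs checking by hand.

```python
import subprocess
import tempfile
from pathlib import Path


def roundtrip_cmds(src, dst, ps_path):
    """Return the two commands for PDF -> PostScript -> PDF."""
    return [
        ["pdf2ps", str(src), str(ps_path)],
        ["ps2pdf", str(ps_path), str(dst)],
    ]


def roundtrip(src, dst):
    """Convert src to PostScript in a temp dir, then back to PDF at dst."""
    with tempfile.TemporaryDirectory() as tmp:
        ps_path = Path(tmp) / (Path(src).stem + ".ps")
        for cmd in roundtrip_cmds(src, dst, ps_path):
            subprocess.run(cmd, check=True)


def roundtrip_folder(in_dir, out_dir):
    """Round-trip every *.pdf in in_dir into out_dir under the same name."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        roundtrip(pdf, out / pdf.name)
```

The intermediate .ps file lives in a temporary directory, so nothing is left behind apart from the new PDF.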

I'm sure there are better things to try, though.

Do you have a PDF with javascript in it I can test with?

(EDIT: I tested both methods with the javascript calculator PDF here (http://www.planetpdf.com/developer/article.asp?ContentID=6575) and both successfully broke the calculator, but I'm not sure whether any javascript was left.)

(EDIT 2: I uncompressed the results of both methods with pdftk and examined the results and didn't see any javascript in either, but I'm not the most competent judge.)

(EDIT 3: Someone cleverer than I could probably use pdftk to uncompress the PDF then use a command line text/stream editor like sed or awk to strip the javascript then recompress.)
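Something along the lines of EDIT 3 could be sketched like this, with Python standing in for sed/awk. It assumes pdftk is on the PATH and uses its uncompress and compress operations. Note the swapped-in technique: rather than deleting the script, it changes the case of the /JavaScript and /JS keys (same length, so nothing shifts) so that a viewer should no longer recognise them. The function names are hypothetical, and names obfuscated with # escapes would slip past this pattern, so treat it as a blunt instrument.

```python
import re
import subprocess
import tempfile
from pathlib import Path

# /JavaScript or /JS as a whole PDF name (not a prefix of a longer name such as /JSfoo)
_JS_NAMES = re.compile(rb"/(JavaScript|JS)(?![A-Za-z0-9])")


def disarm(data):
    """Swap the case of JavaScript-related names, keeping the byte length unchanged."""
    return _JS_NAMES.sub(lambda m: b"/" + m.group(1).swapcase(), data)


def strip_js(src, dst):
    """Uncompress src with pdftk, disarm the script keys, recompress into dst."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.pdf"
        clean = Path(tmp) / "clean.pdf"
        subprocess.run(["pdftk", str(src), "output", str(raw), "uncompress"], check=True)
        clean.write_bytes(disarm(raw.read_bytes()))
        subprocess.run(["pdftk", str(clean), "output", str(dst), "compress"], check=True)
```

That gives roughly the <command> <input pdf> <output pdf> shape: strip_js("in.pdf", "out.pdf").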

frabjous
02-17-2010, 04:19 PM
Aha!

Even better: PDF Java Script stripper (http://pdfjavascriptst.sourceforge.net/) -- a Java program; should be platform-independent.

(Actually, unless you're already familiar with iText, that looks a bit intimidating... well... take your pick.)

psychomike
02-18-2010, 12:15 PM
frabjous,

first thank you for the speedy response.

after looking at pdfstripper.jar, it does not look like the tool i was hoping for. i would have to rename my file before i run it, and rename it after it is run. it appears to be a wrapper around the itext class library, that is then in turn run by a wrapper around a java tool (ant). i was hoping for something a little 'cleaner' (i.e. <command> <input pdf> <output pdf> ).

(this is more of a philosophical thing. i am the lead architect on a 300k plus line java-based open source project...)

i am wondering if 'pdftk flatten' is the correct tool, as it has not been updated in quite some time (nov 2006), and might not be 'script aware' in most cases.

looks like my choices are (in no order)


convert to a better 'print ready' format, and back again.
bite the bullet and write a program that uses the itext class myself. (in all my copious free time not spent changing diapers, babysitting engineers - no, they are not the ones in diapers, and launching a startup.)
be the one to write the bash script that uses pdftk to uncompress --> filter out javascript using sed/awk/etc --> re-compress



*sigh*

and on a timely note, saw this article referenced on slashdot a few hours after i posted my plea for help:

Rogue PDFs account for 80% of all exploits, says researcher (http://www.computerworld.com/s/article/9157438/Rogue_PDFs_account_for_80_of_all_exploits_says_researcher)


from the article:

Computerworld - Just hours before Adobe is slated to deliver the latest patches for its popular PDF viewer, a security firm announced that by its counting, malicious Reader documents made up 80% of all exploits at the end of 2009.

According to ScanSafe of San Bruno, Calif., vulnerabilities in Adobe's Reader and Acrobat applications were the most frequently targeted of any software during 2009, with hackers' PDF exploits growing throughout the year.

In the first quarter of 2009, malicious PDF files made up 56% of all exploits tracked by ScanSafe. That figure climbed above 60% in the second quarter, over 70% in the third and finished at 80% in the fourth quarter.


looks like i need this tool more than ever !

-michael

frabjous
02-18-2010, 03:27 PM
Couldn't you automate the renaming and the rest of it with a bash script (or python/perl/sh, etc.)?
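Something like this, maybe. It is only a sketch of the renaming idea: I don't know the exact file names the stripper expects, so "input.pdf" and "output.pdf" below are placeholders, and tool_cmd is whatever command line actually runs the tool.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_with_fixed_names(tool_cmd, src, dst, fixed_in="input.pdf", fixed_out="output.pdf"):
    """Run a tool that insists on fixed file names, exposing a src -> dst interface.

    fixed_in and fixed_out are assumed names, not taken from any real tool.
    """
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        # copy the real file in under the name the tool expects
        shutil.copy(src, work / fixed_in)
        # run the tool inside the scratch directory
        subprocess.run(tool_cmd, cwd=work, check=True)
        # move the result out under the name the caller asked for
        shutil.move(str(work / fixed_out), str(dst))
```

Looping that over a folder would give the batch behaviour without touching the original files.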

Well, if you do decide to write something yourself, let us know!

The other methods I mentioned are probably good enough for my purposes.

What is this open source project if I may be nosy?