|  02-17-2010, 03:14 PM | #1 | 
| Junior Member  Posts: 2 Karma: 10 Join Date: Nov 2008 Location: Silicon Valley Device: PRS-700, G1 | 
				
				"Cleansing" of PDF files
			 
			
			hello all,

			before i sit down and create the tool myself, i was wondering if anyone has discovered a batch-input tool (command line would be nice, with a folder/directory scan if possible) that will auto-magically strip out any AcroForm/Javascript embedded in a PDF ?

			I would like to do this to close the HUGE security hole that exists, so i do not accidentally pass on a file with a bug/virus embedded in a PDF. (it can be as simple as a 'tracker' or as sophisticated as a 'phone home w/ user email', etc).

			I am currently using okular (i use a KDE desktop), set with heavy restrictions, to open the file initially. if an AcroForm is present, i then re-open the file with PDFedit to delete the form (provided there are no restrictions set on the PDF file). this is rather tedious, to say the least. and being the lazy person i am, i was hoping that others will have solved this problem for me !

			and if anyone knows a good tool to remove the restrictions on PDF files (batch would be nice...), that would also make things easier than PDF --> lrf --> PDF !

			thank you, in advance, for any help/pointers,

			-michael | 
|   |   | 
|  02-17-2010, 03:37 PM | #2 | 
| Wizard            Posts: 1,213 Karma: 12890 Join Date: Feb 2009 Location: Amherst, Massachusetts, USA Device: Sony PRS-505 | 
			
			The command line tool pdftk should be able to remove the forms via "flatten". Not sure if it can be used to remove javascript.

			Hmm.... personally I'd probably try one of the following two things, but only because I don't know any better:

			use the pdflatex pdfpages package to include the pdf in a "new one", which I'm pretty sure (not positive) would be stripped of its javascript in the process (and which could certainly be scripted from the command line for batch processing), or

			use the ghostscript commands pdf2ps and ps2pdf to convert from pdf to ps and back again, which I think would have the effect of removing the javascript (and would preserve a lot more than converting to lrf would!).

			Both could be put in, e.g., a bash script easily enough. I'm sure there are better things to try, though. Do you have a PDF with javascript in it I can test with?

			(EDIT: I tested both methods with the javascript calculator PDF here and both successfully broke the calculator, but I'm not sure whether any javascript was left behind.)

			(EDIT 2: I uncompressed the results of both methods with pdftk and examined them and didn't see any javascript in either, but I'm not the most competent judge.)

			(EDIT 3: Someone cleverer than I could probably use pdftk to uncompress the PDF, then use a command line text/stream editor like sed or awk to strip the javascript, then recompress.)

			Last edited by frabjous; 02-17-2010 at 04:12 PM. | 
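			The two command-line routes described above (pdftk's "flatten", and the Ghostscript pdf2ps/ps2pdf round trip) could be batched over a folder roughly as follows. This is only a minimal Python sketch: the function names `flatten_cmd`, `roundtrip_cmds`, and `clean_folder` are made up for illustration, it assumes pdftk, pdf2ps, and ps2pdf are installed and on the PATH, and whether either route removes every trace of JavaScript is exactly the open question in this thread.

```python
import subprocess
from pathlib import Path


def flatten_cmd(src, dst):
    """Build the pdftk call that flattens form fields into page content."""
    return ["pdftk", str(src), "output", str(dst), "flatten"]


def roundtrip_cmds(src, dst):
    """Build the two Ghostscript wrapper calls: PDF -> PostScript -> PDF."""
    ps = Path(dst).with_suffix(".ps")  # intermediate PostScript file
    return [
        ["pdf2ps", str(src), str(ps)],
        ["ps2pdf", str(ps), str(dst)],
    ]


def clean_folder(in_dir, out_dir, run=subprocess.run):
    """Round-trip every *.pdf in in_dir and write the results to out_dir.

    `run` is injectable so the loop can be exercised without the real tools.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        dst = out_dir / src.name
        for cmd in roundtrip_cmds(src, dst):
            run(cmd, check=True)  # stop on the first failing tool
        # Remove the intermediate file (missing_ok needs Python 3.8+).
        dst.with_suffix(".ps").unlink(missing_ok=True)
        done.append(dst)
    return done
```

			Something like `clean_folder("incoming", "cleaned")` would then process a whole directory. Writing to a separate output folder keeps the originals untouched, which seems the safer default given that a PostScript round trip generally also discards interactive features such as form fields and bookmarks.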
|   |   | 
|  02-17-2010, 04:19 PM | #3 | 
| Wizard            Posts: 1,213 Karma: 12890 Join Date: Feb 2009 Location: Amherst, Massachusetts, USA Device: Sony PRS-505 | 
			
			Aha! Even better: PDF Java Script stripper -- a Java program; should be platform-independent. (Actually, unless you're already familiar with iText, that looks a bit intimidating... well... take your pick.) Last edited by frabjous; 02-17-2010 at 04:29 PM. | 
|   |   | 
|  02-18-2010, 12:15 PM | #4 | 
| Junior Member  Posts: 2 Karma: 10 Join Date: Nov 2008 Location: Silicon Valley Device: PRS-700, G1 | 
			
			frabjous,

			first, thank you for the speedy response.

			after looking at pdfstripper.jar, it does not look like the tool i was hoping for. i would have to rename my file before i run it, and rename it again after it is run. it appears to be a wrapper around the itext class library, which is in turn run by a wrapper around a java tool (ant). i was hoping for something a little 'cleaner' (i.e. <command> <input pdf> <output pdf> ). (this is more of a philosophical thing. i am the lead architect on a 300k plus line java-based open source project...)

			i am wondering if 'pdftk flatten' is the correct tool, as it has not been updated in quite some time (nov 2006), and might not be 'script aware' in most cases.

			looks like my choices are (in no order) 
 *sigh*

			and on a timely note, saw this article referenced on slashdot a few hours after i posted my plea for help. from the article:

			Computerworld -  Just hours before Adobe is slated to deliver the latest patches for its popular PDF viewer, a security firm announced that by its counting, malicious Reader documents made up 80% of all exploits at the end of 2009. According to ScanSafe of San Bruno, Calif., vulnerabilities in Adobe's Reader and Acrobat applications were the most frequently targeted of any software during 2009, with hackers' PDF exploits growing throughout the year. In the first quarter of 2009, malicious PDF files made up 56% of all exploits tracked by ScanSafe. That figure climbed above 60% in the second quarter, over 70% in the third and finished at 80% in the fourth quarter.

			looks like i need this tool more than ever !

			-michael | 
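			One way to judge whether a tool such as 'pdftk flatten' is 'script aware' is to look at what it leaves behind in the output file. The Python sketch below scans raw PDF bytes for a handful of name tokens that the PDF format uses for scripts, forms, and automatic actions. The list `SUSPECT_NAMES` and the function names are illustrative choices, not part of any existing tool. It assumes the file has already been uncompressed (for example with pdftk's "uncompress" operation), and it is a heuristic: names written with hex escapes, or anything inside an encrypted file, would slip past it.

```python
import re

# Name tokens associated with scripts, forms, and auto-run actions.
# This selection is an assumption for illustration, not an exhaustive list.
SUSPECT_NAMES = (b"/JavaScript", b"/JS", b"/AcroForm", b"/OpenAction", b"/AA")


def find_suspect_names(data: bytes):
    """Return the suspect names that occur as whole tokens in raw PDF bytes."""
    found = []
    for name in SUSPECT_NAMES:
        # A PDF name ends at whitespace or a delimiter, so require that the
        # next byte is not a regular name character (/JS must not hit /JSX).
        pattern = re.escape(name) + rb"(?![A-Za-z0-9_.#-])"
        if re.search(pattern, data):
            found.append(name.decode("ascii"))
    return found


def report(paths):
    """Map each path to the list of suspect names found in that file."""
    results = {}
    for path in paths:
        with open(path, "rb") as fh:
            results[path] = find_suspect_names(fh.read())
    return results
```

			Running `report(["before.pdf", "after.pdf"])` on an uncompressed original and on the output of a cleaning tool would show which markers survived. An empty list is only weak evidence that a file is clean, for the reasons given above.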
|   |   | 
|  02-18-2010, 03:27 PM | #5 | 
| Wizard            Posts: 1,213 Karma: 12890 Join Date: Feb 2009 Location: Amherst, Massachusetts, USA Device: Sony PRS-505 | 
			
			Couldn't you automate the renaming and the rest of it with a bash script (or python/perl/sh, etc.)? Well, if you do decide to write something yourself, let us know! The other methods I mentioned are probably good enough for my purposes. What is this open source project if I may be nosy? | 
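			The renaming could be automated along these lines. This Python sketch is only an outline under stated assumptions: `clean_in_place` is a made-up name, and `cleaner` stands for whatever command or function actually does the stripping (the jar, a Ghostscript round trip, and so on), called as `cleaner(src, dst)`. The wrapper handles just the temp-file and rename bookkeeping.

```python
import os
import tempfile
from pathlib import Path


def clean_in_place(folder, cleaner, keep_backup=True):
    """Run cleaner(src, dst) on every PDF under folder, then swap the result in.

    The original is kept as <name>.pdf.orig unless keep_backup is False.
    """
    cleaned = []
    # sorted() materialises the file list before any temp files are created.
    for src in sorted(Path(folder).rglob("*.pdf")):
        # Temp file in the same directory so the final rename stays on one
        # filesystem.
        fd, tmp_name = tempfile.mkstemp(suffix=".pdf", dir=src.parent)
        os.close(fd)
        tmp = Path(tmp_name)
        try:
            cleaner(src, tmp)
            if keep_backup:
                os.replace(src, src.with_name(src.name + ".orig"))
            os.replace(tmp, src)  # cleaned file takes over the original name
            cleaned.append(src)
        finally:
            # Only still present if the cleaner failed (Python 3.8+).
            tmp.unlink(missing_ok=True)
    return cleaned
```

			If the cleaner raises an error, the original file is left where it was and the temp file is removed, so a failed run should not leave half-renamed files behind.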
|   |   | 