| 
			
			 | 
		#1 | 
| 
			
			
			
			 Addict 
			
			Posts: 222 
				Karma: 304158 
				Join Date: Jan 2016 
				Location: France 
				
				
				Device: none 
				
				
				 | 
	
	
	
		
		
			
			 
			
			Hello,

			I bought the PDF of a book that is now long out of print. I'd like to run an application that can:
			1. Crop all pages to remove the empty space around each page
			2. Remove the license string that was added to some pages

			Is there a free/open-source application (qpdf, cpdf, mutool, etc.) that could automate the process?

			Thank you.

			--
			Edit: I tried this as an experiment, but the output is not cropped:

			Code:
	strings input.pdf | grep "Box"
	/CropBox [0.0 0.0 285.36 436.8]
	/MediaBox [0.0 0.0 285.36 436.8]
	etc.

	NOCHANGE
	gs -sDEVICE=pdfwrite -sOutputFile=output.pdf -dBATCH -dNOPAUSE -c "<</ColorImageFilter /FlateEncode>> setdistillerparams" -f input.pdf -c "[ /CropBox [ 0 0 250 400] /PAGES pdfmark" -f

			Last edited by Shohreh; 12-08-2021 at 11:34 AM.  | 
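For anyone without `strings` and `grep` at hand, the same check can be sketched in plain Python. This is only an illustration: `BOX_PATTERN`, `find_boxes` and `box_size` are names made up here, and the approach assumes the page dictionaries are stored uncompressed in the file (boxes hidden inside compressed object streams will not show up).

```python
import re

# Matches entries such as "/CropBox [0.0 0.0 285.36 436.8]" in raw PDF bytes.
BOX_PATTERN = re.compile(
    rb"/(MediaBox|CropBox|TrimBox|BleedBox|ArtBox)\s*\[\s*([-+0-9.\s]+?)\s*\]"
)


def find_boxes(raw):
    """Return (box name, [llx, lly, urx, ury]) pairs found in raw PDF bytes."""
    boxes = []
    for name, numbers in BOX_PATTERN.findall(raw):
        try:
            values = [float(n) for n in numbers.split()]
        except ValueError:
            continue  # not four plain numbers, skip this entry
        if len(values) == 4:
            boxes.append((name.decode("ascii"), values))
    return boxes


def box_size(values):
    """Width and height in points of a [llx, lly, urx, ury] box."""
    llx, lly, urx, ury = values
    return (urx - llx, ury - lly)


# Sample bytes standing in for the contents of input.pdf (hypothetical data).
sample = b"<< /CropBox [0.0 0.0 285.36 436.8] /MediaBox [0.0 0.0 285.36 436.8] >>"
for name, values in find_boxes(sample):
    print(name, values, box_size(values))
```

To run it on a real file, read the bytes with `open("input.pdf", "rb").read()` and pass them to `find_boxes`. Comparing the boxes before and after a crop attempt shows quickly whether the tool changed anything.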
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | 
| 
			
			
			
			 Grand Sorcerer 
			
			Posts: 5,842 
				Karma: 105494725 
				Join Date: Apr 2011 
				
				
				
				Device: pb360 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			This looks very similar to your May 2020 thread:
	https://www.mobileread.com/forums/sh...hlight=croppdf

			Did you find a solution for that? One advantage of the croppdf script I recommended there is that you can set different margins for odd and even pages.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | 
| 
			
			
			
			 Addict 
			
			Posts: 222 
				Karma: 304158 
				Join Date: Jan 2016 
				Location: France 
				
				
				Device: none 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Thanks, I forgot about it.

			This does the trick:

			Code:
	cpdf.exe -crop "0 70 340.2 462.12" input.pdf -o output.pdf

			From what I read, a PDF is a list of objects with an index at the end. Is there no way for a script/application to go through that list of objects, find those that contain a given string, and remove them from the list?

			The string I want to remove probably lives in the text layer that was added when the scanned document was run through OCR, so that the user can select and copy text instead of having just a bitmap.  | 
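The numbers in the `-crop` argument can be worked out from the page box and the margins to trim. A minimal sketch, assuming cpdf reads the rectangle as "x y width height" in points; `crop_argument` and `fmt` are hypothetical helpers, not part of cpdf:

```python
def fmt(value):
    """Format a number with at most two decimals and no trailing zeros."""
    return f"{value:.2f}".rstrip("0").rstrip(".")


def crop_argument(media_box, left, bottom, right, top):
    """Build an "x y width height" string from a page box and four margins.

    media_box is (llx, lly, urx, ury) in points, as listed in the PDF.
    """
    llx, lly, urx, ury = media_box
    x = llx + left
    y = lly + bottom
    width = (urx - llx) - left - right
    height = (ury - lly) - bottom - top
    if width <= 0 or height <= 0:
        raise ValueError("margins are larger than the page")
    return " ".join(fmt(v) for v in (x, y, width, height))


# Trim 10 pt left/right and 20 pt top/bottom from a 285.36 x 436.8 pt page.
print(crop_argument((0.0, 0.0, 285.36, 436.8), 10, 20, 10, 20))
```

The printed string can then be pasted between the quotes after `-crop`. Under the same assumption, "0 70 340.2 462.12" reads as a rectangle starting 70 pt above the bottom edge, 340.2 pt wide and 462.12 pt tall.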
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#4 | 
| 
			
			
			
			 Addict 
			
			Posts: 222 
				Karma: 304158 
				Join Date: Jan 2016 
				Location: France 
				
				
				Device: none 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			As a quick solution, I downloaded the trialware of Adobe Acrobat Pro DC and manually removed all occurrences of the offending string. Unless I missed it, that application doesn't seem to have a search-and-replace feature.

			I'll keep investigating, though, since it could come in handy. For some reason, PyPDF2 fails to find the string:

			Code:
# pip install PyPDF2  (written against the PyPDF2 1.x API)
import re
import PyPDF2

INPUTFILE = "input.pdf"
String = "Offending string"

# "reader" rather than "object", which would shadow the Python builtin
reader = PyPDF2.PdfFileReader(INPUTFILE)
NumPages = reader.getNumPages()
for i in range(NumPages):
	PageObj = reader.getPage(i)
	print("this is page " + str(i))
	Text = PageObj.extractText()
	# re.escape() makes the search literal rather than a regular expression
	ResSearch = re.search(re.escape(String), Text)
	print(ResSearch)
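One possible reason for the miss (a guess, not something confirmed in this thread) is that extracted text rarely comes back with the same spacing and line breaks as the string typed into the script. A small sketch of a more forgiving comparison, using only the standard library; `normalize` and `contains_phrase` are hypothetical helper names:

```python
import re


def normalize(text):
    """Collapse every run of whitespace to one space and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_phrase(page_text, phrase):
    """True if phrase occurs in page_text once spacing and case are ignored."""
    return normalize(phrase) in normalize(page_text)


# Hypothetical extraction result with a stray line break and extra spaces.
extracted = "Some heading\nOffending \n  string on this page"
print(contains_phrase(extracted, "Offending string"))
```

In the loop above, `contains_phrase(Text, String)` could stand in for the `re.search` call. If it still reports nothing, printing `repr(Text)` for one page shows what the extractor really returned.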
 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#5 | 
| 
			
			
			
			 Addict 
			
			Posts: 222 
				Karma: 304158 
				Join Date: Jan 2016 
				Location: France 
				
				
				Device: none 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Found it: You must first decompress the PDF:

			Code:
	mutool.exe clean -d -a original.pdf original.decompressed.pdf

			Then blank the text runs with PyPDF2:

			Code:
# -*- coding: latin-1 -*-
# Written against the PyPDF2 1.x API (PdfFileReader, PyPDF2.pdf, PyPDF2.utils)
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import ContentStream
from PyPDF2.generic import TextStringObject, NameObject
from PyPDF2.utils import b_

# Text runs that start with either of these strings are blanked
string1 = "Licence  blah"
string2 = "blah blah blah"

# Load PDF for reading
source = PdfFileReader("original.decompressed.pdf")
output = PdfFileWriter()

# Iterating through each page
for page_number in range(source.getNumPages()):
	print("Handling page ", page_number)
	page = source.getPage(page_number)
	content_object = page["/Contents"].getObject()
	content = ContentStream(content_object, source)
	# Iterating over all PDF operations on the current page
	for operands, operator in content.operations:
		# Tj is the "show text" operator; its single operand is the string
		if operator == b_("Tj"):
			print("Found")
			text = operands[0]
			if isinstance(text, TextStringObject) and text.startswith((string1, string2)):
				print("Replace")
				# Keep the operator but show an empty string instead
				operands[0] = TextStringObject("")
	# Replace the page's content stream with the edited one
	page[NameObject("/Contents")] = content
	output.addPage(page)

# The with block closes the file once the PDF has been written
with open("output.decompressed.pdf", "wb") as outputStream:
	output.write(outputStream)

			Finally, recompress the result:

			Code:
	mutool.exe convert -O compress -o recompressed.pdf output.decompressed.pdf

			Edit: Ghostscript is fast and produces a file that is even slightly smaller than the original:

			Code:
	gswin32c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

			Last edited by Shohreh; 12-11-2021 at 06:30 PM.  | 
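The script only looks at the `Tj` operator. PDF content streams can also show text with `TJ`, whose single operand is an array mixing strings and numeric spacing adjustments, so a string written that way would slip through. The sketch below shows the same filtering idea on plain Python data; the `(operands, operator)` tuples are a stand-in for the list the script iterates over, not real PyPDF2 objects, and `blank_matching_text` is a hypothetical name.

```python
def blank_matching_text(operations, prefixes):
    """Return a new operations list with matching text runs emptied.

    operations: list of (operands, operator) pairs using plain str/list
    values as a stand-in for parsed content-stream operations.
    prefixes: strings; a text run starting with any of them is blanked.
    """
    prefixes = tuple(prefixes)
    cleaned = []
    for operands, operator in operations:
        operands = list(operands)  # copy so the input list is left untouched
        if operator == "Tj" and operands and isinstance(operands[0], str):
            if operands[0].startswith(prefixes):
                operands[0] = ""
        elif operator == "TJ" and operands:
            # Join only the string pieces; numbers are spacing adjustments.
            joined = "".join(p for p in operands[0] if isinstance(p, str))
            if joined.startswith(prefixes):
                operands[0] = []
        cleaned.append((operands, operator))
    return cleaned


# Hypothetical page content: two Tj runs, one TJ run, one non-text operation.
page_ops = [
    (["Licence  blah 123"], "Tj"),
    (["Chapter 1"], "Tj"),
    ([["blah ", -20, "blah blah"]], "TJ"),
    ([1, 0, 0, 1, 0, 0], "cm"),
]
for operands, operator in blank_matching_text(page_ops, ["Licence  blah", "blah blah blah"]):
    print(operator, operands)
```

Blanking the operand rather than deleting the whole operation keeps the rest of the stream (fonts, positioning, graphics state) exactly as it was, which is also what the script above does.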
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 