MobileRead Forums - View Single Post - Crop all pages, and remove footnote added in some?

Shohreh · 12-11-2021, 01:03 PM

Found it: You must first decompress the PDF:

Code:

mutool.exe clean -d -a original.pdf original.decompressed.pdf

This worked to replace the two strings with empty strings:

Code:

# -*- coding: latin-1 -*-

# Uses the PyPDF2 1.x API (PdfFileReader, getPage, ...)
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import ContentStream
from PyPDF2.generic import TextStringObject, NameObject
from PyPDF2.utils import b_

# Text at the start of the footnote lines to blank out
string1 = "Licence  blah"
string2 = "blah blah blah"

# Load PDF for reading
source = PdfFileReader("original.decompressed.pdf")
output = PdfFileWriter()

# Iterate through each page
for page_number in range(source.getNumPages()):
    print("Handling page", page_number)
    page = source.getPage(page_number)
    content_object = page["/Contents"].getObject()
    content = ContentStream(content_object, source)
    # Iterate over all PDF operations on the current page
    for operands, operator in content.operations:
        # Tj is the "show text" operator
        if operator == b_("Tj"):
            print("Found")
            text = operands[0]
            if isinstance(text, TextStringObject) and text.startswith((string1, string2)):
                print("Replace")
                operands[0] = TextStringObject("")
    # Swap the modified content stream back into the page
    page[NameObject("/Contents")] = content
    output.addPage(page)

# The with block closes the file once the PDF is written
with open("output.decompressed.pdf", "wb") as output_stream:
    output.write(output_stream)
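
One caveat with the loop above: it only looks at the Tj operator. PDF also has TJ, which shows an array of string pieces mixed with kerning numbers, so a footnote drawn that way would be missed. Here is a minimal sketch of the same matching logic extended to TJ. It runs on plain Python lists standing in for content.operations, and the function name blank_matching_text is made up for illustration, not part of PyPDF2:

```python
# Sketch only: plain Python stand-ins for PyPDF2's content.operations,
# where each entry is (operands, operator) and operators are bytes.

def blank_matching_text(operations, prefixes):
    """Blank out shown text that starts with any prefix; return hit count."""
    prefixes = tuple(prefixes)
    hits = 0
    for operands, operator in operations:
        if operator == b"Tj":
            # Tj takes a single string operand
            if isinstance(operands[0], str) and operands[0].startswith(prefixes):
                operands[0] = ""
                hits += 1
        elif operator == b"TJ":
            # TJ takes one array mixing string pieces and kerning numbers
            pieces = operands[0]
            joined = "".join(p for p in pieces if isinstance(p, str))
            if joined.startswith(prefixes):
                operands[0] = []
                hits += 1
    return hits


# Hypothetical sample data, not taken from a real PDF
operations = [
    (["Licence  blah 2021"], b"Tj"),
    ([["bl", -20, "ah blah blah, page 3"]], b"TJ"),
    (["Chapter 1"], b"Tj"),
]
count = blank_matching_text(operations, ["Licence  blah", "blah blah blah"])
print(count, operations)
```

In the real script the strings would be TextStringObject instances and the TJ operand a PyPDF2 array object, so the isinstance checks and the replacement values would need adapting. I have not tried that part against an actual file.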

Recompressing with mutool (command below) gives a file ~7x bigger than the original. I also tried a couple more tools (qpdf and cpdf), but they barely made a difference:

Code:

mutool.exe convert -O compress -o recompressed.pdf output.decompressed.pdf
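
To put a number on "how much bigger" when comparing tools, a small helper like this can be used. It is only a sketch: size_ratio is a made-up name, and the demo writes throwaway files in a temp directory instead of touching real PDFs:

```python
import os
import tempfile


def size_ratio(candidate_path, original_path):
    """Return the candidate's size divided by the original's size."""
    return os.path.getsize(candidate_path) / os.path.getsize(original_path)


# Demo with dummy files standing in for the real PDFs
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.pdf")
    recompressed = os.path.join(tmp, "recompressed.pdf")
    with open(original, "wb") as f:
        f.write(b"x" * 1000)
    with open(recompressed, "wb") as f:
        f.write(b"x" * 7000)
    ratio = size_ratio(recompressed, original)
    print(f"recompressed is {ratio:.1f}x the original")
```

With the real files, call size_ratio("recompressed.pdf", "original.pdf") directly.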

---
Edit: Ghostscript is fast and gives a file even slightly smaller than the original:

Code:

gswin32c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
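
For anyone who wants to chain the steps, here is a rough sketch that builds the two external commands from this post as argument lists for subprocess. The function names and the file names in the demo are my own placeholders; the flags are copied from the commands above. Nothing is executed unless run_pipeline is called:

```python
import subprocess


def build_pipeline(original, decompressed, edited, final):
    """Return the mutool and Ghostscript commands, in run order."""
    decompress = ["mutool.exe", "clean", "-d", "-a", original, decompressed]
    recompress = [
        "gswin32c.exe", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/default", "-dNOPAUSE", "-dQUIET", "-dBATCH",
        "-sOutputFile=" + final, edited,
    ]
    return decompress, recompress


def run_pipeline(decompress, recompress):
    """Run both commands; needs mutool and Ghostscript installed."""
    subprocess.run(decompress, check=True)
    # The Python script that blanks the strings goes here,
    # reading the decompressed file and writing the edited one.
    subprocess.run(recompress, check=True)


decompress, recompress = build_pipeline(
    "original.pdf",
    "original.decompressed.pdf",
    "output.decompressed.pdf",
    "output.pdf",
)
print(" ".join(decompress))
print(" ".join(recompress))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.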