Found it: You must first decompress the PDF:
Code:
mutool.exe clean -d -a original.pdf original.decompressed.pdf
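A rough way to check that the decompression step took effect is to count how many /FlateDecode filter names remain in the raw file. This is only a heuristic sketch: the function name count_flate_filters and the sample byte strings are made up for illustration, and the count may not drop to zero if some streams are left compressed.

```python
def count_flate_filters(pdf_bytes):
    """Count /FlateDecode filter names in raw PDF bytes (rough heuristic)."""
    return pdf_bytes.count(b"/FlateDecode")


# Hypothetical fragments standing in for real file contents.
compressed_sample = b"<< /Length 42 /Filter /FlateDecode >> stream ... endstream"
decompressed_sample = b"<< /Length 120 >> stream BT (Licence blah) Tj ET endstream"

print(count_flate_filters(compressed_sample))
print(count_flate_filters(decompressed_sample))
```

For a real file, read it with open("original.decompressed.pdf", "rb").read() and pass the bytes to the function.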
This worked to replace the two strings with empty strings:
Code:
# -*- coding: latin-1 -*-
# Written against the legacy PyPDF2 1.x API (PdfFileReader, PyPDF2.pdf)
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import ContentStream
from PyPDF2.generic import TextStringObject, NameObject
from PyPDF2.utils import b_

# Strings starting with either of these prefixes get blanked
string1 = "Licence blah"
string2 = "blah blah blah"

# Load PDF for reading
source = PdfFileReader("original.decompressed.pdf")
output = PdfFileWriter()

# Iterate through each page
for page_number in range(source.getNumPages()):
    print("Handling page", page_number)
    page = source.getPage(page_number)
    content_object = page["/Contents"].getObject()
    content = ContentStream(content_object, source)

    # Iterate over all PDF operations on the current page
    for operands, operator in content.operations:
        if operator == b_("Tj"):
            print("Found")
            text = operands[0]
            if isinstance(text, TextStringObject) and (text.startswith(string1) or text.startswith(string2)):
                print("Replace")
                operands[0] = TextStringObject("")

    # Store the modified content stream back on the page
    page[NameObject("/Contents")] = content
    output.addPage(page)

# Write the result; the with block closes the file afterwards
with open("output.decompressed.pdf", "wb") as output_stream:
    output.write(output_stream)
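The matching logic can be tried in isolation, without PyPDF2 or a PDF file. The sketch below is not part of the script: blank_matching_text is a made-up helper, and the list of (operands, operator) pairs only imitates the shape of content.operations, with plain str standing in for TextStringObject. It also makes one limitation visible: only Tj operations are touched, so text drawn with the TJ operator (an array of string pieces) would not be matched.

```python
def blank_matching_text(operations, prefixes):
    """Blank the operand of every Tj operation whose text starts with a prefix.

    operations: list of (operands, operator) pairs; operators are bytes.
    Returns the number of strings that were blanked.
    """
    replaced = 0
    for operands, operator in operations:
        # Skip everything that is not a simple "show text" operation
        if operator != b"Tj" or not operands:
            continue
        text = operands[0]
        if isinstance(text, str) and text.startswith(tuple(prefixes)):
            operands[0] = ""
            replaced += 1
    return replaced


# Hypothetical operations imitating one page's content stream
ops = [
    (["Licence blah 2020"], b"Tj"),
    (["Chapter 1"], b"Tj"),
    ([12, 0], b"Td"),
    (["blah blah blah and more"], b"Tj"),
]
count = blank_matching_text(ops, ["Licence blah", "blah blah blah"])
print(count, ops)
```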
Recompressing with the command below gives a file ~7x bigger than the original; I also tried a couple more tools (qpdf and cpdf), but they barely reduced the size:
Code:
mutool.exe convert -O compress -o recompressed.pdf output.decompressed.pdf
---
Edit:
Ghostscript is fast and produces a file that is even slightly smaller than the original:
Code:
gswin32c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
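If this step has to run from a script, the same command can be wrapped with subprocess. This is a minimal sketch, assuming gswin32c.exe is on the PATH; the helper names build_gs_command and recompress are made up here, and the flags are copied unchanged from the command above.

```python
import os
import subprocess


def build_gs_command(input_pdf, output_pdf, gs_executable="gswin32c.exe"):
    """Return the Ghostscript call from the post as a subprocess argument list."""
    return [
        gs_executable,
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/default",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        "-sOutputFile=" + output_pdf,
        input_pdf,
    ]


def recompress(input_pdf, output_pdf, gs_executable="gswin32c.exe"):
    """Run Ghostscript and return output size divided by input size."""
    subprocess.run(build_gs_command(input_pdf, output_pdf, gs_executable), check=True)
    return os.path.getsize(output_pdf) / os.path.getsize(input_pdf)


if __name__ == "__main__":
    # Only attempt the run when the input file actually exists
    if os.path.exists("input.pdf"):
        ratio = recompress("input.pdf", "output.pdf")
        print("output is {:.2f}x the size of the input".format(ratio))
```

The returned ratio makes it easy to compare tools: a value below 1 means the output is smaller than the input.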