Crop all pages, and remove footnote added in some?

Shohreh · 12-08-2021, 09:02 AM

Hello,

I bought the PDF of a book that is now long out of print.

I'd like to run an application that can…
1. Crop all pages to remove empty spaces around each page
2. Remove the license string that was added to some pages.

Is there a free/open-source (qpdf, cpdf, mutool, etc.) application that could automate the process?

Thank you.

--
Edit: I tried this as an experiment, but the output is not cropped:

Code:

strings input.pdf | grep "Box"
/CropBox [0.0 0.0 285.36 436.8]
/MediaBox [0.0 0.0 285.36 436.8]
etc.

# NOCHANGE (the output is not cropped):
gs -sDEVICE=pdfwrite -sOutputFile=output.pdf -dBATCH -dNOPAUSE -c "<</ColorImageFilter /FlateEncode>> setdistillerparams" -f input.pdf -c "[ /CropBox [ 0 0 250 400] /PAGES pdfmark" -f

LibreOffice won't do because the output is a bit messed up (the OCR text in another layer shows on the side while it's hidden in the original.) From experience, LO Draw isn't very good at editing PDFs.

j.p.s · 12-08-2021, 01:20 PM

This looks very similar to your May 2020 thread
https://www.mobileread.com/forums/sh...hlight=croppdf
Did you find a solution for that?

One advantage of the croppdf script I recommended there is that you can set different margins for odd and even pages.

Shohreh · 12-08-2021, 04:27 PM

Thanks, I forgot about it.

This does the trick…

Code:

cpdf.exe -crop "0 70 340.2 462.12" input.pdf -o output.pdf

… but it obviously also removes the bottom of each page.
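
To avoid guessing the numbers, the crop rectangle can be derived from the page box and the margins you want to trim. This is only a sketch: it assumes cpdf's -crop takes "x y width height" in points (my reading of its manual, and consistent with the command above cutting 70 points off the bottom), and `crop_rect` and the margin values are made up for illustration.

```python
def crop_rect(media_box, left=0.0, bottom=0.0, right=0.0, top=0.0):
    """Return an "x y width height" string for a page trimmed by the given margins.

    media_box is (llx, lly, urx, ury) in points, as printed in /MediaBox.
    """
    llx, lly, urx, ury = media_box
    x = llx + left
    y = lly + bottom
    width = (urx - llx) - left - right
    height = (ury - lly) - bottom - top
    if width <= 0 or height <= 0:
        raise ValueError("margins are larger than the page")

    def fmt(value):
        # Two decimals at most, without trailing zeros ("396.80" -> "396.8")
        return f"{value:.2f}".rstrip("0").rstrip(".")

    return " ".join(fmt(v) for v in (x, y, width, height))


# Hypothetical margins; the box is the one reported by `strings | grep "Box"`
print(crop_rect((0.0, 0.0, 285.36, 436.8), left=10, bottom=20, right=10, top=20))
# prints: 10 20 265.36 396.8
```

The resulting string is what would go between the quotes of `-crop`. Setting `bottom=0` keeps the bottom of each page, at the cost of leaving whatever sits in that margin.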

From what I read, a PDF is a list of objects, with an index at the end.

Is there no way for a script/application to go through that list of objects, find those that contain a given string, and remove them from the list?

The string I want to remove probably lives in the text layer that was added after running the scanned document through an OCR so the user can select/copy instead of having just a bitmap.
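
On the idea of walking the objects and looking for a string: a raw byte search (like `strings | grep`) tends to miss page text because content streams are usually Flate-compressed. Below is a minimal, stdlib-only sketch of scanning each stream after decompressing it. `build_fake_pdf`, `NEEDLE` and the regex-based stream finder are made up for illustration; a real tool should use a proper PDF parser and the /Length entry rather than a regex.

```python
import re
import zlib

NEEDLE = b"Licensed to Jane Doe"  # hypothetical license string


def build_fake_pdf(text):
    """Build a tiny PDF-like buffer holding one Flate-compressed content stream."""
    content = b"BT /F1 8 Tf 20 10 Td (" + text + b") Tj ET"
    packed = zlib.compress(content)
    return (
        b"%PDF-1.4\n4 0 obj\n<< /Length " + str(len(packed)).encode()
        + b" /Filter /FlateDecode >>\nstream\n" + packed
        + b"\nendstream\nendobj\n"
    )


# Crude stream finder: everything between "stream" and "endstream"
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def streams_containing(pdf_bytes, needle):
    """Return the indexes of the streams whose decoded data contains needle."""
    hits = []
    for index, match in enumerate(STREAM_RE.finditer(pdf_bytes)):
        raw = match.group(1)
        try:
            # decompressobj() tolerates the end-of-line bytes left after the data
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not zlib data (or another filter): search it as-is
        if needle in data:
            hits.append(index)
    return hits


pdf = build_fake_pdf(NEEDLE)
print(NEEDLE in pdf)                    # raw search over the file bytes
print(streams_containing(pdf, NEEDLE))  # search after decompressing
```

Finding the stream is the easy half. Removing the text safely still means rewriting the content stream and the cross-reference table, which is what the PDF libraries are for.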

Shohreh · 12-11-2021, 10:04 AM

As a quick solution, I downloaded the trial version of Adobe Acrobat Pro DC and manually removed all occurrences of the offending string. Unless I missed it, that application doesn't seem to have a search-and-replace feature.

I'll keep investigating, though, since it could come in handy.

For some reason, PyPDF2 fails to find the string:

Code:

# pip install "PyPDF2<3"  (PdfFileReader and the camelCase methods were removed in PyPDF2 3.0)
import re

import PyPDF2

INPUT_FILE = "input.pdf"
NEEDLE = "Offending string"

reader = PyPDF2.PdfFileReader(INPUT_FILE)

for i in range(reader.getNumPages()):
	page = reader.getPage(i)
	print("this is page " + str(i))
	text = page.extractText()
	# re.escape() so the string is matched literally, not as a regex
	print(re.search(re.escape(NEEDLE), text))
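
Independently of what went wrong here, literal matching against extracted text is brittle: OCR layers often come back with extra spaces or line breaks between words. A small sketch of a whitespace-tolerant search follows; `loose_pattern` is a hypothetical helper, not part of PyPDF2.

```python
import re


def loose_pattern(phrase):
    """Compile a regex that matches phrase literally, allowing any run of
    whitespace (spaces, tabs, newlines) between its words."""
    words = phrase.split()
    if not words:
        raise ValueError("phrase must contain at least one word")
    # re.escape() keeps characters such as "(" or "." from acting as regex syntax
    return re.compile(r"\s+".join(re.escape(word) for word in words))


pattern = loose_pattern("Offending string")
print(bool(pattern.search("page text Offending \n  string more text")))
```

The compiled pattern can replace the `re.search(...)` call in the loop, e.g. `pattern.search(text)` on each page's extracted text.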

Shohreh · 12-11-2021, 01:03 PM

Found it: You must first decompress the PDF:

Code:

mutool.exe clean -d -a original.pdf original.decompressed.pdf

This worked to replace the two strings with empty strings:

Code:

# -*- coding: latin-1 -*-
# Written against the legacy PyPDF2 1.x API (PdfFileReader, PyPDF2.pdf.ContentStream)

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import ContentStream
from PyPDF2.generic import TextStringObject, NameObject
from PyPDF2.utils import b_

string1 = "Licence  blah"
string2 = "blah blah blah"

# Load the decompressed PDF for reading
source = PdfFileReader("original.decompressed.pdf")
output = PdfFileWriter()

# Iterate over each page
for page_number in range(source.getNumPages()):
	print("Handling page", page_number)
	page = source.getPage(page_number)
	content_object = page["/Contents"].getObject()
	content = ContentStream(content_object, source)
	# Iterate over all content-stream operations on the current page
	for operands, operator in content.operations:
		# Only Tj (show a single string) is handled; TJ arrays are left untouched
		if operator == b_("Tj"):
			print("Found")
			text = operands[0]
			if isinstance(text, TextStringObject) and text.startswith((string1, string2)):
				print("Replace")
				operands[0] = TextStringObject("")
	page[NameObject("/Contents")] = content
	output.addPage(page)

# The with-block closes the file once the PDF has been written
with open("output.decompressed.pdf", "wb") as output_stream:
	output.write(output_stream)
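
One limitation of the script above: it only looks at the Tj operator. PDF text can also be drawn with TJ, whose operand is an array mixing string fragments and kerning numbers, so a license line written that way would be missed. Here is a sketch of handling both, using plain Python lists and strings in place of the PyPDF2 objects. `blank_text_ops` is a hypothetical helper, and the assumption that the operations are `(operands, operator)` pairs with bytes operators is taken from the script above.

```python
def blank_text_ops(operations, prefixes):
    """Blank the Tj/TJ operands whose text starts with one of prefixes.

    operations is a list of (operands, operator) pairs, mirroring the shape
    of content.operations in the script above, but with plain Python types.
    Returns the number of operands that were blanked.
    """
    blanked = 0
    for operands, operator in operations:
        if operator == b"Tj" and isinstance(operands[0], str):
            text = operands[0]
        elif operator == b"TJ" and isinstance(operands[0], list):
            # Join the string fragments, skipping the kerning numbers
            text = "".join(item for item in operands[0] if isinstance(item, str))
        else:
            continue
        if text.startswith(tuple(prefixes)):
            operands[0] = "" if operator == b"Tj" else []
            blanked += 1
    return blanked


ops = [
    (["Licence  blah 123"], b"Tj"),
    ([["Lic", -20, "ence  blah"]], b"TJ"),
    (["Chapter 1"], b"Tj"),
    ([12, 0], b"Td"),
]
print(blank_text_ops(ops, ["Licence  blah", "blah blah blah"]))
```

With real PyPDF2 objects, the replacement values would need to be the library's own string and array types rather than `""` and `[]`.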

The file recompressed with mutool (command below) is ~7x bigger than the original. I also tried qpdf and cpdf, but they barely changed the size:

Code:

mutool.exe convert -O compress -o recompressed.pdf output.decompressed.pdf

---
Edit: Ghostscript is fast and produces a file that is even slightly smaller than the original:

Code:

gswin32c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
