Old 12-08-2021, 08:02 AM   #1
Shohreh
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
Crop all pages, and remove footnote added in some?

Hello,

I bought the PDF of a book that is now long out of print.

I'd like to run an application that can…
1. Crop all pages to remove empty spaces around each page
2. Remove the license string that was added to some pages.

Is there a free/open-source (qpdf, cpdf, mutool, etc.) application that could automate the process?

Thank you.



--
Edit: Tried this as an experiment, but it's not cropped

Code:
strings input.pdf | grep "Box"
/CropBox [0.0 0.0 285.36 436.8]
/MediaBox [0.0 0.0 285.36 436.8]
etc.

# NOCHANGE: this ran, but the pages were not cropped
gs -sDEVICE=pdfwrite -sOutputFile=output.pdf -dBATCH -dNOPAUSE -c "<</ColorImageFilter /FlateEncode>> setdistillerparams" -f input.pdf -c "[ /CropBox [ 0 0 250 400] /PAGES pdfmark" -f
LibreOffice won't do because its output is a bit messed up (the OCR text, which sits in a separate layer and is hidden in the original, shows up on the side). From experience, LO Draw isn't very good at editing PDFs.

Last edited by Shohreh; 12-08-2021 at 10:34 AM.
Old 12-08-2021, 12:20 PM   #2
j.p.s
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
This looks very similar to your May 2020 thread
https://www.mobileread.com/forums/sh...hlight=croppdf
Did you find a solution for that?

One advantage of the croppdf script I recommended there is that you can set different margins for odd and even pages.
Old 12-08-2021, 03:27 PM   #3
Shohreh
Thanks, I forgot about it.

This does the trick…
Code:
cpdf.exe -crop "0 70 340.2 462.12" input.pdf -o output.pdf
… but it obviously also removes the bottom of each page.
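To experiment with margins without doing the arithmetic by hand, a small helper can turn the MediaBox printed by `strings` into a crop rectangle with an independent margin on each side. This is only a sketch: `crop_rectangle` and `as_cpdf_argument` are made-up names, the margins are arbitrary, and it assumes cpdf's `-crop` expects "x y width height" in points (worth checking against the cpdf manual). It does not remove the license string by itself; it only makes it easier to try different bottom margins.

```python
def crop_rectangle(media_box, left=0.0, bottom=0.0, right=0.0, top=0.0):
    """Return (x, y, width, height) after trimming margins from a MediaBox.

    media_box is (llx, lly, urx, ury) in points, as printed in the PDF.
    """
    llx, lly, urx, ury = media_box
    x = llx + left
    y = lly + bottom
    width = (urx - right) - x
    height = (ury - top) - y
    if width <= 0 or height <= 0:
        raise ValueError("margins are larger than the page")
    return (x, y, width, height)


def as_cpdf_argument(rect):
    """Format the rectangle as one space-separated argument string."""
    return " ".join(f"{value:g}" for value in rect)


# MediaBox taken from the first post; the margins are arbitrary examples.
rect = crop_rectangle((0.0, 0.0, 285.36, 436.8), left=10, bottom=20, right=10, top=20)
print(as_cpdf_argument(rect))
```

The printed string can then be pasted between the quotes of the `-crop` option.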

From what I read, a PDF is a list of objects, with an index at the end.

Is there no way for a script/application to go through that list of objects, find those that contain a given string, and remove them from the list?

The string I want to remove probably lives in the text layer that was added when the scanned document was run through OCR, so that the user can select/copy text instead of having just a bitmap.
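The "walk the list of objects" idea can be sketched with the standard library alone. The sketch assumes the stream contents are stored uncompressed and that the string is written as a literal `(...)` string rather than hex or a custom font encoding; `find_objects_containing` is a hypothetical helper, not part of any PDF tool. It only reports the matching object numbers, because actually deleting objects would also mean fixing the references to them and the index at the end of the file.

```python
import re

# "N G obj ... endobj" is how an indirect object appears in the file body.
OBJECT_PATTERN = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def find_objects_containing(pdf_bytes, needle):
    """Return the object numbers whose raw bytes contain needle."""
    hits = []
    for match in OBJECT_PATTERN.finditer(pdf_bytes):
        object_number = int(match.group(1))
        body = match.group(3)
        if needle in body:
            hits.append(object_number)
    return hits


# Tiny hand-written fragment standing in for an uncompressed PDF body.
sample = (
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Length 40 >>\nstream\n"
    b"BT /F1 9 Tf (Licensed to someone) Tj ET\n"
    b"endstream\nendobj\n"
    b"3 0 obj\n<< /Length 30 >>\nstream\n"
    b"BT /F1 9 Tf (Chapter 1) Tj ET\n"
    b"endstream\nendobj\n"
)
print(find_objects_containing(sample, b"Licensed to"))
```

On a real file, the bytes would come from `open("input.pdf", "rb").read()`.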
Old 12-11-2021, 09:04 AM   #4
Shohreh
As a quick solution, I downloaded the trial version of Adobe Acrobat Pro DC and manually removed all occurrences of the offending string. Unless I missed it, that application doesn't seem to have a search-and-replace feature.

I'll keep investigating, though, since it could come in handy.

For some reason, PyPDF2 fails to find the string:
Code:
# pip install PyPDF2==1.26.0  (legacy 1.x API; later releases renamed these calls)
import re

import PyPDF2

INPUTFILE = "input.pdf"
SEARCH_STRING = "Offending string"

reader = PyPDF2.PdfFileReader(INPUTFILE)

for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    print("this is page " + str(i))
    text = page.extractText()
    # re.escape() makes the string match literally, even if it contains regex metacharacters
    match = re.search(re.escape(SEARCH_STRING), text)
    print(match)
Old 12-11-2021, 12:03 PM   #5
Shohreh
Found it: You must first decompress the PDF:

Code:
mutool.exe clean -d -a original.pdf original.decompressed.pdf
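A note on why decompression matters for byte-level searches such as `strings | grep`: page content streams are normally stored Flate-compressed (zlib), so the literal bytes of the text are not in the file until it is inflated. The sketch below shows this with the standard `zlib` module on a made-up content stream. It illustrates the general effect only and is not a description of what mutool does internally.

```python
import zlib

# Made-up content stream: two lines of text drawn with the Tj operator.
content_stream = (
    b"BT /F1 9 Tf 72 20 Td (Offending string) Tj ET\n"
    b"BT /F1 9 Tf 72 40 Td (Offending string) Tj ET\n"
)
needle = b"Offending string"

# What a /FlateDecode stream holds on disk: zlib-compressed bytes.
compressed = zlib.compress(content_stream)

print(needle in compressed)                   # searched as stored
print(needle in zlib.decompress(compressed))  # searched after inflating
```

The first search comes back False and the second True, which is the same difference as searching the original file versus the decompressed one.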
This worked to replace the two strings with empty strings:

Code:
# -*- coding: latin-1 -*-
# Written against the legacy PyPDF2 1.x API (pip install PyPDF2==1.26.0)

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import ContentStream
from PyPDF2.generic import TextStringObject, NameObject
from PyPDF2.utils import b_

string1 = "Licence  blah"
string2 = "blah blah blah"

# Load PDF for reading
source = PdfFileReader("original.decompressed.pdf")
output = PdfFileWriter()

# Iterate through each page
for page_number in range(source.getNumPages()):
    print("Handling page ", page_number)
    page = source.getPage(page_number)
    content_object = page["/Contents"].getObject()
    content = ContentStream(content_object, source)
    # Iterate over all content-stream operations on the current page
    for operands, operator in content.operations:
        # Tj is the "show text" operator; its single operand is the string
        if operator == b_("Tj"):
            print("Found")
            text = operands[0]
            if isinstance(text, TextStringObject) and text.startswith((string1, string2)):
                print("Replace")
                operands[0] = TextStringObject("")
    # Swap the page's content stream for the edited one
    page[NameObject("/Contents")] = content
    output.addPage(page)

# The with block closes the file once the PDF has been written
with open("output.decompressed.pdf", "wb") as output_stream:
    output.write(output_stream)
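One limitation worth noting: the script only checks the `Tj` operator. PDF text can also be drawn with `TJ`, whose single operand is an array mixing string pieces and kerning numbers, so a string split that way would be missed. The sketch below shows the matching logic for both operators on plain Python data shaped like `content.operations`. `blank_matching_text` is a hypothetical helper, the sample values are invented, and wiring it back to PyPDF2's own object types is left untested here.

```python
def blank_matching_text(operations, prefixes):
    """Blank text operands that start with one of prefixes.

    operations is a list of (operands, operator) pairs shaped like
    ContentStream.operations, but holding plain str/list/bytes values.
    Returns the number of operands that were blanked.
    """
    prefixes = tuple(prefixes)
    replaced = 0
    for operands, operator in operations:
        if operator == b"Tj":
            # Tj: a single string operand
            if isinstance(operands[0], str) and operands[0].startswith(prefixes):
                operands[0] = ""
                replaced += 1
        elif operator == b"TJ":
            # TJ: one array operand mixing strings and kerning numbers
            pieces = operands[0]
            joined = "".join(p for p in pieces if isinstance(p, str))
            if joined.startswith(prefixes):
                operands[0] = []
                replaced += 1
    return replaced


# Hand-built stand-in for a page's operations (values are invented).
ops = [
    ([], b"BT"),
    (["Licence  blah 12345"], b"Tj"),
    ([["Lic", -20, "ence  blah 678"]], b"TJ"),
    (["Chapter 1"], b"Tj"),
    ([], b"ET"),
]
print(blank_matching_text(ops, ["Licence  blah", "blah blah blah"]))
print(ops)
```

Joining the string pieces before comparing is what lets the `TJ` case match even when kerning splits the text mid-word.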
The recompressed file is ~7x bigger than the original. I also tried a couple more tools (qpdf and cpdf), but they barely made a difference:

Code:
mutool.exe convert -O compress -o recompressed.pdf output.decompressed.pdf
---
Edit: Ghostscript is fast and produces a file that is even slightly smaller than the original:

Code:
gswin32c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
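When comparing several recompression tools, it can help to express each result as a ratio of the original size. A minimal sketch, with `size_ratio` as a made-up helper and throwaway files standing in for the real PDFs:

```python
import os
import tempfile


def size_ratio(original_path, candidate_path):
    """Return candidate size divided by original size (1.0 means equal)."""
    return os.path.getsize(candidate_path) / os.path.getsize(original_path)


# Demo with two throwaway files; swap in the real PDF paths in practice.
with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "original.pdf")
    candidate = os.path.join(folder, "recompressed.pdf")
    with open(original, "wb") as handle:
        handle.write(b"x" * 1000)
    with open(candidate, "wb") as handle:
        handle.write(b"x" * 7000)
    print(f"{size_ratio(original, candidate):.1f}x the original size")
```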

Last edited by Shohreh; 12-11-2021 at 05:30 PM.