12-08-2021, 08:02 AM | #1 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
Crop all pages, and remove footnote added in some?
Hello,
I bought the PDF of a book that is now long out of print. I'd like to run an application that can:
1. Crop all pages to remove the empty space around each page
2. Remove the license string that was added to some pages.
Is there a free/open-source application (qpdf, cpdf, mutool, etc.) that could automate the process? Thank you.
--
Edit: I tried this as an experiment, but the output isn't cropped:
Code:
strings input.pdf | grep "Box"
/CropBox [0.0 0.0 285.36 436.8]
/MediaBox [0.0 0.0 285.36 436.8]
etc.

NOCHANGE
gs -sDEVICE=pdfwrite -sOutputFile=output.pdf -dBATCH -dNOPAUSE -c "<</ColorImageFilter /FlateEncode>> setdistillerparams" -f input.pdf -c "[ /CropBox [ 0 0 250 400] /PAGES pdfmark" -f
Last edited by Shohreh; 12-08-2021 at 10:34 AM. |
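The `strings | grep` check can also be done from Python. This is a minimal sketch, assuming the page dictionaries are stored uncompressed in the file (boxes hidden inside compressed object streams will not be found). `find_boxes` and the sample bytes are made up for illustration.

```python
import re

# Matches entries such as "/CropBox [0.0 0.0 285.36 436.8]".
BOX_RE = re.compile(rb"/(CropBox|MediaBox)\s*\[\s*([-\d.\s]+?)\s*\]")


def find_boxes(pdf_bytes):
    """Return (box name, [x0, y0, x1, y1]) pairs found in raw PDF bytes."""
    boxes = []
    for name, numbers in BOX_RE.findall(pdf_bytes):
        values = [float(n.decode("ascii")) for n in numbers.split()]
        boxes.append((name.decode("ascii"), values))
    return boxes


# Hypothetical fragment of a page dictionary, modelled on the grep output.
sample = (b"<< /Type /Page /CropBox [0.0 0.0 285.36 436.8] "
          b"/MediaBox [0.0 0.0 285.36 436.8] >>")

for name, box in find_boxes(sample):
    print(name, box)
```

For a real file, the bytes would come from `open("input.pdf", "rb").read()`. If the CropBox printed for the output still equals the MediaBox, the crop was not applied.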
12-08-2021, 12:20 PM | #2 |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
This looks very similar to your May 2020 thread:
https://www.mobileread.com/forums/sh...hlight=croppdf
Did you find a solution for that? One advantage of the croppdf script I recommended there is that you can set different margins for odd and even pages. |
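To illustrate the odd/even idea, here is a minimal sketch that shrinks a page box by separate margins depending on page parity. It is not the croppdf script itself: the function names and margin values are invented, and the page size is the MediaBox quoted in the first post.

```python
def crop_box(media_box, left, bottom, right, top):
    """Shrink a PDF box [x0, y0, x1, y1] by the given margins (in points)."""
    x0, y0, x1, y1 = media_box
    box = [x0 + left, y0 + bottom, x1 - right, y1 - top]
    if box[0] >= box[2] or box[1] >= box[3]:
        raise ValueError("margins are larger than the page")
    return [round(v, 2) for v in box]


# Assumed values: page size from the first post, margins invented.
MEDIA_BOX = [0.0, 0.0, 285.36, 436.8]
ODD_MARGINS = dict(left=30, bottom=20, right=10, top=20)   # wider margin on the left
EVEN_MARGINS = dict(left=10, bottom=20, right=30, top=20)  # mirrored for even pages


def box_for_page(page_number):
    """Pick the margins by page parity (page numbers start at 1)."""
    margins = ODD_MARGINS if page_number % 2 == 1 else EVEN_MARGINS
    return crop_box(MEDIA_BOX, **margins)


for n in (1, 2):
    print(n, box_for_page(n))
```

The resulting boxes would still have to be applied by a PDF tool; the sketch only covers the arithmetic.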
12-08-2021, 03:27 PM | #3 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
Thanks, I forgot about it.
This does the trick:
Code:
cpdf.exe -crop "0 70 340.2 462.12" input.pdf -o output.pdf
From what I read, a PDF is a list of objects, with an index at the end. Is there no way for a script/application to go through that list of objects, find those that contain a given string, and remove them from the list?
The string I want to remove probably lives in the text layer that was added when the scanned document was run through OCR, so that the user can select/copy text instead of having just a bitmap. |
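On the question of walking the object list, here is a rough sketch. It assumes a simple file where each object is written as `N G obj ... endobj` and streams use only the Flate filter. `objects_containing`, `make_obj` and the sample data are made up for illustration. A real PDF may also pack objects into object streams, which this does not handle. Deleting an object would also mean rewriting the index at the end of the file, which is better left to a PDF library.

```python
import re
import zlib

# "N G obj ... endobj" is how an indirect object is written in a PDF file.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# Simplification: a real parser would use the /Length entry instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def objects_containing(pdf_bytes, needle):
    """Return the numbers of objects whose (inflated) body contains needle."""
    hits = []
    for match in OBJ_RE.finditer(pdf_bytes):
        number, body = int(match.group(1)), match.group(3)
        stream = STREAM_RE.search(body)
        if stream and b"/FlateDecode" in body:
            try:
                # decompressobj tolerates the end-of-line left after the data
                body += zlib.decompressobj().decompress(stream.group(1))
            except zlib.error:
                pass  # not plain Flate data; search the raw body only
        if needle in body:
            hits.append(number)
    return hits


def make_obj(number, text):
    """Build a hypothetical Flate-compressed stream object for the demo."""
    data = zlib.compress(text)
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        number, len(data))
    return header + data + b"\nendstream\nendobj\n"


sample = (b"%PDF-1.4\n"
          + make_obj(1, b"BT (Chapter one) Tj ET")
          + make_obj(2, b"BT (Offending string) Tj ET"))

print(objects_containing(sample, b"Offending string"))
```

Because stream data is usually compressed, a plain byte search on the file tends to miss text that is visibly there on the page, which is why the sketch inflates each stream before searching.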
12-11-2021, 09:04 AM | #4 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
As a quick solution, I downloaded the trial version of Adobe Acrobat Pro DC and manually removed all occurrences of the offending string. Unless I missed it, that application doesn't seem to have a search-and-replace feature.
I'll keep investigating, though, since it could come in handy. For some reason, PyPDF2 fails to find the string:
Code:
# pip install PyPDF2 (written against the PyPDF2 1.x API; later releases renamed these methods)
import re

import PyPDF2

INPUTFILE = "input.pdf"
STRING = "Offending string"

reader = PyPDF2.PdfFileReader(INPUTFILE)
for i in range(reader.getNumPages()):
    print("this is page " + str(i))
    # Extract the page text and search it for the literal string
    text = reader.getPage(i).extractText()
    print(re.search(re.escape(STRING), text))
|
12-11-2021, 12:03 PM | #5 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
Found it: You must first decompress the PDF:
Code:
mutool.exe clean -d -a original.pdf original.decompressed.pdf
Then blank the offending strings with PyPDF2:
Code:
# -*- coding: latin-1 -*-
# Written against the PyPDF2 1.x API (later releases renamed or moved these names)
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import ContentStream
from PyPDF2.generic import TextStringObject, NameObject
from PyPDF2.utils import b_

string1 = "Licence blah"
string2 = "blah blah blah"

# Load PDF for reading
source = PdfFileReader("original.decompressed.pdf")
output = PdfFileWriter()

# Iterate through each page
for page_number in range(source.getNumPages()):
    print("Handling page ", page_number)
    page = source.getPage(page_number)
    content_object = page["/Contents"].getObject()
    content = ContentStream(content_object, source)

    # Iterate over all PDF operations on the current page
    for operands, operator in content.operations:
        if operator == b_("Tj"):
            print("Found")
            text = operands[0]
            if isinstance(text, TextStringObject) and (text.startswith(string1) or text.startswith(string2)):
                print("Replace")
                # Blank the string rather than deleting the operation
                operands[0] = TextStringObject("")

    # Put the edited content stream back on the page
    page.__setitem__(NameObject("/Contents"), content)
    output.addPage(page)

with open("output.decompressed.pdf", "wb") as outputStream:
    output.write(outputStream)
Finally, recompress the result:
Code:
mutool.exe convert -O compress -o recompressed.pdf output.decompressed.pdf
Edit: Ghostscript is fast and gives an even slightly smaller file than the original:
Code:
gswin32c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
Last edited by Shohreh; 12-11-2021 at 05:30 PM. |
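For readers wondering what the script is matching: in a decompressed content stream, text is typically drawn with operations such as `(some text) Tj`. Below is a simplified sketch of the same idea applied directly to the bytes of a content stream. It assumes plain parenthesised strings with no escaped or nested parentheses. `blank_text` and the sample stream are hypothetical, and this is not a replacement for the PyPDF2 script.

```python
import re

# "(...) Tj" shows a literal string; this pattern skips strings that
# contain escapes or nested parentheses.
TJ_RE = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def blank_text(stream, prefixes):
    """Empty every Tj string that starts with one of the given prefixes."""
    def replace(match):
        if match.group(1).startswith(prefixes):
            return b"() Tj"
        return match.group(0)  # leave other text untouched
    return TJ_RE.sub(replace, stream)


# Hypothetical content stream with one normal line and one licence line.
sample = (b"BT /F1 12 Tf (Chapter one) Tj ET\n"
          b"BT /F1 8 Tf (Licence blah 1234) Tj ET\n")

cleaned = blank_text(sample, (b"Licence blah", b"blah blah blah"))
print(cleaned.decode("ascii"))
```

As in the script, the string is blanked instead of removing the whole operation, so the rest of the stream stays syntactically intact.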