Old 09-03-2020, 05:34 PM   #1
MarjaE
Guru
Posts: 924
Karma: 53902736
Join Date: Jun 2015
Device: multiple
Splice PDF: A Script to improve readability by separating images from text

I've written a script to help with my pdf issues. It's written for the bash shell in macOS Automator, so it may require tweaks for other software.

The idea is to split each pdf into 3 parts and then splice them back together: the cover, which I've rasterized; the images from each page, again rasterized; and the text from each page, blackened and inserted after the images. This makes it easier for me to read the text, and makes it easier for the Kindle to handle the images regardless of how they've been constructed. It breaks tables of contents.

P.S. This does not work with scanned pdfs. I'd suggest using k2pdfopt -mode copy for that.

I've also written a variant with -dev dx after each k2pdfopt -mode copy, and with different output file names, for a grayscale output optimized for the Kindle Dx.

By default K2 increases contrast, so if you prefer not to, that's another tweak.

It requires Ghostscript, Cpdf, K2pdfopt, and Qpdf. Cpdf should be free for non-commercial use, but I'd still prefer an open source alternative to it, and it's no longer available via Homebrew.

I've installed k2pdfopt to ~/Applications and I've installed the others using Homebrew.

Each app seems to have slightly inconsistent standards for standard output and standard input. In the end, I instructed each one to export a set filename to a "Splice" folder, or import a set filename from there. I've been able to run the whole sequence that way, first splitting, then processing, and then splicing the pdf back together.

I haven't replaced all the older code where it used `...` instead of $(...); maybe eventually.

for f in "$@"
do
# Copy and Rasterize 1st page from source pdf using k2pdfopt
~/Applications/k2pdfopt -ui -mode copy -p 1 -x -o "/Users/Marja/Splice/RGBCover_copy.pdf" "$f" $@
# Copy text from source pdf file using Ghostscript, turn text black using Cpdf
# The color conversion strategy should help with the 2nd stage if I switch to Ghostscript
# - and -_ indicate standard output and input
# Due to compatibility issues, dumping to ~/Splice/Text.pdf
/usr/local/bin/gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -sColorConversionStrategy=RGB -sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Text.pdf" "$f" &&
/usr/local/bin/cpdf "/Users/Marja/Splice/Text.pdf" -blacktext -o "/Users/Marja/Splice/Blacktext.pdf"
# Copy images from same source pdf file using Ghostscript, rasterize images using K2pdfopt
# Due to compatibility issues, dumping to ~/Splice/Images.pdf
/usr/local/bin/gs -sDEVICE=pdfimage24 -dFILTERTEXT -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Images.pdf" "$f" &&
~/Applications/k2pdfopt -ui -mode copy -x -o "/Users/Marja/Splice/RGBImages_copy.pdf" "/Users/Marja/Splice/Images.pdf" $@ &&
# Splice files using qpdf
suffix="-SplicedColor.pdf"
base=`basename "$f" .pdf`
outputfile=$base$suffix
/usr/local/bin/qpdf --collate "/Users/Marja/Splice/RGBCover_copy.pdf" --pages "/Users/Marja/Splice/RGBCover_copy.pdf" "/Users/Marja/Splice/RGBImages_copy.pdf" "/Users/Marja/Splice/Blacktext.pdf" -- "$outputfile"
done
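
On the backticks mentioned before the script: as far as I know, backticks and $(...) do the same job here, and $(...) is just easier to read and to nest. A minimal sketch of the filename step in both forms. The function name spliced_name and the sample path are made up for illustration.

```shell
#!/bin/bash
# Build the output name for one input pdf, as the script above does.
# spliced_name is a hypothetical helper, not part of the original script.
spliced_name() {
    local f="$1"
    local suffix="-SplicedColor.pdf"
    local old new
    old=`basename "$f" .pdf`      # older backtick form, as in the script
    new=$(basename "$f" .pdf)     # equivalent $( ) form
    # Both forms should give the same result; print the $( ) one.
    [ "$old" = "$new" ] && printf '%s%s\n' "$new" "$suffix"
}

spliced_name "/Users/Marja/Books/My Book.pdf"
```

Quoting "$f" in both forms is what keeps file names with spaces in one piece.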

Last edited by MarjaE; 09-03-2020 at 05:45 PM.
Old 09-03-2020, 05:37 PM   #2
MarjaE
If anyone with more programming experience wants to rework this, feel free. A platform-independent and cpdf-independent version would be useful.
Old 09-03-2020, 06:24 PM   #3
j.p.s
Grand Sorcerer
Posts: 5,262
Karma: 98804578
Join Date: Apr 2011
Device: pb360
Quote:
Originally Posted by MarjaE
If anyone with more programming experience wants to rework this, feel free. A platform-independent and cpdf-independent version would be useful.
As an initial reaction, I suggest adding the following to the top of the script:
Code:
export K2PDFOPT_HOME=~/Applications
export OUTDIR=/Users/Marja/Splice
and replace all instances of "~/Applications" with "$K2PDFOPT_HOME"
and of /Users/Marja/Splice with "$OUTDIR"
(or any names you prefer). That way others (and you) only have to edit a couple of lines at the top to make a change in location of k2pdfopt and output directory.
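
A rough sketch of what that could look like for two of the steps, assuming the same locations as in the first post. CPDF is an extra variable added here for illustration, $HOME/Applications is used in place of ~/Applications, and the commands are only printed, not run, so the sketch works without the tools installed.

```shell
#!/bin/bash
# Locations to edit in one place. CPDF is an illustrative addition.
export K2PDFOPT_HOME="$HOME/Applications"
export OUTDIR=/Users/Marja/Splice
CPDF=/usr/local/bin/cpdf

# Print the blacktext command built from the variables (dry run only).
show_blacktext_cmd() {
    printf '%s\n' "$CPDF \"$OUTDIR/Text.pdf\" -blacktext -o \"$OUTDIR/Blacktext.pdf\""
}

# Print the cover command for one input file (dry run only).
show_cover_cmd() {
    printf '%s\n' "$K2PDFOPT_HOME/k2pdfopt -ui -mode copy -p 1 -x -o \"$OUTDIR/RGBCover_copy.pdf\" \"$1\""
}

show_blacktext_cmd
show_cover_cmd "book.pdf"
```

Once the printed lines look right, the same strings can be used as the real commands in the script.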
Old 09-03-2020, 06:38 PM   #4
MarjaE
Thank you!

P.S. I'm having some trouble with the broken tables of contents and with broken scaling.

A quick test shows that k2pdfopt -mode copy -n -toc- can cut the table of contents, but not correct the scaling. A Quartz filter can cut and correct, but it's platform-specific and doubles up the text. MuTool Clean can't cut or correct these. Printing to a new pdf should have much the same effect as running through a Quartz filter.

P.P.S. Also, running mutool clean -d -s -z at the end of the process scrambles some text by writing one line over another, but -g -g -g doesn't seem to cause trouble. Known bug with -s: https://bugs.ghostscript.com/show_bug.cgi?id=702715

P.P.P.S. Removing text from the image pages is hit-and-miss. I suspect k2 is starting before gs has finished. So I am looking at restructuring the script to (a) run a Quartz filter at the beginning, even if it's Mac-specific, (b) then run the Ghostscript stages, (c) then cpdf and k2, and (d) finally run qpdf.
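
As far as I know, bash runs foreground commands one after another, so a plain script should not start k2pdfopt until gs has exited. One way to check what is really happening is to have each stage report its exit status and stop the chain when a stage fails. A minimal sketch: run_stage is a hypothetical helper, and true stands in for the real gs, k2pdfopt, and qpdf commands.

```shell
#!/bin/bash
# run_stage NAME COMMAND [ARGS...]
# Runs one stage in the foreground, reports its exit status,
# and returns that status so the caller can stop the chain.
run_stage() {
    local name="$1"
    shift
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "$name failed with status $rc" >&2
    else
        echo "$name finished"
    fi
    return "$rc"
}

# Stand-ins for the real commands: each stage starts only after the
# previous one has exited successfully.
run_stage "gs images" true &&
run_stage "k2pdfopt images" true &&
run_stage "qpdf splice" true
```

If a stage prints "failed", the later stages never run, which makes it easier to see whether the problem is timing or one tool choking on its input.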

Last edited by MarjaE; 09-04-2020 at 02:07 AM.
Old 09-04-2020, 11:57 PM   #5
MarjaE
A Mac-specific implementation, optimizing for the Kindle Dx. It works in Mojave. I'm not sure if it will work in Catalina due to Apple's ongoing cuts to Automator:

1. Install Benwiggy's PDFSuite, pypy, pyobjc for Python 2, ghostscript, k2pdfopt, cpdf, and qpdf.

2. Open Automator and create a new App.

3. Add a Run Shell Script action 7 times, each using bash and passing input as arguments. By splitting this into 7 shell scripts, we can help make sure the Mac finishes each step before starting the next. You'll need to substitute your preferred location for your k2pdfopt app, for some other apps, and for your Splice folder. I don't think the export code above will be suitable with so many short scripts.

for f in "$@"
do
# Strip any table of contents and fit text to page sizes to avoid any scaling issues
/usr/local/bin/python /Users/Marja/Library/Services/quartzfilter.py "$f" "/System/Library/Filters/Lightness Increase.qfilter" "/Users/Marja/Splice/Light.pdf"
done

for f in "$@"
do
# Copy and Rasterize 1st page from source pdf using k2pdfopt
~/Applications/k2pdfopt -ui -mode copy -dev dx -p 1 -x -o "/Users/Marja/Splice/DxCover_dx.pdf" "/Users/Marja/Splice/Light.pdf" $@
done

for f in "$@"
do
# Copy images from same source pdf file using Ghostscript, rasterize images using K2pdfopt
# - and -_ indicate standard output and input
# Due to compatibility issues, dumping to ~/Splice/Images.pdf
/usr/local/bin/gs -sDEVICE=pdfimage24 -dFILTERTEXT -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Images.pdf" "/Users/Marja/Splice/Light.pdf"
done

for f in "$@"
do
# Copy text from source pdf file using Ghostscript, turn text black using Cpdf
# The color conversion strategy should help with the 2nd stage if I switch to Ghostscript
# Due to compatibility issues, dumping to ~/Splice/Text.pdf
/usr/local/bin/gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -sColorConversionStrategy=RGB -sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Text.pdf" "/Users/Marja/Splice/Light.pdf"
done

for f in "$@"
do
# Rasterize the images that Ghostscript extracted to ~/Splice/Images.pdf, using k2pdfopt
~/Applications/k2pdfopt -ui -mode copy -dev dx -x -o "/Users/Marja/Splice/DxImages_dx.pdf" "/Users/Marja/Splice/Images.pdf" $@
done

for f in "$@"
do
# Turn the text that Ghostscript extracted to ~/Splice/Text.pdf black, using cpdf
/usr/local/bin/cpdf "/Users/Marja/Splice/Text.pdf" -blacktext -o "/Users/Marja/Splice/Blacktext.pdf"
done

for f in "$@"
do
# Splice files using qpdf and date so new runs won't overwrite old ones
/usr/local/bin/qpdf --collate "/Users/Marja/Splice/DxCover_dx.pdf" --pages "/Users/Marja/Splice/DxCover_dx.pdf" "/Users/Marja/Splice/DxImages_dx.pdf" "/Users/Marja/Splice/Blacktext.pdf" -- /Users/Marja/Splice/"SplicedDx$(date "+%Y.%m.%d-%H.%M.%S").pdf"
done

The 3rd shell script can take a long while.

I've experimented with the PDFSuite 150 and 300 dpi filters, but depending on the source pdfs these often crash due to memory pressure. Even this version will occasionally crash.

I've not been able to keep the original filename as an element in the final one.
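
One possible way to keep the original name, sketched under the assumption that $f still holds the source pdf when the final qpdf step runs (which may not be true when the work is split across several Automator actions): take the basename of $f and put the date after it. dx_output_name is a hypothetical helper, and its second argument is only there so the result can be checked with a fixed date.

```shell
#!/bin/bash
# dx_output_name SOURCE_PDF [STAMP]
# Prints an output path that keeps the source file's name and adds a
# date stamp, so new runs won't overwrite old ones.
dx_output_name() {
    local f="$1"
    local stamp="${2:-$(date "+%Y.%m.%d-%H.%M.%S")}"
    local outdir="/Users/Marja/Splice"
    local base
    base=$(basename "$f" .pdf)
    printf '%s/%s-SplicedDx%s.pdf\n' "$outdir" "$base" "$stamp"
}

# Example with a fixed stamp:
dx_output_name "/Users/Marja/Books/Field Guide.pdf" "2020.09.04-23.57.00"
```

The qpdf line would then end with -- "$(dx_output_name "$f")" instead of the fixed SplicedDx name.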
Old 09-09-2020, 11:05 PM   #6
MarjaE
P.S. Using a single bash shell, with wait on a separate line between each pair of commands, works better than multiple shells.
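
For what it's worth, my understanding is that wait only waits for jobs started in the background with &, and returns right away after a normal foreground command. So it should be harmless here, and the improvement may come from running everything in a single shell. A small sketch, with sleep standing in for a slow tool such as gs; demo_wait and the marker file are made up for illustration.

```shell
#!/bin/bash
# demo_wait uses sleep as a stand-in for a slow tool such as gs.
demo_wait() {
    local workdir
    workdir=$(mktemp -d)

    # Foreground: the check below does not run until both commands have finished.
    sleep 1; echo done > "$workdir/marker"
    [ -f "$workdir/marker" ] && echo "foreground: output is ready"
    rm -f "$workdir/marker"

    # Background (note the &): wait is what makes the shell pause here.
    { sleep 1; echo done > "$workdir/marker"; } &
    wait
    [ -f "$workdir/marker" ] && echo "background, after wait: output is ready"

    rm -rf "$workdir"
}

demo_wait
```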
Old 10-03-2020, 01:10 AM   #7
MarjaE
Here's an updated Mac implementation. It differs from my 1st draft in 3 respects:

1. It adds a Quartz step at the beginning, to reduce the risk of losing text information, and of sizing issues.

2. It adds the wait steps.

3. It adds another k2pdfopt step at the end, to remove the now-useless tables of contents.

I don't have the programming knowledge for j.p.s's suggestions.

Requires Automator with a shell script, using bash, and passing input as arguments (Mac-specific), Quartz (Mac-specific, but other apps may accomplish the same goals in Linux and Windows), Python 3, a couple scripts from Benwiggy's PDFSuite edited to work with Python 3, ghostscript, Willus's k2pdfopt, cpdf, and qpdf.
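
Before a long run, the command-line requirements could be checked with something like the following. This is only a sketch: missing_tools is a hypothetical helper, the command names are assumed to be the ones on PATH, and k2pdfopt in ~/Applications and the Quartz filter script would each need their own check.

```shell
#!/bin/bash
# missing_tools NAME...
# Prints each name that cannot be found on PATH; prints nothing if all
# are found. Returns non-zero if anything is missing.
missing_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# The command names below are assumptions about how the tools are installed.
if not_found=$(missing_tools python3 gs cpdf qpdf); then
    echo "all tools found"
else
    echo "missing:" $not_found
fi
```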

for f in "$@"
do
# Strip any table of contents and fit text to page sizes to avoid any scaling issues
/usr/local/bin/python3 /Users/Marja/Library/Services/quartzfilter3.py "$f" "/Users/Marja/Library/Filters/Generic RGB.qfilter" "/Users/Marja/Splice/GRGB.pdf"
wait
# Copy and Rasterize 1st page from source pdf using k2pdfopt
~/Applications/k2pdfopt -ui -mode copy -p 1 -x -o "/Users/Marja/Splice/Cover_rgb.pdf" "/Users/Marja/Splice/GRGB.pdf" $@
wait
# Copy images from same source pdf file using Ghostscript, rasterize images using K2pdfopt
# Due to compatibility issues, dumping to ~/Splice/Images.pdf
/usr/local/bin/gs -sDEVICE=pdfimage24 -dFILTERTEXT -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Images.pdf" "/Users/Marja/Splice/GRGB.pdf"
wait
~/Applications/k2pdfopt -ui -mode copy -x -o "/Users/Marja/Splice/Images_rgb.pdf" "/Users/Marja/Splice/Images.pdf" $@
wait
# Copy text from source pdf file using Ghostscript, turn text black using Cpdf
# The color conversion strategy should help with the 2nd stage if I switch to Ghostscript
# - and -_ indicate standard output and input
# Due to compatibility issues, dumping to ~/Splice/Text.pdf
/usr/local/bin/gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -sColorConversionStrategy=RGB -sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Text.pdf" "/Users/Marja/Splice/GRGB.pdf"
wait
/usr/local/bin/cpdf "/Users/Marja/Splice/Text.pdf" -blacktext -o "/Users/Marja/Splice/Blacktext.pdf"
wait
# Splice files using qpdf
/usr/local/bin/qpdf --collate "/Users/Marja/Splice/Cover_rgb.pdf" --pages "/Users/Marja/Splice/Cover_rgb.pdf" "/Users/Marja/Splice/Images_rgb.pdf" "/Users/Marja/Splice/Blacktext.pdf" -- /Users/Marja/Splice/SplicedRGB.pdf
wait
# Remove any table of contents, since it won't fit the spliced pdf
suffix="-SplicedRgbG.pdf"
base=`basename "$f" .pdf`
outputfile=$base$suffix
~/Applications/k2pdfopt -ui -mode copy -n -toc- -o /Users/Marja/Splice/"$outputfile" /Users/Marja/Splice/SplicedRGB.pdf $@
done
Old 11-07-2020, 03:01 PM   #8
MarjaE
It helps to drop the trailing $@ from the k2pdfopt lines where I've included it above.
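
The reason, as far as I can tell: inside for f in "$@", a trailing $@ expands to every file passed to the script, so each k2pdfopt call gets all the source pdfs as extra inputs on top of the one it should work on. A small sketch, with a hypothetical show_args function standing in for k2pdfopt:

```shell
#!/bin/bash
# show_args stands in for k2pdfopt: it prints how many arguments it got.
show_args() {
    echo "$# argument(s): $*"
}

# demo simulates the script being called with the given files.
demo() {
    local f
    for f in "$@"; do
        show_args -o out.pdf "$f" $@   # with trailing $@ (unquoted, as posted)
        show_args -o out.pdf "$f"      # with $@ dropped
    done
}

demo a.pdf b.pdf
```

With two input files, the first form passes 5 arguments per call and the second passes 3.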