MobileRead Forums - View Single Post

MarjaE · 03-13-2018, 12:15 PM

Hi,

I'm still working out how to handle pdfs. But I've had a lot of trial and error and I'd like to share.

First: Keep your originals. If you don't have enough disk space, I'd suggest storing some on an external drive, and setting Time Machine to back up the external drive as well as the main drive.

Second: Many pdfs encode images as jpeg2000. It takes less space than jpeg, but some Macs take longer to load pages from these pdfs, and Kindle 2s and other older readers can't load the images at all. It's particularly bad with scanned pdfs, where every page is an image, so the older readers can't display anything. To identify files with jpeg2000 and other jpx images, I use EasyFind and search file contents for "jpxdecode" (the filter name appears inside the file as JPXDecode).
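If you'd rather stay in the terminal, the same search can be sketched with find and grep. This is only a rough equivalent of the EasyFind search, not what I actually use; the function name find_jpx_pdfs is made up for the example, and it assumes the filter name is stored in the file as plain text.

```shell
#!/bin/bash
# List PDFs under a folder whose raw bytes mention the JPXDecode filter.
# find_jpx_pdfs is a hypothetical helper name, not part of any tool.
find_jpx_pdfs() {
    local dir="${1:-.}"
    # grep -a treats the binary PDF as text; -l prints only matching file names.
    find "$dir" -type f -iname '*.pdf' -exec grep -l -a 'JPXDecode' {} +
}
```

For example, find_jpx_pdfs ~/Documents lists every pdf in Documents that probably contains jpeg2000 images.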

Apple changed their Quartz decoder in Sierra, so it's much more reluctant to convert jpeg2000 images in pdf files to jpeg images. You'll need other tools for that conversion.

My suggestions:

-- Willus's k2pdfopt-- http://www.willus.com/k2pdfopt/

-- Homebrew-- https://brew.sh/ unless you use MacPorts instead.

-- Ghostscript-- can be installed through Homebrew

-- rwts-pdfwriter-- https://github.com/rodyager/RWTS-PDFwriter

-- cpdf-- can be installed through Homebrew

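-- ocrmypdf-- can be installed through Homebrew-- for extra languages you may need to brew uninstall tesseract and then brew install --all-languages tesseract (newer Homebrew versions dropped that option; there the extra languages come from brew install tesseract-lang)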

-- Automator-- comes with your computer and can help avoid typing and retyping terminal commands.
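A quick way to see which of the command-line tools are still missing is sketched below. The function name missing_tools is made up for the example, and the command names in the usage line (gs, k2pdfopt, cpdf, ocrmypdf, tesseract) are what I'd expect these tools to install as, so adjust them if yours differ.

```shell
#!/bin/bash
# Print the name of each listed command that is not on the PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Usage: missing_tools gs k2pdfopt cpdf ocrmypdf tesseract prints nothing when everything is installed.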

My workflow, more or less:

First, do I need to OCR the text? That's often the case with scanned texts, and occasionally with other texts due to text encoding errors.

If I need to OCR the text, then I need to use either ocrmypdf or Elucidate. For whatever reason, the resulting files don't play well with Ghostscript, so I will need to use k2pdfopt on them.

If I don't need to OCR the text, then is it raster or vector? Is any text pixelated?

If it's raster, and I don't mind more pixelation, don't mind losing colors, and don't mind resetting fold-out pages to the same size as other pages, then I can use k2pdfopt with decent compression.

If it's raster, and I do mind, I can use either k2pdfopt without compression, or Ghostscript converting to pdf 1.4.

If it's vector, I suggest Ghostscript converting to pdf 1.4.
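The decision steps above can be written down as a small shell function. It's only a sketch of the workflow as described; the name choose_tool and the yes/no and raster/vector arguments are invented for the example.

```shell
#!/bin/bash
# Suggest a tool, following the workflow described in this post.
# $1 = does the text need OCR? (yes/no)
# $2 = page content (raster/vector)
# $3 = is losing colors and resolution acceptable? (yes/no)
# choose_tool and its arguments are hypothetical names for illustration.
choose_tool() {
    local needs_ocr="$1" kind="$2" lossy_ok="$3"
    if [ "$needs_ocr" = "yes" ]; then
        echo "ocrmypdf, then k2pdfopt"
    elif [ "$kind" = "vector" ]; then
        echo "ghostscript to pdf 1.4"
    elif [ "$lossy_ok" = "yes" ]; then
        echo "k2pdfopt -mode copy -dev dx"
    else
        echo "k2pdfopt -mode copy, or ghostscript to pdf 1.4"
    fi
}
```

So a scanned book that already has a good text layer, where color matters, would be choose_tool no raster no.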

My command-line codes:

For OCRing text:

ocrmypdf -l lan --force-ocr input.pdf output.pdf

-l lan: replace lan with the 3-letter code for the language, such as deu or fra. If you skip this, it defaults to English.

--force-ocr overwrites existing text layers. If the file has a Google Books intro but no text layer afterwards, or the file has a bad text layer, this is useful.

input.pdf: I tend to drag and drop the file from the Finder into the terminal window.

output.pdf: It is written to the terminal's current folder, which is your user folder in a new terminal window.
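Before running ocrmypdf with a language code, it can help to check that Tesseract actually has that language installed. The sketch below reads the output of tesseract --list-langs; the function name has_lang is made up, and I'm assuming the listing has one language code per line.

```shell
#!/bin/bash
# Succeed if the language code in $1 appears as a whole line on stdin.
# Meant to be fed the listing from `tesseract --list-langs`.
# has_lang is a hypothetical helper name.
has_lang() {
    grep -qx -- "$1"
}
```

Usage: tesseract --list-langs 2>&1 | has_lang deu && ocrmypdf -l deu --force-ocr input.pdf output.pdf (the 2>&1 is there because some older Tesseract versions print the list to stderr).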

For k2pdfopt with compression:

k2pdfopt -mode copy -dev dx
input.pdf

-dev dx sets it to reformat everything for the Kindle DX. There are other codes for some other devices.

I hit enter after the codes here, and then drag and drop the input file into the k2 window.

The customization tools here are handy: http://www.willus.com/k2pdfopt/help/mac.shtml

For k2pdfopt without compression:

k2pdfopt -mode copy

I hit enter after the codes here, and then drag and drop the input file into the k2 window.

The customization tools here are handy: http://www.willus.com/k2pdfopt/help/mac.shtml
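For several files at once, the two k2pdfopt variants could be wrapped roughly like this. It's a sketch, not something I've tested across versions: the names k2_args and k2_batch are made up, and I'm assuming -x makes k2pdfopt exit when it finishes instead of waiting at its menu, so check that against the help page above.

```shell
#!/bin/bash
# Print the k2pdfopt options used in this post.
# $1 = "compress" for the Kindle DX preset, anything else for plain copy mode.
# k2_args and k2_batch are hypothetical helper names.
k2_args() {
    if [ "$1" = "compress" ]; then
        echo "-mode copy -dev dx"
    else
        echo "-mode copy"
    fi
}

# Run k2pdfopt on every file given after the mode.
# ASSUMPTION: -x exits on completion rather than showing the interactive menu.
k2_batch() {
    local mode="$1" f
    shift
    for f in "$@"; do
        # The unquoted $(...) is intentional so the options split into words.
        k2pdfopt $(k2_args "$mode") -x "$f"
    done
}
```

Usage: k2_batch compress first.pdf second.pdf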

For ghostscript to convert:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

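Output appears in the terminal's current folder, which is your user folder in a new terminal window. Note that -dPDFSETTINGS=/screen downsamples images heavily; /ebook or /printer keep more detail at a larger file size.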

Modified from instructions here: http://www.spoonylife.org/level-3/co...to-1-5-1-6-etc
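To confirm the conversion did what you wanted, you can read the version from the first bytes of the file, since a pdf starts with a header like %PDF-1.4. A minimal sketch, with pdf_version as a made-up function name:

```shell
#!/bin/bash
# Print the version from a PDF's header, e.g. "1.4" for a file starting "%PDF-1.4".
# pdf_version is a hypothetical helper name.
pdf_version() {
    head -c 8 "$1" | sed -n 's/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}
```

Usage: pdf_version output.pdf, which should print 1.4 after the Ghostscript command above.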

For Automator:

I haven't figured out how to use Automator with the other tools yet, but I use it to simplify that Ghostscript script.

I created an app with a single step: run shell script. "shell" is "/bin/bash" and "pass input" is "as arguments"; the actual code is:

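# Convert each dropped pdf to pdf 1.4 with Ghostscript.
for f in "$@"
do
    # Output name: the original base name plus "-converted.pdf"
    suffix="-converted.pdf"
    base=$(basename "$f" .pdf)
    outputfile="$base$suffix"
    # -sstdout=%sstderr sends Ghostscript's messages to stderr.
    # Homebrew on Apple Silicon installs gs in /opt/homebrew/bin instead.
    /usr/local/bin/gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%sstderr -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$outputfile" "$f"
done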

I can just drag files onto the app icon and Ghostscript converts them to 1.4, converting any jpeg2000 images to jpeg.

Output should appear in your user folder.
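If you'd prefer the converted file to land next to the original rather than in the user folder, the output name can be built from the input's folder. This is a possible variation, not what my app does; converted_path is a made-up function name.

```shell
#!/bin/bash
# Build an output path beside the source file:
# /a/b/book.pdf -> /a/b/book-converted.pdf
# converted_path is a hypothetical helper name.
converted_path() {
    local dir base
    dir=$(dirname "$1")
    base=$(basename "$1" .pdf)
    printf '%s/%s-converted.pdf\n' "$dir" "$base"
}
```

In the Automator loop this would go in as -sOutputFile="$(converted_path "$f")".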

Anyway, I hope this helps.

#1 · Mac to Kindle 2 and Other Older Readers

Last edited by MarjaE; 03-13-2018 at 12:17 PM.