Quote:
Originally Posted by theducks
OCR'd is text.
Image only PDF's may need to be OCR'd.
So is your question: "how do I differentiate Text PDF's from Image PDF's?"
|
Exactly!
Quote:
Originally Posted by theducks
I would expect that a Image filled PDF file might be quite a bit larger.
|
Yes, that's certainly one way to guess, but I have text-only PDFs in my library that are 16 MB or more, and image ones that are smaller than that. Since it seems Calibre can't do this itself, I wrote a Bash script on Linux that uses the pdfimages command:
Code:
#!/bin/bash
# This script will find all PDFs lacking images in a Calibre library
#
# Run it with this:
# find ~/Calibre\ Library/ -iname "*.pdf" -print0 | xargs -0 -I{} ./pdf_no_images.bash {} 2> /dev/null > "PDFs lacking images.txt"
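# pdfimages -list prints two header lines, then one row per embedded image,
# so any output past line 2 means the PDF contains at least one image.
images=$(pdfimages -list "$1" | awk 'NR > 2 {print $2}')
if [ -z "$images" ]; then
    echo "$1"
fi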
It works well. The only problem is that I have to go through the "PDFs lacking images.txt" file manually and convert those PDFs to EPUB. That would certainly be an easy task to code into a Calibre plugin, and I should learn how to do that.
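Until then, the manual step could probably be scripted as well. Here is a rough sketch, assuming Calibre's ebook-convert command-line tool is installed and on the PATH. The function names, the output directory, and the script name in the usage line are made up for illustration, and no conversion options are tuned:

```shell
#!/bin/bash
# Sketch: convert every PDF listed in a text file (one path per line) to EPUB.
# Assumption: Calibre's ebook-convert command-line tool is on the PATH.

# Build the output path: <outdir>/<PDF file name with its extension swapped for .epub>
epub_name_for() {
    local base
    base=$(basename "$1")
    printf '%s\n' "$2/${base%.*}.epub"
}

# Read the list line by line and convert each PDF that has no EPUB yet.
convert_list() {
    local list="$1" outdir="$2" pdf epub
    mkdir -p "$outdir"
    while IFS= read -r pdf; do
        [ -f "$pdf" ] || continue      # skip blank lines and stale entries
        epub=$(epub_name_for "$pdf" "$outdir")
        [ -e "$epub" ] && continue     # do not overwrite an earlier conversion
        # Redirect stdin so the converter cannot swallow the rest of the list.
        ebook-convert "$pdf" "$epub" < /dev/null
    done < "$list"
}

# Run only when a list file is given; the output directory defaults to ./epub-out.
if [ -n "$1" ]; then
    convert_list "$1" "${2:-./epub-out}"
fi
```

Saved as, say, pdf_to_epub.bash, it would be run as: ./pdf_to_epub.bash "PDFs lacking images.txt" ~/epub-out. The EPUBs go to a separate folder rather than into the library itself, because files dropped into the library folders are not registered in the Calibre database. They would still have to be added through the GUI or with calibredb.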