Old 12-24-2012, 12:26 PM   #3
Geremia
Quote:
Originally Posted by theducks
OCR'd is text.

Image only PDF's may need to be OCR'd.
So is your question: "how do I differentiate Text PDF's from Image PDF's?"
Exactly!
Quote:
Originally Posted by theducks
I would expect that a Image filled PDF file might be quite a bit larger.
Yes, that's certainly one way to guess, but I have text-only PDFs in my library that are 16 MB or more, and image ones that are smaller than that. Since it seems Calibre can't do this itself, I wrote a Bash script for Linux that uses the pdfimages command:
Code:
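#!/bin/bash

# This script prints the path of a PDF if it contains no embedded images.
# Run over a whole Calibre library, it lists all PDFs lacking images.
#
# Run it with this:
# find ~/Calibre\ Library/ -iname "*.pdf" -print0 | xargs -0 -I{} ./pdf_no_images.bash {} 2> /dev/null > "PDFs lacking images.txt"

# pdfimages -list prints a two-line header (column names and a dashed
# rule), then one row per image, so count only the rows after the header.
images=$(pdfimages -list "$1" | awk 'NR > 2 { n++ } END { print n + 0 }')

# No image rows: report this PDF.
if [ "$images" -eq 0 ]; then
    echo "$1"
fi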
It works well. The only drawback is that I still have to go through the "PDFs lacking images.txt" file by hand and convert those PDFs to EPUB, but that would certainly be an easy task to code into a Calibre plugin. I should learn how to do that.
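Short of a real plugin, the conversion step could probably be scripted too. Here is a minimal sketch, assuming Calibre's ebook-convert command-line tool is on the PATH. The epub-out output folder and the epub_path_for and convert_listed_pdfs helper names are made up for illustration, and I have not run this against a full library:

```shell
#!/bin/bash
# Sketch: convert every PDF named in "PDFs lacking images.txt" to EPUB.
# Assumption: Calibre's ebook-convert command-line tool is installed.

list="${1:-PDFs lacking images.txt}"   # list written by the script above
outdir="${2:-epub-out}"                # hypothetical output folder

# Build the output path: the PDF's base name with .epub, inside the output folder.
epub_path_for() {
    local pdf="$1" dir="$2"
    printf '%s/%s.epub\n' "$dir" "$(basename "${pdf%.[pP][dD][fF]}")"
}

# Convert each listed PDF, skipping missing files and existing EPUBs.
convert_listed_pdfs() {
    local pdf epub
    mkdir -p "$2"
    while IFS= read -r pdf; do
        [ -f "$pdf" ] || continue
        epub=$(epub_path_for "$pdf" "$2")
        [ -e "$epub" ] && continue
        # stdin comes from /dev/null so the converter cannot eat the list
        ebook-convert "$pdf" "$epub" < /dev/null || echo "failed: $pdf" >&2
    done < "$1"
}

# Only do anything if the list file is actually there.
if [ -f "$list" ]; then
    convert_listed_pdfs "$list" "$outdir"
fi
```

The EPUBs go into a separate folder rather than into the library itself, so they would still have to be added to Calibre afterwards. Two PDFs with the same file name would also collide in this sketch.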