MobileRead Forums - View Single Post - Only Convert PDFs with embedded OCRed text to EPUB?

Geremia · 12-24-2012, 12:26 PM

Quote:

Originally Posted by theducks

OCR'd is text.

Image-only PDFs may need to be OCR'd.
So is your question: "How do I differentiate text PDFs from image PDFs?"

Exactly!

Quote:

Originally Posted by theducks

I would expect that an image-filled PDF file might be quite a bit larger.

Yes, that's certainly one way to guess, but I have text-only PDFs in my library that are 16 MB or more and image-only ones that are smaller, so size alone isn't reliable. Since it seems Calibre can't do this itself, I wrote a Bash script on Linux that uses the pdfimages command:

Code:

#!/bin/bash

# This script will find all PDFs lacking images in a Calibre library
#
# Run it with this:
# find ~/Calibre\ Library/ -iname "*.pdf" -print0 | xargs -0 -I{} ./pdf_no_images.bash {} 2> /dev/null > "PDFs lacking images.txt"

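# "pdfimages -list" prints two header lines, then one row per embedded image.
# Whatever remains after dropping the header means the PDF contains images.
images=$(pdfimages -list "$1" | tail -n +3)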

if [ -z "$images" ]; then
    echo "$1"
fi

It works well. The only drawback is that I still have to go through the "PDFs lacking images.txt" file by hand and convert those PDFs to EPUB, but this would certainly be easy to code into a Calibre plugin. I should learn how to do that.
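Until there's a plugin, the manual step could probably be scripted too. Here is a rough sketch, not something I've run against a real library: it reads the list file one path per line and hands each PDF to Calibre's ebook-convert command-line tool, which I'm assuming is on the PATH. The function name convert_listed_pdfs and the CONVERTER variable are made up for illustration (CONVERTER just lets you swap in a dummy command for a dry run). It writes each EPUB next to its PDF and does not register the new format in the Calibre database.

```shell
#!/bin/bash
# Sketch: convert every PDF listed in a file (one path per line) to EPUB.
# Assumes Calibre's "ebook-convert" CLI is installed and on the PATH.
# CONVERTER is a hypothetical override, handy for dry runs.

convert_listed_pdfs() {
    local list_file="$1"
    local converter="${CONVERTER:-ebook-convert}"
    local pdf
    while IFS= read -r pdf; do
        # Skip blank lines and paths that no longer exist.
        [ -f "$pdf" ] || continue
        # Write the EPUB beside the source PDF, keeping the base name.
        # stdin is redirected so the converter cannot eat the list.
        "$converter" "$pdf" "${pdf%.*}.epub" < /dev/null
    done < "$list_file"
}

# Only run when a list file is passed on the command line.
if [ "$#" -gt 0 ]; then
    convert_listed_pdfs "$1"
fi
```

Saved as, say, convert_list.bash, it would be run as: ./convert_list.bash "PDFs lacking images.txt"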