Only Convert PDFs with embedded OCRed text to EPUB?

Geremia · 12-24-2012, 03:08 AM

Is there a way to find only PDFs that have embedded text (such as OCRed PDFs) and only convert those PDFs to EPUB? Thanks

theducks · 12-24-2012, 09:46 AM

OCR'd is text.

Image-only PDFs may need to be OCR'd.
So is your question: "How do I differentiate text PDFs from image PDFs?"

I would expect that an image-filled PDF file might be quite a bit larger.

Geremia · 12-24-2012, 12:26 PM

Quote:

Originally Posted by theducks

OCR'd is text.

Image-only PDFs may need to be OCR'd.
So is your question: "How do I differentiate text PDFs from image PDFs?"

Exactly!

Quote:

Originally Posted by theducks

I would expect that an image-filled PDF file might be quite a bit larger.

Yes, that's certainly one way to guess, but I have text-only PDFs in my library that are 16 MB or more, and image ones that are smaller than that. Since it seems Calibre can't do this itself, I wrote a Bash script on Linux that uses the pdfimages command:

Code:

#!/bin/bash

# This script will find all PDFs lacking images in a Calibre library
#
# Run it with this:
# find ~/Calibre\ Library/ -iname "*.pdf" -print0 | xargs -0 -I{} ./pdf_no_images.bash {} 2> /dev/null > "PDFs lacking images.txt"

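# Column 2 of "pdfimages -list" is the image number. The first image is
# numbered 0, so any match here means the PDF contains at least one image.
images=$(pdfimages -list "$1" | awk '{print $2}' | grep 0)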

if [ -z "$images" ]; then
    echo "$1"
fi

It works well. The only problem is that I have to go through the "PDFs lacking images.txt" file by hand and convert those PDFs to EPUB, but this would certainly be an easy task to code into a Calibre plugin. I should learn how to do that.
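Until there is a plugin, the manual step could probably be scripted as well. Here is a minimal sketch, assuming Calibre's ebook-convert command-line tool is on the PATH and that the list file holds one PDF path per line. The function name convert_list is made up for this example.

```shell
#!/bin/bash
# Sketch: convert every PDF named in a list file (one path per line) to EPUB.
# Assumes Calibre's ebook-convert command-line tool is on the PATH.

convert_list() {
    local pdf
    # IFS= and -r keep leading spaces and backslashes in paths intact.
    while IFS= read -r pdf; do
        # Skip blank lines.
        [ -n "$pdf" ] || continue
        # Write the EPUB next to the source PDF: book.pdf -> book.epub.
        # stdin is redirected so the converter cannot swallow the list.
        ebook-convert "$pdf" "${pdf%.*}.epub" < /dev/null
    done < "$1"
}

if [ "$#" -gt 0 ]; then
    convert_list "$1"
fi
```

Saved as, say, convert_list.bash, it would be run as ./convert_list.bash "PDFs lacking images.txt". Note that the EPUBs only land beside the PDFs on disk; Calibre's database will not know about them until they are added, for example with calibredb add.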

dgatwood · 12-24-2012, 01:11 PM

Better to use something like pdftotext and see if it returns nothing. PDF files might contain both images *and* text, and I'm assuming you probably want to convert those as well.
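A rough sketch of that check, assuming pdftotext from poppler-utils (or xpdf) is installed. The function name has_text is made up for this example. Whitespace is stripped first because pdftotext still emits page-break characters for pages with no text.

```shell
#!/bin/bash
# Sketch: print each PDF given as an argument that has extractable text.
# Assumes pdftotext (poppler-utils or xpdf) is on the PATH.

has_text() {
    local text
    # "-" sends the extracted text to stdout instead of a .txt file.
    # Remove all whitespace, including the form feeds between pages,
    # and see whether anything is left.
    text=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
}

for pdf in "$@"; do
    if has_text "$pdf"; then
        echo "$pdf"
    fi
done
```

It can be driven by the same find/xargs pipeline as the pdfimages script, with the output redirected to a list file.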

Geremia · 12-24-2012, 03:33 PM

Quote:

Originally Posted by dgatwood

Better to use something like pdftotext and see if it returns nothing. PDF files might contain both images *and* text, and I'm assuming you probably want to convert those as well.

Oh yes, that's a good point. What I wrote above only finds pure-text PDFs, not mixed text/image ones like the PDFs from Archive.org. I don't think I have many, if any, mixed text/image PDFs, but all my DjVu books are that way. PDFs from Google Books or HathiTrust are mostly images, but they do have a small amount of text (copyright notices, etc.), so making a script ignore that would be more complex.
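One way to ignore those few lines of boilerplate text might be to look at the average amount of text per page rather than whether any text exists at all. This is only a sketch: it assumes pdfinfo and pdftotext from poppler-utils are installed, the function name text_per_page is made up, and the 200-byte threshold is a guess that would need tuning against a real library.

```shell
#!/bin/bash
# Sketch: list PDFs whose extracted text averages enough bytes per page.
# MIN_PER_PAGE is a guess, not a measured value; override it to tune.
MIN_PER_PAGE=${MIN_PER_PAGE:-200}

text_per_page() {
    local pages bytes
    # pdfinfo prints a line such as "Pages:          12".
    pages=$(pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}')
    # Bytes of extracted text once all whitespace is removed.
    bytes=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
    # Unreadable file or zero pages: report 0.
    if [ -z "$pages" ] || [ "$pages" -eq 0 ]; then
        echo 0
    else
        echo $(( bytes / pages ))
    fi
}

for pdf in "$@"; do
    if [ "$(text_per_page "$pdf")" -ge "$MIN_PER_PAGE" ]; then
        echo "$pdf"
    fi
done
```

A scanned book with a one-page text notice averages only a few bytes per page and falls below the threshold, while an OCRed or born-digital book should land well above it.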
