12-24-2012, 03:08 AM   #1
Geremia
Addict
Posts: 235
Karma: 100000
Join Date: Oct 2012
Device: Calibre
Only Convert PDFs with embedded OCRed text to EPUB?

Is there a way to find only PDFs that have embedded text (such as OCRed PDFs) and only convert those PDFs to EPUB? Thanks
12-24-2012, 09:46 AM   #2
theducks
Well trained by Cats
Posts: 29,784
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
An OCR'd PDF is text.

Image-only PDFs may need to be OCR'd.
So is your question: "How do I differentiate text PDFs from image PDFs?"

I would expect an image-filled PDF file to be quite a bit larger.
12-24-2012, 12:26 PM   #3
Geremia
Quote:
Originally Posted by theducks
An OCR'd PDF is text.

Image-only PDFs may need to be OCR'd.
So is your question: "How do I differentiate text PDFs from image PDFs?"
Exactly!
Quote:
Originally Posted by theducks
I would expect an image-filled PDF file to be quite a bit larger.
Yes, that's certainly one way to guess, but I have text-only PDFs in my library that are 16 MB or more, and image PDFs that are smaller than that. Since it seems Calibre can't do this itself, I wrote a Bash script for Linux that uses the pdfimages command:
Code:
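#!/bin/bash

# This script prints the path of the PDF given as $1 if it contains no images.
# Run over a whole Calibre library, it lists all PDFs lacking images.
#
# Run it with this:
# find ~/Calibre\ Library/ -iname "*.pdf" -print0 | xargs -0 -I{} ./pdf_no_images.bash {} 2> /dev/null > "PDFs lacking images.txt"

# "pdfimages -list" prints two header lines, then one line per image,
# so any output after line 2 means the PDF contains at least one image.
images=$(pdfimages -list "$1" | awk 'NR > 2')

if [ -z "$images" ]; then
    echo "$1"
fi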
It works well. The only drawback is that I have to go through the "PDFs lacking images.txt" file manually and convert those PDFs to EPUB, but this would certainly be an easy task to code into a Calibre plugin. I should learn how to do that.
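The manual step could probably be scripted without a plugin, since Calibre ships a command-line converter, ebook-convert, that takes an input file and an output file. A minimal sketch, assuming ebook-convert is on the PATH; the function name, the CONVERTER variable, and the output-folder argument are made up for illustration:

```shell
#!/bin/bash
# Sketch: convert every PDF listed in a file (one path per line) to EPUB.
# Assumption: Calibre's ebook-convert is on PATH. CONVERTER is a hypothetical
# override hook so a different command can be swapped in.
CONVERTER="${CONVERTER:-ebook-convert}"

convert_listed_pdfs() {
    local list_file="$1" out_dir="$2" pdf epub
    mkdir -p "$out_dir"
    # IFS= and -r keep spaces and backslashes in paths intact.
    while IFS= read -r pdf; do
        [ -n "$pdf" ] || continue                      # skip blank lines
        epub="$out_dir/$(basename "${pdf%.*}").epub"   # same name, .epub extension
        # stdin is redirected so the converter cannot swallow the rest of the list.
        "$CONVERTER" "$pdf" "$epub" < /dev/null
    done < "$list_file"
}

# Only runs when called with a list file and an output folder, e.g.:
#   ./convert_listed_pdfs.bash "PDFs lacking images.txt" ~/epub_out
if [ "$#" -ge 2 ]; then
    convert_listed_pdfs "$1" "$2"
fi
```

The EPUBs are written to a separate folder rather than into the library folders, so they would still need to be added to Calibre afterwards.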
12-24-2012, 01:11 PM   #4
dgatwood
Curmudgeon
Posts: 629
Karma: 1623086
Join Date: Jan 2012
Device: iPad, iPhone, Nook Simple Touch
Better to use something like pdftotext and see if it returns nothing. PDF files might contain both images *and* text, and I'm assuming you probably want to convert those as well.
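A rough sketch of that idea, assuming pdftotext (from the same poppler tools as pdfimages) is installed. The function names are made up for illustration. Even an image-only PDF tends to produce page-break characters, so the check looks for any non-whitespace character rather than for empty output:

```shell
#!/bin/bash
# Sketch: print the path of the PDF given as $1 only if text can be extracted.
# Assumption: pdftotext is installed. Function names are illustrative.

# Succeeds if stdin contains at least one non-whitespace character.
has_text() {
    grep -q '[^[:space:]]'
}

# Succeeds if pdftotext extracts any visible text from the given PDF.
pdf_has_text() {
    # "-" makes pdftotext write to stdout instead of creating a .txt file.
    pdftotext "$1" - 2> /dev/null | has_text
}

if [ "$#" -gt 0 ] && pdf_has_text "$1"; then
    echo "$1"
fi
```

It can be driven by the same find/xargs line as the pdfimages script, and it also catches mixed text/image PDFs.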
12-24-2012, 03:33 PM   #5
Geremia
Quote:
Originally Posted by dgatwood
Better to use something like pdftotext and see if it returns nothing. PDF files might contain both images *and* text, and I'm assuming you probably want to convert those as well.
Oh yes, that's a good point. What I wrote above only finds pure-text PDFs, not mixed text/image ones like the PDFs from Archive.org. I don't think I have many, if any, mixed text/image PDFs, but all my DJVU books are that way. PDFs from Google Books or HathiTrust are mostly images, but they do have a small amount of text for copyright, etc., so making a script ignore that would be more complex.
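One way to ignore that small amount of text might be to require a minimum average amount of text per page. This is only a sketch: it assumes pdftotext and pdfinfo are installed, the function names are made up, and the threshold of 200 is an arbitrary starting value that would need tuning against a real library:

```shell
#!/bin/bash
# Sketch: print the path of the PDF given as $1 only if it averages enough
# extractable text per page. Assumptions: pdftotext and pdfinfo are installed;
# MIN_CHARS_PER_PAGE=200 is an arbitrary guess, not a tested value.
MIN_CHARS_PER_PAGE="${MIN_CHARS_PER_PAGE:-200}"

# Succeeds if chars / pages is at least min (integer arithmetic, no division).
enough_text() {
    local chars="$1" pages="$2" min="$3"
    [ "$pages" -gt 0 ] && [ "$chars" -ge $((pages * min)) ]
}

pdf_is_mostly_text() {
    local chars pages
    # Count non-whitespace bytes of extracted text (bytes, not characters).
    chars=$(pdftotext "$1" - 2> /dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
    # pdfinfo prints a line such as "Pages:          12".
    pages=$(pdfinfo "$1" 2> /dev/null | awk '/^Pages:/ {print $2}')
    enough_text "${chars:-0}" "${pages:-0}" "$MIN_CHARS_PER_PAGE"
}

if [ "$#" -gt 0 ] && pdf_is_mostly_text "$1"; then
    echo "$1"
fi
```

A scanned book with only a copyright notice as text should fall well below the threshold, while a real text PDF should clear it easily.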