12-24-2012, 03:08 AM | #1 |
Addict
Posts: 235
Karma: 100000
Join Date: Oct 2012
Device: Calibre
|
Only Convert PDFs with embedded OCRed text to EPUB?
Is there a way to find only PDFs that have embedded text (such as OCRed PDFs) and only convert those PDFs to EPUB? Thanks
|
12-24-2012, 09:46 AM | #2 |
Well trained by Cats
Posts: 29,795
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
OCR'd is text.
Image-only PDFs may need to be OCR'd. So is your question: "How do I differentiate text PDFs from image PDFs?" I would expect an image-filled PDF file to be quite a bit larger.
|
12-24-2012, 12:26 PM | #3 |
Addict
Posts: 235
Karma: 100000
Join Date: Oct 2012
Device: Calibre
|
Code:
#!/bin/bash
# This script finds all PDFs lacking images in a Calibre library.
#
# Run it with this:
# find ~/Calibre\ Library/ -iname "*.pdf" -print0 | xargs -0 -I{} ./pdf_no_images.bash {} 2> /dev/null > "PDFs lacking images.txt"

# pdfimages -list prints two header lines, then one row per image;
# column 2 is the image number.
images=$(pdfimages -list "$1" | awk 'NR > 2 {print $2}')

# No rows after the header means the PDF contains no images.
if [ -z "$images" ]; then
    echo "$1"
fi
|
12-24-2012, 01:11 PM | #4 |
Curmudgeon
Posts: 629
Karma: 1623086
Join Date: Jan 2012
Device: iPad, iPhone, Nook Simple Touch
|
It would be better to run something like pdftotext on each file and check whether the output is empty. PDF files might contain both images *and* text, and I'm assuming you probably want to convert those as well.
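A rough sketch of that idea, assuming pdftotext from poppler-utils is installed. The function name has_text and the script name used further down are made up for illustration; this is not a tested tool.

```shell
#!/bin/bash
# Sketch: list PDFs that have an extractable text layer.
# Assumes pdftotext (poppler-utils) is on the PATH.
# has_text is a hypothetical helper name, not part of any tool.

has_text() {
    # "-" as the output file sends the extracted text to stdout.
    # pdftotext emits form feeds between pages even when a page has no
    # text, so look for at least one non-whitespace character.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Print every file given on the command line that has some text.
for f in "$@"; do
    if has_text "$f"; then
        echo "$f"
    fi
done
```

Saved as, say, pdf_has_text.bash, it could be run the same way as the script above: find ~/Calibre\ Library/ -iname "*.pdf" -print0 | xargs -0 ./pdf_has_text.bash > "PDFs with text.txt"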
|
12-24-2012, 03:33 PM | #5 |
Addict
Posts: 235
Karma: 100000
Join Date: Oct 2012
Device: Calibre
|
Oh yes, that's a good point. What I wrote above only finds pure-text PDFs, not mixed text/image ones like the PDFs from Archive.org. I don't think I have many, if any, mixed text/image PDFs, but all my DJVU books are that way. PDFs from Google Books or HathiTrust are mostly images, but they do have a small amount of text for copyright, etc., so making a script ignore that would be more complex.
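One way to skip the mostly-image PDFs might be to require a minimum amount of text per page instead of any text at all. A sketch, assuming pdfinfo and pdftotext from poppler-utils are installed; the name mostly_text and the threshold of 200 are arbitrary guesses that would need tuning against a real library.

```shell
#!/bin/bash
# Sketch: list PDFs whose text layer averages enough text per page.
# Assumes pdfinfo and pdftotext (poppler-utils) are on the PATH.
# mostly_text and MIN_CHARS_PER_PAGE are hypothetical; 200 is untuned.
MIN_CHARS_PER_PAGE=${MIN_CHARS_PER_PAGE:-200}

mostly_text() {
    # pdfinfo prints a line such as "Pages:          12".
    pages=$(pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}')
    # Count bytes of extracted text with all whitespace stripped.
    chars=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
    # No usable page count means pdfinfo could not read the file.
    [ -n "$pages" ] && [ "$pages" -gt 0 ] || return 1
    [ "$chars" -ge $((pages * MIN_CHARS_PER_PAGE)) ]
}

# Print every file given on the command line that passes the check.
for f in "$@"; do
    if mostly_text "$f"; then
        echo "$f"
    fi
done
```

A scan that only carries a short copyright notice should fall well under the threshold, while an OCRed book should clear it easily. Books with many blank or picture-only pages could still be misjudged.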
|