#1 |
Addict
Posts: 256
Karma: 100000
Join Date: Oct 2012
Device: Calibre
|
Only Convert PDFs with embedded OCRed text to EPUB?
Is there a way to find only PDFs that have embedded text (such as OCRed PDFs) and only convert those PDFs to EPUB? Thanks
#2 |
Well trained by Cats
Posts: 30,880
Karma: 59840450
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
An OCR'd PDF is a text PDF.
Image-only PDFs may need to be OCR'd. So is your question "How do I differentiate text PDFs from image PDFs?" I would expect an image-filled PDF file to be quite a bit larger.
#3 | ||
Addict
Posts: 256
Karma: 100000
Join Date: Oct 2012
Device: Calibre
|
Quote:
Quote:
Code:

    #!/bin/bash
    # This script will find all PDFs lacking images in a Calibre library
    #
    # Run it with this:
    #   find ~/Calibre\ Library/ -iname "*.pdf" -print0 | xargs -0 -I{} ./pdf_no_images.bash {} 2> /dev/null > "PDFs lacking images.txt"

    # pdfimages -list prints two header lines, then one row per embedded image.
    # Skip the header; anything left over means the PDF contains images.
    images=$(pdfimages -list "$1" | awk 'NR > 2 {print $2}')
    if [ -z "$images" ]; then
        echo "$1"
    fi
#4 |
Curmudgeon
Posts: 629
Karma: 1623086
Join Date: Jan 2012
Device: iPad, iPhone, Nook Simple Touch
|
Better to use something like pdftotext and see if it returns nothing. PDF files might contain both images *and* text, and I'm assuming you probably want to convert those as well.
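A minimal sketch of that check, assuming `pdftotext` from poppler-utils is installed. The function name `has_text` and the script name used in the usage note are made up for illustration. It treats a PDF as having text if `pdftotext` extracts at least one non-whitespace character, so mixed text/image PDFs are included.

```shell
#!/bin/bash
# has_text (hypothetical helper): succeeds if pdftotext extracts any
# non-whitespace character from the PDF given as $1.
has_text() {
    # "-" writes the extracted text to stdout. The form feeds that
    # pdftotext emits between pages count as whitespace, so an
    # image-only PDF produces no match and grep returns non-zero.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# When called with an existing file, print its path only if it has text.
if [ -f "${1:-}" ] && has_text "$1"; then
    echo "$1"
fi
```

Saved as, say, `pdf_has_text.bash`, it could be driven by the same `find ... -print0 | xargs -0 -I{} ./pdf_has_text.bash {}` pipeline as the image script, and the resulting list passed to Calibre's `ebook-convert` one file at a time.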
#5 |
Addict
Posts: 256
Karma: 100000
Join Date: Oct 2012
Device: Calibre
|
Oh yes, that's a good point. What I wrote above only finds pure-text PDFs, not mixed text/image ones like the PDFs from Archive.org. I don't think I have many, if any, mixed text/image PDFs, but all my DJVU books are that way. PDFs from Google Books or HathiTrust are mostly images, but they do have a small amount of text for copyright, etc., so making a script ignore that would be more complex.
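One rough way to ignore that small amount of boilerplate text is to require a minimum amount of extracted text per page. This is only a sketch under assumptions: it needs `pdftotext` and `pdfinfo` from poppler-utils, the function names are invented for illustration, and the threshold of 200 is an arbitrary starting point that would need tuning against real files.

```shell
#!/bin/bash
# Assumption: a PDF is "mostly text" when it averages at least
# MIN_CHARS_PER_PAGE non-whitespace bytes per page. 200 is a guess.
MIN_CHARS_PER_PAGE=${MIN_CHARS_PER_PAGE:-200}

# chars_per_page (hypothetical helper): average non-whitespace bytes
# of extracted text per page for the PDF given as $1.
chars_per_page() {
    pages=$(pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}')
    chars=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c)
    # Unreadable files report no page count; treat them as having no text.
    if [ -z "$pages" ] || [ "$pages" -eq 0 ]; then
        echo 0
    else
        echo $(( chars / pages ))
    fi
}

# mostly_text (hypothetical helper): succeeds if the average meets the threshold.
mostly_text() {
    [ "$(chars_per_page "$1")" -ge "$MIN_CHARS_PER_PAGE" ]
}

# When called with an existing file, print its path only if it passes.
if [ -f "${1:-}" ] && mostly_text "$1"; then
    echo "$1"
fi
```

Note that `wc -c` counts bytes, so non-ASCII text is overcounted somewhat. A scan with a single copyright page averaged over hundreds of image pages should still fall well under the threshold.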
Tags: conversion from .pdf, epub, ocr, pdf, text