01-21-2012, 11:56 PM   #1
owlman112
Junior Member
Posts: 5
Karma: 66
Join Date: Nov 2010
Location: Canada
Device: iPad, Sony PRS-T1
PDF Batch Text/Image Identifier

Hey all,

So I've found a decent PDF-to-EPUB converter (Wondershare, $29.99), and I've been using it to convert my text-based PDFs.

So far, I've been manually opening each PDF, scrolling down, and seeing whether I can a) highlight text (making it a text-based PDF that is OK to convert) or
b) only draw a highlighted box (making it an image-based PDF that must be put through an OCR program first).

I'm wondering if there is a program that can do this (identify whether each PDF is text- or image-based) for large amounts of files, so I can speed up the process in the future.

Cheers,

Owlman.
01-22-2012, 02:09 PM   #2
DSpider
Evangelist
Posts: 407
Karma: 326969
Join Date: Nov 2009
Location: Romania
Device: iPod touch 2G (16 GB)
Ugh... PDF is the worst possible format to convert FROM. It was designed as an output format. This subject has been beaten to death around here because a lot of PDFs aren't tagged PDFs, meaning that letters (and often small groups of letters) are something like floating objects on a blank page, each with its own coordinates and extra baggage. So it's very difficult to get a 1:1 conversion. A lot of formatting will be lost, some will get interpreted wrong, etc... Doing this in batches without taking the time to do a proper check is a bad idea. Why do you need them as ePub anyway, knowing that you could ruin the formatting?


The closest thing to what I think you're looking for is Adobe Acrobat's "Save As - Optimized PDF - Audit space usage". An information window will pop up, and if it says that images take up some crazy amount like 98-100%, chances are that the PDF is "image-based". But then again, if the book is chock-full of pictures, the file size is usually a good indicator too... And you don't need $200 for that; you could simply right-click the PDF file and choose Properties.
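If you'd rather script a rough first pass than click through Acrobat, here is a minimal Python sketch. To be clear, it is not Acrobat's space audit: it only counts two standard PDF markers in the raw bytes, image XObjects (/Subtype /Image) and font references (/Font), and guesses "image-based" when it sees images but no fonts. The function names are made up for illustration. It assumes the PDF keeps its object dictionaries uncompressed; newer PDFs can pack them into compressed object streams, where this scan sees nothing, and a scan that already carries a hidden OCR text layer will contain fonts too. Treat the result as a hint only.

```python
import re

# Crude byte-level markers; both are standard PDF name objects.
# \b keeps /Image from matching /ImageMask and /Font from matching
# /FontDescriptor or /FontFile.
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")
FONT_MARKER = re.compile(rb"/Font\b")


def count_markers(data: bytes) -> dict:
    """Count image XObjects and font references visible in the raw bytes."""
    return {
        "images": len(IMAGE_MARKER.findall(data)),
        "fonts": len(FONT_MARKER.findall(data)),
    }


def looks_image_based(data: bytes) -> bool:
    """Guess 'image-based' when images are present but no fonts are."""
    counts = count_markers(data)
    return counts["images"] > 0 and counts["fonts"] == 0
```

You would feed it the file contents, e.g. looks_image_based(open("book.pdf", "rb").read()), where "book.pdf" is just a placeholder name.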

Also, in any PDF viewer you could press Ctrl+A to select everything and just scroll down a few pages. I'd say if the text in the first 10 pages or so is highlighted in blue (or whatever theme you have set), it's "text-based".

If some pages are images, some are text, then it's a sh*tty PDF.
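The "select all and check the first 10 pages" test can be roughed out in Python for a whole folder. This is only a sketch under some assumptions: it uses the third-party pypdf library (PdfReader and extract_text) to pull text from each page, the names classify, sample_page_texts and classify_folder are made up for illustration, and the 20-character threshold is an arbitrary guess rather than any standard. A page counts as "text" if at least that many characters can be extracted; a file with both kinds of pages is labelled "mixed".

```python
from itertools import islice
from pathlib import Path


def classify(page_texts, min_chars=20):
    """Label a PDF from the text extracted from its sampled pages.

    page_texts: list of strings, one per sampled page (empty string if
    nothing could be extracted). min_chars is an arbitrary threshold.
    """
    if not page_texts:
        return "unknown"
    has_text = [len(t.strip()) >= min_chars for t in page_texts]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "image"
    return "mixed"


def sample_page_texts(path, max_pages=10):
    """Extract text from the first max_pages pages using pypdf."""
    # Imported lazily so classify() works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    return [(page.extract_text() or "") for page in islice(reader.pages, max_pages)]


def classify_folder(folder, max_pages=10):
    """Map each PDF filename in folder to 'text', 'image', 'mixed' or 'unknown'."""
    return {
        pdf.name: classify(sample_page_texts(pdf, max_pages))
        for pdf in sorted(Path(folder).glob("*.pdf"))
    }
```

Calling classify_folder on a directory of PDFs gives a filename-to-label dictionary, so anything labelled "image" or "mixed" can be sent to OCR first. Encrypted or damaged files may raise an error in pypdf, and a scan with a hidden OCR text layer will come out as "text", so a spot check is still worth doing.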
01-22-2012, 09:12 PM   #3
owlman112
I need them in EPUB because (so far) my ebook reader can't resize PDF fonts, and being able to resize the text is what makes it possible for me to read them.

I realize it's pretty much an unresolved issue, but so far this one has treated me OK - but I have to admit I haven't gone through every new EPUB it makes.

Thanks, DSpider, I'll give it a look.
Tags
batch, identifier, pdf

Thread Tools Search this Thread
Search this Thread:

Advanced Search

Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Did you encounter a same problem about converting PDF to text/word/image Ivymin PDF 3 11-29-2011 08:45 PM
PDF Text AND Page Image.. wierd.. mathewb Sony Reader 0 07-08-2010 02:46 PM
PDF virtual printer as text not image mowbray Amazon Kindle 7 02-05-2010 12:32 PM
PDF Image -> OCR -> text frikk Workshop 9 07-08-2009 07:21 PM
Batch Image Convertor - Giveaway of the day - so hurry TetraKM Deals, Freebies, and Resources (No Self-Promotion) 2 10-10-2008 04:41 AM


All times are GMT -4. The time now is 08:44 PM.


MobileRead.com is a privately owned, operated and funded community.