01-21-2012, 11:56 PM | #1 |
Junior Member
Posts: 5
Karma: 66
Join Date: Nov 2010
Location: Canada
Device: iPad, Sony PRS-T1
|
PDF Batch Text/Image Identifier
Hey all,
So I've found a decent PDF-to-EPUB converter (Wondershare, $29.99), and I've been using it to convert my text-based PDFs. So far, I've been manually opening each PDF, scrolling down, and trying to select something to see whether I a) highlight text (making it a text-based PDF that's OK to convert) or b) just draw a highlighted box (making it an image-based PDF that must be put through an OCR program first). I'm wondering if there is a program that can do this (identify whether each PDF is text- or image-based) for large numbers of files, so I can speed up the process in the future. Cheers, Owlman. |
01-22-2012, 02:09 PM | #2 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Ugh... PDF is the worst possible format to convert FROM. It was designed as an output format. This subject has been beaten to death around here because a lot of PDFs aren't tagged PDFs - meaning that letters (and a lot of times small groups of letters) are something like floating objects on a blank page, each with its own coordinates and extra baggage. So it's very difficult to get a 1:1 conversion. A lot of formatting will be lost, some will get interpreted wrong, etc... Doing this in batches without taking the time to do a proper check is a bad idea. Why do you need them as ePub anyway, knowing that you could ruin the formatting?
The closest thing to what I think you're looking for is Adobe Acrobat's "Save As - Optimized PDF - Audit space usage". An information window will pop up, and if it says that images take up some crazy amount like 98-100%, chances are that the PDF is "image-based". But then again, a book that's chock-full of pictures will show the same thing, and the file size is usually a good indicator too... And you don't need to spend $200 for that, you could simply right-click the PDF file and choose Properties. Also, in any PDF viewer you could press Ctrl+A to select everything and just scroll down a few pages. I'd say if the text in the first 10 pages or so is highlighted in blue (or whatever theme you have set), it's "text-based". If some pages are images and some are text, then it's a sh*tty PDF. |
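For the batch side of the question, the "check the first 10 pages or so" test above could be scripted. What follows is only a rough sketch, not something anyone in this thread has run on their library. It assumes the third-party pypdf package is installed (pip install pypdf), and the function names and the thresholds (50 characters per page, 80% of sampled pages) are made up for illustration, not tuned values.

```python
from pathlib import Path


def classify_pages(page_texts, min_chars=50, min_text_ratio=0.8):
    """Label a PDF from the text extracted from its sampled pages.

    page_texts holds one string per sampled page ("" when nothing was
    extracted). Both thresholds are illustrative guesses.
    """
    if not page_texts:
        return "unknown"
    # Count sampled pages that yielded a meaningful amount of text.
    pages_with_text = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    ratio = pages_with_text / len(page_texts)
    if ratio >= min_text_ratio:
        return "text"
    if pages_with_text == 0:
        return "image"
    return "mixed"


def sample_page_texts(pdf_path, max_pages=10):
    """Extract text from the first max_pages pages of one PDF."""
    # Imported here so classify_pages can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    texts = []
    for index, page in enumerate(reader.pages):
        if index >= max_pages:
            break
        texts.append(page.extract_text() or "")
    return texts


def classify_folder(folder, max_pages=10):
    """Return {file name: label} for every .pdf directly inside folder."""
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            texts = sample_page_texts(pdf_path, max_pages)
            results[pdf_path.name] = classify_pages(texts)
        except ImportError:
            raise  # pypdf is missing; no point continuing
        except Exception as exc:  # damaged or encrypted files
            results[pdf_path.name] = f"error: {exc}"
    return results


if __name__ == "__main__":
    # "." is a placeholder; point it at the folder holding the PDFs.
    for name, label in classify_folder(".").items():
        print(f"{label:8} {name}")
```

Files labelled "image" would be the ones to send through OCR first, and "mixed" ones deserve a manual look. Two caveats: a scanned PDF that has already been OCR'd carries a hidden text layer and will come out as "text", and books that open with several blank or cover pages may be misjudged unless max_pages is raised.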
01-22-2012, 09:12 PM | #3 |
Junior Member
Posts: 5
Karma: 66
Join Date: Nov 2010
Location: Canada
Device: iPad, Sony PRS-T1
|
I need them in EPUB because (so far) my ebook reader can't resize PDF fonts, and resizing is what makes it possible for me to read them.
I realize it's pretty much an unresolved issue, but so far this converter has treated me OK - though I have to admit I haven't gone through every new EPUB it makes. Thanks DSpider, I'll give it a look. |
Tags |
batch, identifier, pdf |