Free/Shareware PDF converters with OCR capability?


Thorkin
03-05-2009, 02:22 PM
I'm trying to convert a set of classic, illustrated children's books, Howard Pyle's books of Robin Hood, King Arthur, etc. (http://www.archive.org/details/merryadventureso00pylerich), from public-domain PDFs to ebooks I can read on my Kindle.

Problem is, they're image-based PDFs, and heavy with illustration. Some PDF converters can't process them at all; some strip out all the illustrations and just convert the text; some convert every page into an image, which leaves the images excellent (well, apart from the "digitized by" watermarks on every page, which I'd like to crop out) but makes the text too small to read easily. The only PDF converter I've found that seemed able to process them the way I'd like is ABBYY -- but that has a fifty-page limit on the trial version, which isn't enough for even one book, much less Pyle's collected works.

So as best I can figure out, I need a PDF converter that can OCR the text and also keep the various images. Anyone have any pointers? Thanks!

Elfwreck
03-11-2009, 05:28 PM
I don't think there are any shareware or free converters that will do the careful inclusion of both text & graphics. A few of them try, but tend to botch it. (And I'm not sure what those are; I remember trying to work with them and giving up and going back to FineReader.)

Adobe's Capture Reviewer was another versatile OCR program--but it was also expensive, and FR is better in many ways. (Not all ways. Capture Reviewer lets you set fonts and kerning; FineReader is atrocious at that.)

RWood
03-17-2009, 12:53 PM
A lot of what you are doing has already been done by the folks at Project Gutenberg. Try here (http://www.gutenberg.org/browse/authors/p#a491) for a listing of the books they have already converted. The zipped HTML files contain the images if available. I have used PG as the basis for many books I have posted.

Flogiston has already converted and posted several of Pyle's books to LRF format for the Sony. Robin is posted here (http://www.mobileread.com/forums/showthread.php?t=17570). Sadly, no PRC version for the Kindle.

Thorkin
03-20-2009, 10:27 AM
Thanks, that's a good resource -- I'd looked at Gutenberg before but dismissed it as the first few Pyle books I'd downloaded from there were pure text, without images -- every time I found an online copy of Robin Hood, I'd get excited at first, then see it was just pure text and get annoyed, then notice the "Project Gutenberg" tag :P

The only problem I see with the Gutenberg versions is that they don't include the little in-line text blurbs Pyle put on either side of the page describing the action -- "Robin meets a stranger on the bridge" or whatever -- and they only have one of the three King Arthur books uploaded. Still, though, I should be able to convert those HTML pages to Kindle format fairly easily, so thanks, that is a lot of the work already done for me.
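
For anyone else who wants to check whether a zipped HTML download is illustrated before bothering to convert it, here is a rough sketch using only the Python standard library. It assumes the download is an ordinary zip holding .htm/.html files that reference their pictures with plain img tags; the class and function names are made up for this example and are not part of any Gutenberg tool.

```python
import zipfile
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are routed here as well by HTMLParser.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def images_in_zipped_html(zip_path):
    """Map each HTML file in the archive to the image sources it references.

    zip_path may be a file name or a file-like object (hypothetical helper).
    """
    found = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".htm", ".html")):
                # Assumption: UTF-8 is close enough; undecodable bytes are replaced.
                text = archive.read(name).decode("utf-8", errors="replace")
                parser = ImgCollector()
                parser.feed(text)
                parser.close()
                found[name] = parser.sources
    return found
```

Calling images_in_zipped_html("robin-hood-h.zip") (a placeholder file name) returns a dictionary keyed by HTML file; an empty list for a file means that copy is text only.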