PDF Image -> OCR -> text

frikk · 04-14-2009, 09:27 PM

Hey guys,
I haven't been able to track this one down in the forum. I have a bunch of PDFs that are images (i.e. not searchable) but are really just plain text documents. I believe I'd want to OCR the documents to make them searchable, right? What's the best way to do this? I hear you can do a 'paper capture' in Acrobat, but I don't have access to that software at the moment. Are there any good free tools out there?

Thanks!
Blaine

Andybaby · 04-15-2009, 12:38 AM

If you want to make them readable, the best program to use is ABBYY.

Adobe will make a PDF searchable if you need to find a term inside it, but if you try to extract the text after that, it will look pretty bad.

=X= · 04-15-2009, 01:55 AM

The only free OCR programs I know of are Tesseract and GOCR. Tesseract is an open source OCR engine maintained by Google. They use a more optimized OCR for their book scans, but I tend to see the same errors in their scans as in mine.

Tesseract only OCRs uncompressed TIFF files, but there are some free GUIs, like Softi FreeOCR, that support more image formats.

I do have a PDF->Text solution, but it's not for the faint of heart.
It requires Cygwin (for perl, pdftoppm, ppm2tiff, and ImageMagick's convert).

The Perl script looks for all the PDFs in a directory, then extracts each page of each PDF into a text file. It's great for batch jobs.

=X=
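The pipeline described above (render each PDF page to an image, convert it to uncompressed TIFF, hand it to Tesseract) can be sketched roughly as follows. This is not =X='s Perl script; it is a minimal Python sketch that assumes `pdfinfo` and `pdftoppm` (from xpdf or poppler), ImageMagick's `convert`, and `tesseract` are all on the path. The function names, the 300 dpi setting, and the output file naming are illustrative assumptions, and `convert` alone stands in for the ppm-to-TIFF step.

```python
import glob
import os
import re
import subprocess
import tempfile


def parse_page_count(pdfinfo_output):
    """Pull the page count out of the text printed by `pdfinfo file.pdf`."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def render_cmd(pdf_path, page, out_root, dpi=300):
    """pdftoppm command that renders one page to <out_root>-<page>.ppm."""
    return ["pdftoppm", "-f", str(page), "-l", str(page),
            "-r", str(dpi), pdf_path, out_root]


def tiff_cmd(ppm_path, tif_path):
    """ImageMagick command that writes an uncompressed TIFF."""
    return ["convert", ppm_path, "-compress", "none", tif_path]


def ocr_cmd(tif_path, out_base):
    """tesseract writes its result to <out_base>.txt."""
    return ["tesseract", tif_path, out_base]


def ocr_pdf(pdf_path, out_dir):
    """OCR every page of one PDF into out_dir/<name>_pNNNN.txt."""
    info = subprocess.run(["pdfinfo", pdf_path], check=True,
                          capture_output=True, text=True).stdout
    name = os.path.splitext(os.path.basename(pdf_path))[0]
    for page in range(1, parse_page_count(info) + 1):
        with tempfile.TemporaryDirectory() as work:
            root = os.path.join(work, "page")
            subprocess.run(render_cmd(pdf_path, page, root), check=True)
            # The page-number suffix is zero-padded differently by different
            # pdftoppm builds, so look the file up instead of guessing it.
            ppm = glob.glob(root + "-*.ppm")[0]
            tif = os.path.join(work, "page.tif")
            subprocess.run(tiff_cmd(ppm, tif), check=True)
            out_base = os.path.join(out_dir, "%s_p%04d" % (name, page))
            subprocess.run(ocr_cmd(tif, out_base), check=True)


def ocr_directory(pdf_dir, out_dir):
    """Batch job: run ocr_pdf on every *.pdf in pdf_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for pdf_path in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
        ocr_pdf(pdf_path, out_dir)
```

Calling `ocr_directory("scans", "text")` would leave one text file per page in `text`. Rendering one page at a time keeps the temporary images small, at the cost of starting `pdftoppm` once per page.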

Jellby · 04-15-2009, 05:55 AM

Quote:

Originally Posted by =X=

I do have a PDF->Text solution, but it's not for the faint of heart.
It requires Cygwin (for perl, pdftoppm, ppm2tiff, and ImageMagick's convert).

Would you mind sharing it? I use Linux, so I have all of those.

=X= · 04-15-2009, 03:49 PM

Sure, it's just 3 scripts that are not overly impressive, but they do the job for me.
Make sure you've installed Tesseract and have it in your path.

xpdf2txt_byPage.pl: Extracts one page at a time and converts it to text. The final product is a text file and a JPG.

xpng2txt_byPage: Converts any PNG file in the same directory to text.

xTxt2HTML: Creates one HTML file from the generated text files. (NOTE: You might have to run dos2unix first; on Cygwin you do.)

(NOTE: Some of the executables have their paths hard-coded. If the scripts do not work, remove the paths.)

=X=
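For readers without the attachments, the last step (many per-page text files into one HTML file, with the line endings cleaned up first) could look something like the sketch below. It is an assumption about what such a script does, not a port of xTxt2HTML: the function names, the `<pre>`-per-page layout, and the reliance on file names that sort in page order are all illustrative choices.

```python
import glob
import html
import os


def normalize_newlines(text):
    """Do what dos2unix does: turn CRLF (and stray CR) line endings into LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def pages_to_html(page_texts, title="OCR output"):
    """Wrap each page's text in a <pre> block inside one HTML document."""
    parts = ["<html><head><title>%s</title></head><body>" % html.escape(title)]
    for number, text in enumerate(page_texts, start=1):
        # Escape &, < and > so OCR output cannot break the markup.
        clean = html.escape(normalize_newlines(text).strip())
        parts.append("<h2>Page %d</h2>\n<pre>%s</pre>" % (number, clean))
    parts.append("</body></html>")
    return "\n".join(parts)


def build_html(txt_dir, html_path):
    """Combine every *.txt file in txt_dir (sorted by name) into one HTML file."""
    texts = []
    for path in sorted(glob.glob(os.path.join(txt_dir, "*.txt"))):
        # newline="" keeps the raw line endings so normalize_newlines sees them.
        with open(path, encoding="utf-8", errors="replace", newline="") as handle:
            texts.append(handle.read())
    with open(html_path, "w", encoding="utf-8") as handle:
        handle.write(pages_to_html(texts))
```

Sorting by file name only gives the right page order if the page numbers are zero-padded (for example `book_p0002.txt` before `book_p0010.txt`).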

frikk · 04-15-2009, 10:21 PM

Thanks a bunch!

elegant · 07-06-2009, 06:35 PM

PDF -> OCR -> Text.

This seems like the best option.

I have a lot of major problems with standard text extraction from PDF followed by BookDesigner. For example, italics are dropped, and there is no flexibility in how to format paragraphs or get rid of line break problems.

The question is how to find good OCR software, and a good intermediary program/format in which to reformat the text.

Ideally I think I'd find it easiest to use OOo or Word for formatting, then convert the DOC or HTML to LRF.

Unfortunately the BookDesigner way is the only one with a reasonable amount of documentation, and I don't find it flexible, nor does it produce particularly elegant results.
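The line break problem mentioned above (every printed line ending in a hard newline) can often be reduced before the text ever reaches a word processor. Here is a minimal Python sketch, assuming blank lines mark paragraph breaks and that a line-final hyphen means a word was split across lines; `unwrap_paragraphs` is a made-up name, and neither assumption holds for every scan.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into one line per paragraph.

    Blank lines are treated as paragraph breaks. A line ending in a hyphen
    is glued to the next line with the hyphen removed.
    """
    paragraphs = []
    unix_text = text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", unix_text):
        joined = ""
        for line in block.split("\n"):
            line = line.strip()
            if not line:
                continue
            if joined.endswith("-"):
                # Word split across two lines: drop the hyphen and rejoin.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        paragraphs.append(joined)
    return "\n\n".join(paragraphs)
```

Genuinely hyphenated compounds that happen to break at the hyphen lose it under this rule, so the result still needs a proofreading pass.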

DDHarriman · 07-08-2009, 02:08 PM

Hi

One of these two will serve you perfectly:

FineReader Pro 9
http://finereader.abbyy.com/
or
OmniPage Pro 17
http://www.nuance.com/imaging/omnipa...ofessional.asp

Best regards,

Elfwreck · 07-08-2009, 03:05 PM

Do you know if OmniPage is "zone-able" the way that FineReader is?

I use FR 7, but am considering upgrading, and wondering if OmniPage would work as well or better for me. I convert a lot of books with complex picture formatting, and have gotten used to FR's ability to manually zone some areas as text, some as image, some as table; does OmniPage have that option?

DDHarriman · 07-08-2009, 07:21 PM

Hi

Yes, it is.
It does everything FineReader does, and the screen layout is really similar; they look and feel like twin brothers.

OmniPage is much more expensive than FineReader and does not offer much more; many people even say FineReader is better.

I can say that I find FineReader's tools for correcting skewed pages, despeckling, cropping white space and splitting two-page scans much more intuitive and streamlined.
(I’m talking about OmniPage 16 here; I have not managed to test version 17 yet.)

My advice: if you want to upgrade, go with FineReader. I have installed it for several clients, and the price and features are a winning combination.

Best regards,
