Old 04-14-2009, 10:27 PM   #1
frikk
Amateur PRS505'er
Posts: 22
Karma: 10
Join Date: Apr 2009
Location: Cincinnati
Device: sony rs-505
PDF Image -> OCR -> text

Hey guys,
I haven't been able to track this one down in the forum. I have a bunch of PDFs that are images (i.e. not searchable), but are really just plain text documents. I believe I would want to OCR the documents to make them searchable, right? What's the best way to do this? I hear you can do a 'paper capture' in Acrobat, but that is not software I have access to at the moment. Are there any good free tools out there?

Thanks!
Blaine
Old 04-15-2009, 01:38 AM   #2
Andybaby
Wizard
Posts: 1,279
Karma: 2683
Join Date: Nov 2008
Location: New York
Device: PRS-700
If you want to make them readable, the best program to use is ABBYY.

Adobe will make it searchable if you need to find a term inside a PDF, but if you try to extract the text after that, it will look pretty bad.
Old 04-15-2009, 02:55 AM   #3
=X=
Wizard
Posts: 3,672
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
The only free OCR programs I know of are Tesseract and GOCR. Tesseract is an open source OCR engine maintained by Google (originally developed at HP). They used a more optimized OCR for their books, but I tend to see the same errors in their scans and my scans.

Tesseract only OCRs uncompressed TIFF, but there are some free GUIs like Softi FreeOCR that support more image formats.


I do have a PDF->Text solution, but it's not for the faint of heart.
It requires cygwin (for perl, pdf2ppm, ppm2tiff, and convert from ImageMagick).

The perl script looks for all the PDFs in a directory, then extracts each page of each PDF into a text file. It's great for batch jobs.
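
The batch workflow described above can be sketched in Python. This is a minimal sketch, not the perl script mentioned in this post: it assumes Poppler's `pdftoppm` and a Tesseract build that reads PNG are both on the PATH (a build that only accepts uncompressed TIFF would need an extra conversion step), and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def page_image_command(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """Build the pdftoppm call that renders each page to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", str(pdf), str(prefix)]


def ocr_command(image: Path) -> list[str]:
    """Build the tesseract call; tesseract appends .txt to the output base."""
    return ["tesseract", str(image), str(image.with_suffix(""))]


def ocr_directory(directory: Path, dpi: int = 300) -> list[Path]:
    """OCR every PDF in `directory` and return the per-page text files."""
    text_files = []
    for pdf in sorted(directory.glob("*.pdf")):
        prefix = pdf.with_suffix("")
        # Step 1: one grayscale PNG per page (assumes pdftoppm is installed).
        subprocess.run(page_image_command(pdf, prefix, dpi), check=True)
        # Step 2: OCR each page image (assumes tesseract is installed).
        for image in sorted(directory.glob(f"{prefix.name}-*.png")):
            subprocess.run(ocr_command(image), check=True)
            text_files.append(image.with_suffix(".txt"))
    return text_files
```

Calling `ocr_directory(Path("scans"))` would leave one `.txt` file next to each page image. The page glob is deliberately simple, so PDFs whose names share a prefix (for example `book.pdf` and `book-two.pdf`) should be kept in separate directories.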

=X=
Old 04-15-2009, 06:55 AM   #4
Jellby
frumious Bandersnatch
Posts: 6,308
Karma: 4898871
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by =X=
I do have a PDF->Text solution but it's not for the faint of heart.
It requires cygwin(for perl, pdf2ppm, ppm2tiff, convert(ImageMagick))

Would you mind sharing it? I use Linux, so I have all of those.
Old 04-15-2009, 04:49 PM   #5
=X=
Wizard
Posts: 3,672
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
Sure, it's just 3 scripts that are not overly impressive, but they do the job for me.
Make sure you've installed Tesseract and have it in your path.


xpdf2txt_byPage.pl: Extracts one page at a time and converts it to text. The final product is a text file and a jpg.

xpng2txt_byPage: Converts any PNG file in the same directory to text.

xTxt2HTML: Creates one HTML file from the generated text file. (NOTE: You might have to run dos2unix first; on cygwin you do.)


(NOTE: Some of the executables have their paths hard coded. If the scripts do not work, remove the paths.)
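
The last step, turning OCR text into a single HTML file, can be sketched in Python as follows. This is an illustration rather than the attached xTxt2HTML script: the function names, the one-`<pre>`-per-page layout, and the assumption of one text file per page are all made up here. The line-ending cleanup stands in for the dos2unix step.

```python
import html
from pathlib import Path


def pages_to_html(pages: list[str], title: str = "OCR output") -> str:
    """Join per-page OCR text into one HTML document, one <pre> per page."""
    body = []
    for number, text in enumerate(pages, start=1):
        # Normalize DOS (CRLF) and bare CR line endings to LF, like dos2unix.
        clean = text.replace("\r\n", "\n").replace("\r", "\n").strip()
        body.append(
            f'<h2 id="page-{number}">Page {number}</h2>\n'
            f"<pre>{html.escape(clean)}</pre>"
        )
    head = (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n<body>\n"
    )
    return head + "\n".join(body) + "\n</body></html>\n"


def write_html(text_files: list[Path], output: Path) -> None:
    """Read the page files in the given order and write one HTML file."""
    pages = [p.read_text(encoding="utf-8", errors="replace") for p in text_files]
    output.write_text(pages_to_html(pages), encoding="utf-8")
```

Escaping with `html.escape` matters because OCR output often contains stray `<` or `&` characters that would otherwise break the page.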

=X=
Attached Files
File Type: zip PDF2TXT.zip (2.3 KB, 333 views)
Old 04-15-2009, 11:21 PM   #6
frikk
Amateur PRS505'er
Posts: 22
Karma: 10
Join Date: Apr 2009
Location: Cincinnati
Device: sony rs-505
Thanks a bunch!
Old 07-06-2009, 07:35 PM   #7
elegant
Member
Posts: 12
Karma: 10
Join Date: Apr 2009
Device: sony reader
PDF -> OCR -> Text.

This seems like the best option.

I have a lot of major problems with standard text extraction from PDF followed by BookDesigner: for example, italics are dropped, and there is no flexibility in how to format paragraphs or get rid of line break problems.

The question is how to find good OCR software, and a good intermediary program or format in which to reformat the text.

Ideally, I think I'd find it easiest to use OOo or Word for formatting, then convert the DOC or HTML to LRF.

Unfortunately, the BookDesigner route is the only one with a reasonable amount of documentation, and I don't find it flexible or its results particularly elegant.
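
One way to handle the line break problem before formatting in OOo or Word is to unwrap the text first. The following Python sketch is only a heuristic, and the function name is hypothetical: it assumes blank lines mark paragraph breaks, and it rejoins a word split by a hyphen at a line end, which will also wrongly merge genuine compounds that happen to break there (for example "well-" / "known").

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; blank lines are kept as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    result = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-") and line[:1].islower():
                # Likely a word hyphenated across the line break: rejoin it.
                joined = joined[:-1] + line
            else:
                joined += " " + line
        result.append(joined)
    return "\n\n".join(result)
```

The output has one line per paragraph, which word processors generally import as ordinary paragraphs.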
Old 07-08-2009, 03:08 PM   #8
DDHarriman
Guru
Posts: 854
Karma: 1200
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
Hi

One of these two will serve you perfectly:

Finereader Pro 9
http://finereader.abbyy.com/
or
Omnipage Pro 17
http://www.nuance.com/imaging/omnipa...ofessional.asp

Best regards,
Old 07-08-2009, 04:05 PM   #9
Elfwreck
Grand Sorcerer
Posts: 5,140
Karma: 24387938
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Clié; PRS-505; EZR Pocket Pro, PRS-600, Kobo Mini
Do you know if Omnipage is "zone-able" the way that Finereader is?

I use FR 7, but am considering upgrading, and wondering if Omnipage would work as well or better for me. I convert a lot of books with complex picture formatting, and have gotten used to FR's ability to manually zone some areas as text, some as image, some as table; does Omnipage have that option?
Old 07-08-2009, 08:21 PM   #10
DDHarriman
Guru
Posts: 854
Karma: 1200
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
Hi

Yes, it is.
It does everything Finereader does, and the screen layout is really similar; they look and feel like twin brothers.

Omnipage is much more expensive than Finereader and does not offer much more; many people even say Finereader is better.

I can say that I find Finereader's tools for correcting skewed pages, despeckling, cropping white space, and splitting two-page scans much more intuitive and streamlined.
(I'm talking about Omnipage 16 here; I have not managed to test version 17 yet.)

My advice: if you want to upgrade, go with Finereader. I have it installed at several clients, and the price and features are a winning combination.

Best regards,