04-14-2009, 09:27 PM | #1 |
Amateur PRS505'er
Posts: 22
Karma: 10
Join Date: Apr 2009
Location: Cincinnati
Device: sony rs-505
|
PDF Image -> OCR -> text
hey guys,
I haven't been able to track this one down in the forum. I have a bunch of PDFs that are scanned images (i.e. not searchable) but are really just plain-text documents. I believe I need to OCR them to make them searchable, right? What's the best way to do this? I hear you can do a 'paper capture' in Acrobat, but I don't have access to that software at the moment. Are there any good free tools out there? Thanks! Blaine |
04-15-2009, 12:38 AM | #2 |
Wizard
Posts: 1,279
Karma: 1002683
Join Date: Nov 2008
Location: New York
Device: PRS-700
|
If you want to make them readable, the best program to use is ABBYY.
Adobe will make a PDF searchable if you only need to find a term inside it, but if you try to extract the text afterwards, it will look pretty bad. |
04-15-2009, 01:55 AM | #3 |
Wizard
Posts: 3,671
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
|
The only free OCR programs I know of are Tesseract and GOCR. Tesseract is an open-source OCR engine backed by Google. They used a more optimized OCR for their books, but I tend to see the same errors in their scans and in mine.
Tesseract only OCRs uncompressed TIFF, but there are some free GUIs, like Softi FreeOCR, that support more image formats. I do have a PDF->Text solution, but it's not for the faint of heart. It requires Cygwin (for perl, pdftoppm, ppm2tiff, and ImageMagick's convert). The perl script looks for all the PDFs in a directory, then extracts each page of each PDF into a text file. It's great for batch jobs. =X= |
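The batch job described above (find every PDF in a directory, render each page to an image, OCR each image into a text file) can be sketched roughly as follows. This is not =X='s Perl script, only a minimal Python illustration under some assumptions: Poppler's `pdftoppm` (with its `-r` and `-png` options) and a current `tesseract` build that reads PNG directly are both on the PATH, so the TIFF conversion step is skipped. The function names and the `work_dir` layout are made up for the example.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> list[str]:
    """Build the command that renders every page of `pdf` to <out_prefix>-<page>.png."""
    # Assumption: Poppler's pdftoppm, which accepts -r (resolution) and -png.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def tesseract_command(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Build the command that OCRs `image`; tesseract writes <out_base>.txt."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_directory(pdf_dir: Path, work_dir: Path) -> None:
    """OCR every PDF in `pdf_dir`, leaving one text file per page under `work_dir`."""
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # Hypothetical layout: one sub-directory per PDF keeps page files apart.
        page_dir = work_dir / pdf.stem
        page_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(pdftoppm_command(pdf, page_dir / "page"), check=True)
        for image in sorted(page_dir.glob("page-*.png")):
            # page-01.png -> page-01.txt next to the image
            subprocess.run(tesseract_command(image, image.with_suffix("")), check=True)
```

Calling `ocr_directory(Path("scans"), Path("ocr_out"))` would then leave files such as `ocr_out/<pdf name>/page-01.txt`, assuming both tools are installed. OCR quality depends heavily on scan resolution, so the 300 dpi default is only a starting point.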
04-15-2009, 05:55 AM | #4 |
frumious Bandersnatch
Posts: 7,533
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
|
04-15-2009, 03:49 PM | #5 |
Wizard
Posts: 3,671
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
|
Sure, it's just 3 scripts that are not overly impressive, but they do the job for me.
Make sure you've installed Tesseract and have it in your path.
xpdf2txt_byPage.pl : Extracts one page at a time and converts it to text. The final product is a text file and a jpg.
xpng2txt_byPage: Converts any PNG file in the same directory to text.
xTxt2HTML: Creates one HTML file from the generated text file.
(NOTE: You might have to run dos2unix first; on Cygwin you do.)
(NOTE: Some of the executables have their paths hard-coded. If the scripts do not work, remove the paths.)
=X= |
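For readers without the attached scripts, the last step (text to a single HTML file, with the line-ending cleanup that dos2unix would do) might look something like the sketch below. It is an assumption about what such a step involves, not the contents of xTxt2HTML; the function names, the one-`<div>`-per-page markup and the use of one text file per page are all invented for the illustration.

```python
import html
from pathlib import Path


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF, much as dos2unix would."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def page_to_html(text: str, page_number: int) -> str:
    """Wrap one page of OCR text in a <div>, one <p> per blank-line-separated block."""
    blocks = [b.strip() for b in normalize_newlines(text).split("\n\n") if b.strip()]
    paragraphs = "\n".join(f"<p>{html.escape(b)}</p>" for b in blocks)
    return f'<div class="page" id="page-{page_number}">\n{paragraphs}\n</div>'


def pages_to_html(page_texts: list[str], title: str) -> str:
    """Combine the pages, in order, into one complete HTML document."""
    body = "\n".join(page_to_html(t, i) for i, t in enumerate(page_texts, start=1))
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f'<meta charset="utf-8">\n<title>{html.escape(title)}</title>\n'
        f"</head>\n<body>\n{body}\n</body>\n</html>\n"
    )


def directory_to_html(text_dir: Path, title: str) -> str:
    """Read every .txt file in `text_dir` (sorted by name) and build the HTML."""
    # Bytes are decoded by hand so the original line endings reach normalize_newlines.
    pages = [p.read_bytes().decode("utf-8", errors="replace")
             for p in sorted(text_dir.glob("*.txt"))]
    return pages_to_html(pages, title)
```

Sorting by file name only gives the right page order if the page numbers are zero-padded (page-01, page-02, ...), so that is worth checking before trusting the output.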
04-15-2009, 10:21 PM | #6 |
Amateur PRS505'er
Posts: 22
Karma: 10
Join Date: Apr 2009
Location: Cincinnati
Device: sony rs-505
|
Thanks a bunch!
|
07-06-2009, 06:35 PM | #7 |
Member
Posts: 12
Karma: 10
Join Date: Apr 2009
Device: sony reader
|
PDF -> OCR -> Text.
This seems like the best option. I have a lot of major problems with standard text extraction from PDF followed by BookDesigner: for example, italics are omitted, and there is no flexibility in how paragraphs are formatted or in getting rid of line-break problems. The question is how to find good OCR software, and a good intermediary program/format in which to reformat the text. Ideally, I think I'd find it easiest to use OOo or Word for formatting, then convert the DOC or HTML to LRF. Unfortunately, the BookDesigner route is the only one with a reasonable amount of documentation, and I don't find it flexible or the results particularly elegant. |
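On the line-break problem specifically: extracted or OCR'd text usually has a hard break at the end of every printed line, and one common cleanup is to rejoin those lines into paragraphs before loading the text into a word processor. The sketch below is one possible approach, not something BookDesigner or any tool named in this thread is known to do. It assumes paragraphs are separated by blank lines, and `unwrap_paragraphs` is a made-up name.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    text = text.replace("\r\n", "\n")
    cleaned = []
    for para in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-") and line[:1].islower():
                # Rejoin a word that was hyphenated across a line break.
                joined = joined[:-1] + line
            else:
                joined += " " + line
        cleaned.append(joined)
    return "\n\n".join(cleaned)
```

The hyphen rule is a heuristic: a genuine compound that happens to break at its hyphen ("well-" / "known") would be merged into one word, so the result still needs a proofreading pass.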
07-08-2009, 02:08 PM | #8 |
Guru
Posts: 860
Karma: 4380
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
|
Hi
One of these two will serve you perfectly: Finereader Pro 9 http://finereader.abbyy.com/ or Omnipage Pro 17 http://www.nuance.com/imaging/omnipa...ofessional.asp Best regards, |
07-08-2009, 03:05 PM | #9 |
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
|
Do you know if Omnipage is "zone-able" the way that Finereader is?
I use FR 7, but am considering upgrading, and wondering if Omnipage would work as well or better for me. I convert a lot of books with complex picture formatting, and have gotten used to FR's ability to manually zone some areas as text, some as image, some as table; does Omnipage have that option? |
07-08-2009, 07:21 PM | #10 |
Guru
Posts: 860
Karma: 4380
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
|
Hi
Yes, it is. It does everything Finereader does, and the screen layout is really similar; they look and feel like twin brothers. Omnipage is much more expensive than Finereader and does not offer much more; many people even say Finereader is better. I can say that I find Finereader's tools for correcting skewed pages, despeckling, trimming white space and splitting two-page scans much more intuitive and streamlined. (I'm talking about Omnipage 16 here; I have not managed to test version 17 yet.) My advice: if you want to upgrade, go with Finereader. I have it installed at several clients, and the price and features are a winning combination. Best regards, |
|