General scanning/OCR advice?

bfollowell · 10-30-2010, 04:26 AM

I am preparing to start my first major scanning/conversion process and am curious what tools most of you use.

From what I've seen and read, FineReader seems to be pretty much the standard for OCR work. Unfortunately, I can't afford $400 for an OCR tool no matter how awesome it is. It seems like I may have a very old version lying around though, possibly v5.

What file format do most of you find gives you the best results for OCR work? I'm sure tifs are great but they can take up a ton of space. jpgs are much smaller but I worry about artifacts causing bad results. I've heard pngs give fairly decent results at a decent size.

Are there any good OCR tools that you can just point at a directory of page scan images and let work through everything automatically, or do you tend to go through page by page?
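On the batch question, one possible shape is a small script that walks a folder and hands each page image to a command-line OCR engine. The sketch below is only an illustration: it assumes the open-source Tesseract command-line tool is installed (nobody in this thread mentions it), and the function names and the `scans` folder are made up for the example.

```python
import subprocess
from pathlib import Path

# Image types to pick up; adjust to whatever your scanner produces.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def find_pages(scan_dir):
    """Return the page images in scan_dir, sorted by file name."""
    return sorted(
        p for p in Path(scan_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def build_command(image_path, lang="eng"):
    """Build the Tesseract call for one page (assumes `tesseract` is on PATH).

    Tesseract appends ".txt" to the output base itself, so page_001.png
    ends up as page_001.txt next to the image.
    """
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(scan_dir, lang="eng"):
    """Run OCR on every page image in scan_dir, one after another."""
    for page in find_pages(scan_dir):
        # check=True stops the run if Tesseract reports a failure on a page.
        subprocess.run(build_command(page, lang), check=True)
```

Calling `ocr_directory("scans")` would then leave one text file per page beside the images. Zero-padded file names (page_001, page_002, ...) keep the pages in reading order when sorted.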

Finally, do you try to scan in such a way that your OCR tool will recognize italics and other special formatting, or do you pretty much try to capture dumb text and then add the special formatting later?

Thanks for any information or advice any of you may be able to offer.

Sincerely,
- Byron Followell

hernep · 10-30-2010, 04:51 AM

I have used FreeOCR. It scans and does OCR. It shows the result in its own window, where you can edit and save it. It's a free program. The only bad thing is that somehow I haven't got good results with quirky letters like ä, ö, å.
But if you do English only, it works pretty well for the price.

http://www.paperfile.net/

The program does not save the scanned files anywhere, but do you really need them after OCR?

Iain · 10-31-2010, 06:08 AM

I've written a long post on my experience with scanning here.
Briefly, I used FineReader 10, which cost around 60 quid ($100), a guillotine ($200) and a Fujitsu fi6130 (£600), which was the largest cost.

You have to make a few decisions. Are you prepared to destroy your books (cutting the spines off)? This allows a vastly quicker process. How important are errors to you (if you hate typos, output to PDF; otherwise ePub makes sense)? How do you value your time against your spending?

On output, if you pick PDF (or PDF/A) you will get the book out at 80MB (the TIFF file is 1GB), which is a good copy of the original. If you get the book into ePub format then it will be 1MB. I personally don't like reading PDF - I want to be able to set the font size and reflow the book.

Finally, even with FineReader 10, the quality varies from book to book. Mostly it is very good (character errors in the 1 in 10,000 range at a guess - formatting is less good). With some books, though (probably font related), it makes more or less consistent errors (little -> lidle, perhaps). With decorative fonts, especially in chapter headings, drop caps and initial paragraph text, it can get things wrong more often.
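Consistent misreadings like the little -> lidle one can be cleaned up in bulk after OCR instead of by hand. A minimal sketch, assuming you keep a per-book table of known misreadings (the table and the function name here are hypothetical, not anything FineReader provides):

```python
import re

# Hypothetical correction table: OCR misreading -> intended word.
# Keys must be lower case. "lidle" -> "little" is the example from the
# post; add whatever errors a given book keeps producing.
CORRECTIONS = {
    "lidle": "little",
}


def fix_consistent_errors(text, corrections=CORRECTIONS):
    """Replace whole-word OCR misreadings, keeping a leading capital."""
    if not corrections:
        return text
    # \b on both sides so only whole words match, not parts of longer words.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in corrections) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        found = match.group(0)
        fixed = corrections[found.lower()]
        # Preserve sentence-initial capitalisation ("Lidle" -> "Little").
        return fixed.capitalize() if found[0].isupper() else fixed

    return pattern.sub(replace, text)
```

This only makes sense for misreadings that are not real words themselves. Anything that could be legitimate text still needs a human to look at it.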

Also, if you use the 'cut the spines off' approach you will get feed errors, so you need to think about how to repair or re-process books which have stuck, missing, angled or torn pages.

Iain
