OCR engine

qsipl · 03-19-2014, 06:46 AM

Hi...,

Can anyone suggest an OCR engine that gives good text accuracy?

HarryT · 03-19-2014, 06:49 AM

Abbyy FineReader is an excellent OCR package.

wannabee · 03-20-2014, 01:33 AM

I use Acrobat XI Pro. You can download a trial of all Adobe software for 30 days free or rent all 55 programs for 50 bucks a month.

Hamlet53 · 03-20-2014, 06:56 AM

Quote:

Originally Posted by HarryT

Abbyy FineReader is an excellent OCR package.

I have had very good experience with this as well. It helps that it was part of the software package that came with my scanner.

AJ Starr · 03-20-2014, 09:47 AM

Quote:

Originally Posted by qsipl

Hi...,

Can anyone suggest an OCR engine that gives good text accuracy?

I don't know about other word processors, but my WordPerfect has the ability to "Open PDF" and it will OCR the text almost perfectly. It depends on the quality of my scanned PDF document, i.e., stray lines, darkness, etc. But, provided my scanned PDF is clear, the OCR'd text is about 95% accurate.

AJ

rkomar · 03-20-2014, 07:19 PM

Quote:

Originally Posted by AJ Starr

...But, provided my scanned pdf is clear, the ocr'd text is about 95% accurate.

Maybe you're just guesstimating the accuracy, but 95% is not good. 95% for characters is terrible, and 95% for words is marginally acceptable. A typical printed page has something like 50 characters per line and 40 lines per page, so about 2000 characters per page. A 95% success rate per character would result in about 100 bad characters per page. A 95% success rate per word would bring that down to about 20 or 25 bad words per page. Even 99% accuracy produces more errors than most people like. You'd have to get to about 99.9% accuracy before you could think about not proofing the text afterwards.
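The arithmetic above can be sketched in a few lines of Python. The page dimensions come from the post; the figure of about 5 characters per word (`chars_per_word`) is an assumption added here to convert characters into words, and the function name is made up for illustration.

```python
def errors_per_page(accuracy, chars_per_line=50, lines_per_page=40, chars_per_word=5):
    """Return (bad_chars, bad_words) per page.

    bad_chars treats `accuracy` as a per-character rate,
    bad_words treats it as a per-word rate.
    chars_per_word=5 is an assumed average, not a figure from the thread.
    """
    chars = chars_per_line * lines_per_page      # about 2000 characters per page
    words = chars / chars_per_word               # about 400 words per page
    error_rate = 1.0 - accuracy
    return round(chars * error_rate), round(words * error_rate)


for acc in (0.95, 0.99, 0.999):
    bad_chars, bad_words = errors_per_page(acc)
    print(f"{acc:.1%}: ~{bad_chars} bad characters or ~{bad_words} bad words per page")
```

With shorter words (say 4 characters each) the page holds about 500 words, which is where the upper figure of 25 bad words per page at 95% comes from.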

susan_cassidy · 03-20-2014, 07:21 PM

I would never NOT proof an OCRed document.

markom · 03-20-2014, 08:30 PM

Quote:

Originally Posted by susan_cassidy

I would never NOT proof an OCRed document.

I would never proofread an ocr-ed document because I would either use exact pdf image (ocr layer in the background) in Abbyy Finereader or clearscan in Acrobat for documents that need 100% exactness, or would use plain ocr-ed txt from Abbyy Finereader for novels and other documents that allow for a few mistakes here and there.

AJ Starr · 03-20-2014, 10:31 PM

Quote:

Originally Posted by rkomar

Maybe you're just guesstimating the accuracy, but 95% is not good. 95% for characters is terrible, and 95% for words is marginally acceptable. A typical printed page has something like 50 characters per line and 40 lines per page, so about 2000 characters per page. A 95% success rate per character would result in about 100 bad characters per page. A 95% success rate per word would bring that down to about 20 or 25 bad words per page. Even 99% accuracy produces more errors than most people like. You'd have to get to about 99.9% accuracy before you could think about not proofing the text afterwards.

Yes, I was guesstimating, based on an entire novel. Scanning 1960s-era paperbacks which are yellowed and abused. (Though I took very good care of my PBs.)

I often got "1" instead of "I" or "l"; "m" instead of "r n"; odd hard returns on the last line of a paragraph instead of soft returns. So for an entire novel, 95% or better is more than acceptable to me.

(My epubs come out great!)
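Some of these confusions can be flagged automatically before proofing. What follows is only a minimal sketch, not a feature of any OCR package mentioned in this thread: it lists tokens that mix letters and digits, which catches a "1" that landed inside a word. It will not catch a standalone "1" that should be "I", or "m" read for "r n"; those need a dictionary check or a human eye. The function name and sample sentence are hypothetical.

```python
import re

# A token that contains at least one letter AND at least one digit,
# e.g. "he1m" or "sa1d". Such tokens are rarely real words in a novel.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def suspicious_tokens(text):
    """Return the tokens in `text` that contain both a letter and a digit."""
    return MIXED.findall(text)


sample = "1 went to the he1m and sa1d nothing in 1965."
print(suspicious_tokens(sample))
```

Pure numbers such as "1965" are left alone, so dates and page references do not clutter the list.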

Marcy · 03-21-2014, 04:51 PM

Quote:

Originally Posted by AJ Starr

Yes, I was guesstimating, based on an entire novel. Scanning 1960s-era paperbacks which are yellowed and abused. (Though I took very good care of my PBs.)

I often got "1" instead of "I" or "l"; "m" instead of "r n"; odd hard returns on the last line of a paragraph instead of soft returns. So for an entire novel, 95% or better is more than acceptable to me.

(My epubs come out great!)

Sorry for hijacking the thread, but how long does it take you to do an entire paperback? How are you scanning?

The only way I see to do this with any speed is to take apart the book so the pages could be put through an ADF instead of having to turn the pages and flatten the book each time. Is that what you're doing or do you have an alternative?

Thanks,
Marcy

wannabee · 03-21-2014, 07:25 PM

Oh! I've never used OCR for an entire book. Mainly for documents that are easily proofed. Acrobat mixes up "1"s and "l"s too.
I remember OmniPage would flag any words not in the dictionary and provide a list of probables for you to choose from. That made proofing pretty easy. I think it's pretty expensive though.
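The same idea (flag anything not in a dictionary, then offer likely candidates) can be roughed out with Python's standard library. This is not how OmniPage does it, just a sketch of the general technique; `KNOWN_WORDS` is a tiny made-up word list standing in for a real dictionary file, and the function name is hypothetical.

```python
import difflib
import re

# Tiny stand-in word list; a real check would load a full dictionary.
KNOWN_WORDS = {"the", "modern", "corner", "of", "town", "burned", "in"}


def flag_unknown(text, known=KNOWN_WORDS):
    """Map each word not found in `known` to a list of close dictionary matches."""
    flagged = {}
    for word in re.findall(r"[A-Za-z']+", text.lower()):
        if word not in known:
            # get_close_matches ranks candidates by similarity ratio (cutoff 0.6).
            flagged[word] = difflib.get_close_matches(word, sorted(known), n=3, cutoff=0.6)
    return flagged


# "rn" misread as "m" in three words.
print(flag_unknown("The modem comer of town bumed"))
```

A check like this catches "bumed", but it cannot catch a misread that happens to be a real word, such as "modem" when the word list includes it.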

Ripplinger · 03-21-2014, 10:36 PM

Quote:

Originally Posted by Marcy

Sorry for hijacking the thread, but how long does it take you to do an entire paperback? How are you scanning?

The only way I see to do this with any speed is to take apart the book so the pages could be put through an ADF instead of having to turn the pages and flatten the book each time. Is that what you're doing or do you have an alternative?

Thanks,
Marcy

It takes me about an hour to scan a hardcover book (typical size of around 250 pages) on a flatbed scanner. This is just an inexpensive, slow scanner by Canon, but it works very well. I just park the scanner and myself in front of the TV, set ABBYY to scan every 5-7 seconds, and just press down against the binding to flatten the double pages while it scans. That gives me enough time to turn the page and position the book and not be super rushed doing it.

The time pretty much flies if you're busy doing something else, so find a good TV show you're interested in and you'll be done before you know it. I did my first book just standing there next to my computer and could barely stand doing 25 pages a day.

Tex2002ans · 03-22-2014, 09:10 AM

Quote:

Originally Posted by Marcy

Sorry for hijacking the thread, but how long does it take you to do an entire paperback? How are you scanning?

The only way I see to do this with any speed is to take apart the book so the pages could be put through an ADF instead of having to turn the pages and flatten the book each time. Is that what you're doing or do you have an alternative?

There are a lot of methods of digitizing text at http://www.diybookscanner.org/

There are pretty much two broad approaches:

Destructive
- Cut off binding, feed through a machine.
  - Advantage: Fast, high quality scans.
  - Disadvantage: You "lose" the book (you just get sheets of paper out of it).
Non-destructive
- Take images using a camera
  - Advantage: Fast
  - Disadvantage: Might not be high enough resolution/DPI (may look fine to the human eye, but be inaccurate when OCRed). Depending on your setup, you may get inconsistent images.
- Scanner
  - Advantage: High quality.
  - Disadvantage: SLOOOOOOOW

Quote:

Originally Posted by rkomar

Maybe you're just guesstimating the accuracy, but 95% is not good. 95% for characters is terrible, and 95% for words is marginally acceptable. A typical printed page has something like 50 characters per line and 40 lines per page, so about 2000 characters per page. A 95% success rate per character would result in about 100 bad characters per page. A 95% success rate per word would bring that down to about 20 or 25 bad words per page. Even 99% accuracy produces more errors than most people like. You'd have to get to about 99.9% accuracy before you could think about not proofing the text afterwards.

Even "99.9%" accuracy leaves an unacceptable number of errors when reading. I just completed a 430-page non-fiction economics book; the character count is 854196 characters. 99.9% accuracy means that there would be ~850 errors. I do not believe the OCR "accuracies" the companies throw out take into account formatting errors (wrong italic/bold/superscript/subscript, ...), which get introduced as well.
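For anyone who wants to run the numbers on their own book, here is a small sketch. The character and page counts are the ones quoted above; the "one error per ten pages" target is a hypothetical example, and the calculation assumes the quoted accuracy is a per-character rate.

```python
TOTAL_CHARS = 854_196  # character count quoted in the post
PAGES = 430            # page count quoted in the post


def expected_errors(chars, accuracy):
    """Expected number of wrong characters at a given per-character accuracy."""
    return chars * (1.0 - accuracy)


def required_accuracy(chars, max_errors):
    """Per-character accuracy needed to stay at or below `max_errors`."""
    return 1.0 - max_errors / chars


errors = expected_errors(TOTAL_CHARS, 0.999)
print(f"99.9% accuracy: ~{errors:.0f} errors, ~{errors / PAGES:.1f} per page")

# Hypothetical target: no more than one error every ten pages.
target = required_accuracy(TOTAL_CHARS, PAGES / 10)
print(f"One error per ten pages needs ~{target:.4%} accuracy")
```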

Then on top of the OCR, you have to fix broken paragraphs, add in proper indentation, check for missing quotation marks, add in blockquotes, check for actual typos/errors in the physical/PDF book, etc.
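As an illustration of the "broken paragraphs" step only, here is a rough sketch of one common heuristic: if a line does not end with sentence-ending punctuation, assume the paragraph was split by a stray hard return and merge it with the next line. It assumes the OCR output has one paragraph per line apart from those stray breaks, and it will misfire on headings and on abbreviations at line ends, so the result still needs checking by eye. The function name is hypothetical.

```python
def join_broken_lines(text):
    """Merge lines split mid-sentence; return paragraphs separated by blank lines."""
    # Characters that plausibly end a paragraph (assumed list, adjust per book).
    enders = (".", "!", "?", '"', "\u201d", ":")
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines, including ones left by page breaks
        current.append(stripped)
        if stripped.endswith(enders):
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing fragment with no closing punctuation
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


raw = "The price level rose\n\nsharply in 1923.\nWages lagged behind."
print(join_broken_lines(raw))
```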

I do book conversion professionally, and mostly work with non-fiction economics books (lots of footnotes). Other types of books might be easier/faster, but if I want to completely proof a book and get a completed/finalized EPUB out of it, it takes me ~8-15 hours of work (although when I first started it used to take me ~2 weeks to convert a book).

I explained a lot of the method in here:

https://www.mobileread.com/forums/sho...d.php?t=223817

and in here:

https://www.mobileread.com/forums/sho...d.php?t=234146

I personally use ABBYY Finereader (because in my testing it has been the most accurate). But the same methods should apply no matter what OCR program you are using.

HarryT · 03-22-2014, 09:19 AM

Quote:

Originally Posted by Tex2002ans

I do book conversion professionally, and mostly work with non-fiction economics books (lots of footnotes). Other types of books might be easier/faster, but if I want to completely proof a book and get a completed/finalized EPUB out of it, it takes me ~8-15 hours of work (although when I first started it used to take me ~2 weeks to convert a book).

I find it difficult to believe that you can "completely proof" a book in 8-15h. "Completely proof" to me means comparing the ebook to the original source material comma by comma, letter by letter, word by word, line by line. I've been proof-reading books for many years and can't manage more than 10 pages an hour, or some 40h for a 400 page book.
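To put the two figures side by side, here is the arithmetic as a short sketch. Applying the 8-15 hour range to the 430-page book mentioned earlier is an assumption made for the comparison; the post did not say how long that particular book took.

```python
def proofing_hours(pages, pages_per_hour):
    """Hours needed to proof `pages` at a steady rate."""
    return pages / pages_per_hour


def implied_rate(pages, hours):
    """Pages per hour implied by finishing `pages` in `hours`."""
    return pages / hours


print(f"400 pages at 10 pages/hour: {proofing_hours(400, 10):.0f} hours")

# Assumption: the 8-15 hour range is applied to a 430-page book.
for hours in (8, 15):
    print(f"430 pages in {hours} hours: ~{implied_rate(430, hours):.0f} pages/hour")
```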

AJ Starr · 03-22-2014, 09:24 AM

Quote:

Originally Posted by Marcy

Sorry for hijacking the thread, but how long does it take you to do an entire paperback? How are you scanning?

The only way I see to do this with any speed is to take apart the book so the pages could be put through an ADF instead of having to turn the pages and flatten the book each time. Is that what you're doing or do you have an alternative?

Thanks,
Marcy

Unlike the poster a few messages back, I take a lot longer than a couple of hours. I'm not sure how his scanner does the files, pdf or text, but my current situation is different.

I had an old (HP I think) flat bed scanner that would OCR the text and let me take it directly to my WP. On that one it didn't take long, even the paperbacks. However, it was an old XP compatible scanner and it did not get upgrades with the new OS. (I'm on Win 7 now)

My current scanner is an all-in-one and it scans nicely but not to OCR. So.......

I flatten the pb on the screen. Set the preview to identify the two different pages (making sure they are in order) then scan. I always put weights on the pb to hold it flat. It lets me continue to scan into a multi page PDF until I save the file. I will scan a chapter at a time and save it at that point. A chapter scan, depending on the number of pages and the difficulty with light leaking in where I have to rescan, takes me about 5 to 30 minutes. I will rescan a page many times if needed to get the lettering clear.

But that is just the first step for me. Then I convert the pdf to wp, edit each chapter for errors and formatting. Then convert to epub.

So anywhere from a full day to a couple of weeks, depending on how much time I spend each session.

Hope this answers your question.

AJ

I have been looking at a portable double sided ASF scanner, a Brother 720D, lately, but haven't purchased it yet. It would necessitate taking my pbs apart and scanning page by page. Does anyone have one?
