Old 03-19-2014, 06:46 AM   #1
qsipl
Enthusiast
Posts: 25
Karma: 412584
Join Date: Feb 2014
Device: IPAD, KF8 & Tablet
OCR engine

Hi...,

Can anyone suggest an OCR engine that gives good text accuracy?
Old 03-19-2014, 06:49 AM   #2
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Abbyy FineReader is an excellent OCR package.
Old 03-20-2014, 01:33 AM   #3
wannabee
Media Bloke
Posts: 2,382
Karma: 113956855
Join Date: Sep 2010
Location: NSW - Australia
Device: iOS
I use Acrobat XI Pro. You can download a trial of all Adobe software for 30 days free or rent all 55 programs for 50 bucks a month.
Old 03-20-2014, 06:56 AM   #4
Hamlet53
Nameless Being
 
Quote:
Originally Posted by HarryT View Post
Abbyy FineReader is an excellent OCR package.
I have had very good experience with this as well. It helps that it was part of the software package that came with my scanner.
Old 03-20-2014, 09:47 AM   #5
AJ Starr
Guru
Posts: 815
Karma: 1029784
Join Date: May 2008
Location: Nebraska, USA
Device: PEZ, Color Libre, 2@Sony T1, Onyx i62HD
Quote:
Originally Posted by qsipl View Post
Hi...,

Can any one suggest OCR engine which can give good text accuracy.
I don't know about other word processors, but my WordPerfect has the ability to "Open PDF" and it will OCR the text almost perfectly. It depends on the quality of my scanned PDF document, i.e., stray lines, darkness, etc. But, provided my scanned pdf is clear, the ocr'd text is about 95% accurate.

AJ
Old 03-20-2014, 07:19 PM   #6
rkomar
Wizard
Posts: 3,054
Karma: 18821071
Join Date: Oct 2010
Location: Sudbury, ON, Canada
Device: PRS-505, PB 902, PRS-T1, PB 623, PB 840, PB 633
Quote:
Originally Posted by AJ Starr View Post
...But, provided my scanned pdf is clear, the ocr'd text is about 95% accurate.
Maybe you're just guesstimating the accuracy, but 95% is not good. 95% for characters is terrible, and 95% for words is marginally acceptable. A typical printed page has something like 50 characters per line and 40 lines per page, so about 2000 characters per page. A 95% success rate per character would result in about 100 bad characters per page. A 95% success rate per word would bring that down to about 20 or 25 bad words per page. Even 99% accuracy produces more errors than most people like. You'd have to get to about 99.9% accuracy before you could think about not proofing the text afterwards.
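The arithmetic above can be reproduced with a short sketch. The page dimensions come from the post; the words-per-page figure is an assumption (roughly what "20 or 25 bad words" at 95% implies), and `expected_errors` is just an illustrative helper name.

```python
# Rough per-page error estimate using the figures quoted in the post.
CHARS_PER_LINE = 50
LINES_PER_PAGE = 40
CHARS_PER_PAGE = CHARS_PER_LINE * LINES_PER_PAGE  # about 2000 characters

# Assumption: about 450 words per page, which is consistent with the
# post's "20 or 25 bad words per page" at 95% word accuracy.
WORDS_PER_PAGE = 450


def expected_errors(units: int, accuracy: float) -> float:
    """Expected number of wrong units (characters or words) at a given accuracy."""
    return units * (1.0 - accuracy)


if __name__ == "__main__":
    for accuracy in (0.95, 0.99, 0.999):
        bad_chars = expected_errors(CHARS_PER_PAGE, accuracy)
        bad_words = expected_errors(WORDS_PER_PAGE, accuracy)
        print(f"{accuracy:.1%}: ~{bad_chars:.0f} bad characters "
              f"or ~{bad_words:.1f} bad words per page")
```

Treating every character or word as equally likely to fail is a simplification; real OCR errors tend to cluster on poor scans.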
Old 03-20-2014, 07:21 PM   #7
susan_cassidy
Wizard
Posts: 2,251
Karma: 3720310
Join Date: Jan 2009
Location: USA
Device: Kindle, iPad (not used much for reading)
I would never NOT proof an OCRed document.
Old 03-20-2014, 08:30 PM   #8
markom
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad
Quote:
Originally Posted by susan_cassidy View Post
I would never NOT proof an OCRed document.
I would never proofread an OCR-ed document, because I would either use the exact PDF image (with the OCR layer in the background) in Abbyy Finereader or ClearScan in Acrobat for documents that need 100% exactness, or use the plain OCR-ed txt from Abbyy Finereader for novels and other documents that allow for a few mistakes here and there.

Last edited by markom; 03-20-2014 at 08:44 PM.
Old 03-20-2014, 10:31 PM   #9
AJ Starr
Guru
Posts: 815
Karma: 1029784
Join Date: May 2008
Location: Nebraska, USA
Device: PEZ, Color Libre, 2@Sony T1, Onyx i62HD
Quote:
Originally Posted by rkomar View Post
Maybe you're just guesstimating the accuracy, but 95% is not good. 95% for characters is terrible, and 95% for words is marginally acceptable. A typical printed page has something like 50 characters per line and 40 lines per page, so about 2000 characters per page. A 95% success rate per character would result in about 100 bad characters per page. A 95% success rate per word would bring that down to about 20 or 25 bad words per page. Even 99% accuracy produces more errors than most people like. You'd have to get to about 99.9% accuracy before you could think about not proofing the text afterwards.
Yes, I was guesstimating, based on an entire novel. I'm scanning 1960s-era paperbacks, which are yellowed and abused. (Though I took very good care of my PBs.)

I often got "1" instead of "I" or "l"; "m" instead of "r n"; and odd hard returns on the last line of a paragraph instead of soft returns. So for an entire novel, 95% or better is more than acceptable to me.

(My epubs come out great!)
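The "1" for "I" or "l" confusion mentioned above can be flagged for review with a couple of regular expressions. This is only a sketch: the patterns are illustrative assumptions, `flag_suspects` is a made-up helper name, and a legitimate standalone "1" (a chapter number, say) gets flagged too. The "m" for "rn" case is harder to catch with a pattern alone and really needs a word-list check.

```python
import re

# Heuristic patterns for the digit "1" turning up where a letter is likely.
# These are illustrative assumptions, not an exhaustive list.
SUSPECT_PATTERNS = [
    (re.compile(r"\b1\b"), 'standalone "1" (could be "I" or "l")'),
    (re.compile(r"[A-Za-z]1|1[A-Za-z]"), 'digit "1" next to a letter'),
]


def flag_suspects(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, matched_text, reason) for each suspicious match."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, reason in SUSPECT_PATTERNS:
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(0), reason))
    return hits


if __name__ == "__main__":
    sample = "1 think it wi1l rain in 1960."
    for lineno, found, reason in flag_suspects(sample):
        print(f"line {lineno}: {found!r} -> {reason}")
```

Years such as "1960" are left alone because the "1" there is followed by another digit.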
Old 03-21-2014, 04:51 PM   #10
Marcy
Guru
Posts: 897
Karma: 950683
Join Date: Oct 2009
Device: Kobo Libra2
Quote:
Originally Posted by AJ Starr View Post
Yes, I was guesstimating, based on an entire novel. I'm scanning 1960s-era paperbacks, which are yellowed and abused. (Though I took very good care of my PBs.)

I often got "1" instead of "I" or "l"; "m" instead of "r n"; and odd hard returns on the last line of a paragraph instead of soft returns. So for an entire novel, 95% or better is more than acceptable to me.

(My epubs come out great!)
Sorry for hijacking the thread, but how long does it take you to do an entire paperback? How are you scanning?

The only way I see to do this with any speed is to take apart the book so the pages could be put through an ADF instead of having to turn the pages and flatten the book each time. Is that what you're doing or do you have an alternative?

Thanks,
Marcy
Old 03-21-2014, 07:25 PM   #11
wannabee
Media Bloke
Posts: 2,382
Karma: 113956855
Join Date: Sep 2010
Location: NSW - Australia
Device: iOS
Oh! I've never used OCR for an entire book. Mainly for documents that are easily proofed. Acrobat mixes up "1"s and "l"s too.
I remember OmniPage would flag any words not in the dictionary and provide a list of probables for you to choose from. That made proofing pretty easy. I think it's pretty expensive though.
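The general idea of flagging out-of-dictionary words and offering probable replacements can be sketched in a few lines. This is not how OmniPage works internally; the tiny `KNOWN_WORDS` set and the `SWAPS` table are hypothetical stand-ins for a real dictionary and a real confusion list.

```python
import re

# Hypothetical tiny word list for illustration; a real run would load a
# full dictionary file instead.
KNOWN_WORDS = {"the", "modern", "corner", "of", "a", "room", "i", "think"}

# Substitutions to try on unknown words: (what the OCR gave, what it may have been).
SWAPS = [("m", "rn"), ("1", "l"), ("1", "i")]


def suggestions(word: str) -> list[str]:
    """Known words reachable by undoing one likely OCR confusion."""
    lower = word.lower()
    found = []
    for wrong, right in SWAPS:
        start = lower.find(wrong)
        while start != -1:
            candidate = lower[:start] + right + lower[start + len(wrong):]
            if candidate in KNOWN_WORDS and candidate not in found:
                found.append(candidate)
            start = lower.find(wrong, start + 1)
    return found


def flag_unknown(text: str) -> dict[str, list[str]]:
    """Map each word missing from the dictionary to its suggested fixes."""
    report = {}
    for word in re.findall(r"[A-Za-z0-9']+", text):
        if word.lower() not in KNOWN_WORDS:
            report[word] = suggestions(word)
    return report


if __name__ == "__main__":
    print(flag_unknown("the modem comer of a room"))
```

A check like this misses errors that happen to produce another valid word, so with a full dictionary "modem" itself would pass unflagged.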
Old 03-21-2014, 10:36 PM   #12
Ripplinger
350 Hoarder
Posts: 3,574
Karma: 8281267
Join Date: Dec 2010
Location: Midwest USA
Device: Sony PRS-350, Kobo Glo & Glo HD, PW2
Quote:
Originally Posted by Marcy View Post
Sorry for hijacking the thread, but how long does it take you to do an entire paperback? How are you scanning?

The only way I see to do this with any speed is to take apart the book so the pages could be put through an ADF instead of having to turn the pages and flatten the book each time. Is that what you're doing or do you have an alternative?

Thanks,
Marcy
It takes me about an hour to scan a hardcover book (typical size of around 250 pages) on a flatbed scanner. This is just an inexpensive, slow scanner by Canon, but it works very well. I just park the scanner and myself in front of the TV, set ABBYY to scan every 5-7 seconds, and just press down against the binding to flatten the double pages while it scans. That gives me enough time to turn the page and position the book and not be super rushed doing it.

The time pretty much flies if you're busy doing something else, so find a good TV show you're interested in and you'll be done before you know it. I did my first book just standing there next to my computer and could barely stand doing 25 pages a day.
Old 03-22-2014, 09:10 AM   #13
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Marcy View Post
Sorry for hijacking the thread, but how long does it take you to do an entire paperback? How are you scanning?

The only way I see to do this with any speed is to take apart the book so the pages could be put through an ADF instead of having to turn the pages and flatten the book each time. Is that what you're doing or do you have an alternative?
There are a lot of methods of digitizing text at http://www.diybookscanner.org/

There is pretty much:
  • Destructive
    • Cut off binding, feed through a machine.
      • Advantage: Fast, high quality scans.
      • Disadvantage: You "lose" the book. (you just get sheets of paper out of it)
  • Non-destructive
    • Take Images using a camera
      • Advantage: Fast
      • Disadvantage: Might not be high enough resolution/DPI (may look fine to the human eye, but be inaccurate when OCRed). Depending on your setup, you may get inconsistent images.
    • Scanner
      • Advantage: High quality.
      • Disadvantage: SLOOOOOOOW

Quote:
Originally Posted by rkomar View Post
Maybe you're just guesstimating the accuracy, but 95% is not good. 95% for characters is terrible, and 95% for words is marginally acceptable. A typical printed page has something like 50 characters per line and 40 lines per page, so about 2000 characters per page. A 95% success rate per character would result in about 100 bad characters per page. A 95% success rate per word would bring that down to about 20 or 25 bad words per page. Even 99% accuracy produces more errors than most people like. You'd have to get to about 99.9% accuracy before you could think about not proofing the text afterwards.
Even "99.9%" accuracy leaves an unacceptable number of errors when reading. I just completed a 430-page non-fiction economics book with a character count of 854,196. 99.9% accuracy means that there would be ~850 errors. I do not believe the OCR "accuracies" the companies throw out take into account formatting errors (wrong italic/bold/superscript/subscript, ...), which get introduced as well.

Then, on top of the OCR, you have to fix broken paragraphs, add proper indentation, check for missing quotation marks, add blockquotes, check for actual typos/errors in the physical/PDF book, etc.
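As one example of that cleanup, rejoining paragraphs that were split by stray line breaks can be roughed out as below. It is a minimal sketch, assuming a blank line marks a real paragraph break and every other line break is an artifact; `rejoin_paragraphs` is an illustrative name, and OCR output that doesn't follow that convention would need a different rule.

```python
def rejoin_paragraphs(text: str) -> str:
    """Merge lines that were split by stray hard returns.

    Assumption: a blank line marks a real paragraph break, and any other
    line break inside a paragraph is left over from the OCR/export step.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # Blank line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    raw = "It was a dark\nand stormy night.\n\nThe rain fell\nin torrents."
    print(rejoin_paragraphs(raw))
```

Words hyphenated across a line break are not repaired here and would still need a separate pass.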

I do book conversion professionally, and mostly work with non-fiction economics books (lots of footnotes). Other types of books might be easier/faster, but if I want to completely proof a book and get a completed/finalized EPUB out of it, it takes me ~8-15 hours of work (although when I first started it used to take me ~2 weeks to convert a book).

I explained a lot of the method in here:

https://www.mobileread.com/forums/sho...d.php?t=223817

and in here:

https://www.mobileread.com/forums/sho...d.php?t=234146

I personally use ABBYY Finereader (because in my testing it has been the most accurate). But the same methods should apply no matter what OCR program you are using.

Last edited by Tex2002ans; 03-22-2014 at 10:09 AM.
Old 03-22-2014, 09:19 AM   #14
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by Tex2002ans View Post
I do book conversion professionally, and mostly work with non-fiction economics books (lots of footnotes). Other types of books might be easier/faster, but if I want to completely proof a book and get a completed/finalized EPUB out of it, it takes me ~8-15 hours of work (although when I first started it used to take me ~2 weeks to convert a book).
I find it difficult to believe that you can "completely proof" a book in 8-15h. "Completely proof" to me means comparing the ebook to the original source material comma by comma, letter by letter, word by word, line by line. I've been proofreading books for many years and can't manage more than 10 pages an hour, or some 40h for a 400-page book.
Old 03-22-2014, 09:24 AM   #15
AJ Starr
Guru
Posts: 815
Karma: 1029784
Join Date: May 2008
Location: Nebraska, USA
Device: PEZ, Color Libre, 2@Sony T1, Onyx i62HD
Quote:
Originally Posted by Marcy View Post
Sorry for hijacking the thread, but how long does it take you to do an entire paperback? How are you scanning?

The only way I see to do this with any speed is to take apart the book so the pages could be put through an ADF instead of having to turn the pages and flatten the book each time. Is that what you're doing or do you have an alternative?

Thanks,
Marcy
Unlike the poster a few messages back, I take a lot longer than a couple of hours. I'm not sure how his scanner does the files, pdf or text, but my current situation is different.

I had an old (HP, I think) flatbed scanner that would OCR the text and let me take it directly to my WP. On that one it didn't take long, even for the paperbacks. However, it was an old XP-compatible scanner and it did not get upgrades for the new OS. (I'm on Win 7 now.)

My current scanner is an all-in-one and it scans nicely, but not to OCR. So.......

I flatten the pb on the screen. Set the preview to identify the two different pages (making sure they are in order) then scan. I always put weights on the pb to hold it flat. It lets me continue to scan into a multi page PDF until I save the file. I will scan a chapter at a time and save it at that point. A chapter scan, depending on the number of pages and the difficulty with light leaking in where I have to rescan, takes me about 5 to 30 minutes. I will rescan a page many times if needed to get the lettering clear.

But that is just the first step for me. Then I convert the pdf to wp, edit each chapter for errors and formatting. Then convert to epub.

So anywhere from at least a full day to a couple of weeks, depending on how much time I spend each session.

Hope this answers your question.

AJ

I have been looking at a portable double sided ASF scanner, a Brother 720D, lately, but haven't purchased it yet. It would necessitate taking my pbs apart and scanning page by page. Does anyone have one?
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.