Old 05-18-2008, 12:49 PM   #1
pepak
Fanatic
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
OCR to use

What is the best OCR software for converting paper books to e-books? I have just tried my new OpticBook 3600 scanner to scan a book, and now I'd like to convert the images to text. I have tried the demo version of Abbyy FineReader 9 and was quite amazed at its accuracy, but I couldn't find a way to remove page headers (page number, book title, author) short of deleting them from each page manually. What software do you use?
Old 05-18-2008, 01:02 PM   #2
wayrad
Fanatic
Posts: 547
Karma: 1121392
Join Date: May 2008
Location: USA
Device: Galaxy Nexus
Quote:
Originally Posted by pepak
I have tried the demo version of Abbyy FineReader 9 and was quite amazed at its accuracy, but I couldn't find a way to remove page headers (page number, book title, author) short of deleting them from each page manually.
I also tried that demo, and I believe there was an option for removing headers and footers when saving to Word format, at least. It was somewhere in the save options.

Unfortunately the trial version doesn't let you save more than a page at a time. If I'd been able to verify that it could save multiple pages without headers/footers and properly join up sentences that continued over a page break, I might've bought it, but as it was I couldn't justify upgrading from v. 8.

In version 8 the fastest way I've found for deleting headers and footers is a page crop, but if anyone knows a better way I'd love to hear it.
Old 05-18-2008, 01:42 PM   #3
pepak
Fanatic
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
I sent an email to Abbyy, asking about page headers as well as the page-spanning paragraphs. If they answer, I'll paste it here.
Old 05-18-2008, 02:08 PM   #4
wayrad
Fanatic
Posts: 547
Karma: 1121392
Join Date: May 2008
Location: USA
Device: Galaxy Nexus
Come to think of it, I also had the impression that despite the additional features, in some ways v.9 was not quite as good as v.8, at least on the material I scanned. I seem to remember that the table of contents needed more cleaning up and there were a few more OCR errors. But without slogging through a whole book it's hard to tell - and I wasn't about to do that once I found out I couldn't save it.
Old 05-18-2008, 04:00 PM   #5
RWood
Technogeezer
Posts: 7,233
Karma: 1596436
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
I have often scanned to PDF and then in Adobe Acrobat clipped the top and bottom of the pages to remove the headers and footers before OCR conversion. ABBYY is a great product.
Old 05-20-2008, 06:15 PM   #6
DDHarriman
Guru
Posts: 851
Karma: 1200
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
You are asking more than one question.

First, to the “what’s the best one”:
The market leaders are Omnipage pro (now in its 16th version) and Finereader pro (now in its 9th version).

Both are outstanding and both can take out headers and footers (including page numbers), but… and there is always a but, it has to do with the format in which you save your OCR’ed file.

One possibility is saving in text format; even so, it’s not 100% accurate.
On some projects I have managed to get more than 95% of them out.

If you have the money, go with one of these 2, you will not regret it, and with the OpticBook you have the perfect twin machine to convert books.
Old 05-21-2008, 01:07 AM   #7
pepak
Fanatic
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
Quote:
Originally Posted by DDHarriman
Both are outstanding and both can take out headers and footers (including page numbers), but…
FineReader support confirmed that the only way to remove the headers is through templates, but templates rely on each page having exactly the same layout, which is quite difficult to achieve (there is always a shift of a millimeter or two). I solved the problem by writing a quick and dirty application which searches for the first line of text and overwrites it with white. It worked quite well on my test book.
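
For anyone wanting to try the same trick, here is a minimal Python sketch of the idea described above: find the topmost line of text on a page and paint it white. This is not pepak's actual application, which was not posted. The function name, the list-of-rows image representation, and the threshold value are all assumptions made for illustration.

```python
def blank_first_text_line(page, threshold=128, white=255):
    """Overwrite the topmost band of inked rows with white, in place.

    page is a list of rows, each a list of 0-255 grayscale values
    (an assumed representation, chosen to keep the sketch dependency-free).
    Returns (first_row, end_row_exclusive) of the blanked band, or None
    if the page contains no dark pixels at all.
    """
    def has_ink(row):
        return any(pixel < threshold for pixel in row)

    # Skip the blank top margin.
    top = 0
    while top < len(page) and not has_ink(page[top]):
        top += 1
    if top == len(page):
        return None  # entirely blank page

    # The first text line ends at the next ink-free row.
    bottom = top
    while bottom < len(page) and has_ink(page[bottom]):
        bottom += 1

    for y in range(top, bottom):
        page[y] = [white] * len(page[y])
    return top, bottom


# Tiny made-up page: margin, one header row, gap, one body row, margin.
demo_page = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
    [0, 0, 255, 0],
    [255, 255, 255, 255],
]
result = blank_first_text_line(demo_page)
```

Because it looks for the first band of dark rows rather than a fixed position, a shift of a millimeter or two between pages does not matter. Real scans contain speckle and dark page edges, so a practical version would probably require a minimum number of dark pixels per row before counting it as text.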

Quote:
If you have the money, go with one of these 2, you will not regret it, and with the OpticBook you have the perfect twin machine to convert books.
Thanks. I will see if Omnipage Pro has a demo version, try it, and then decide which one to get.
Old 05-21-2008, 07:59 AM   #8
DDHarriman
Guru
Posts: 851
Karma: 1200
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
One argument is price: Finereader costs between one half and one third as much as Omnipage.
Old 05-21-2008, 08:18 AM   #9
pepak
Fanatic
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
Actually, the home version of Omnipage is a lot cheaper than FineReader (150 USD vs. 199 EUR - or something like that, for some reason I can't open Abbyy's store now).

But since Omnipage apparently doesn't have a trial version, the choice is clear. I'll go with FineReader.
Old 05-21-2008, 09:41 AM   #10
DDHarriman
Guru
Posts: 851
Karma: 1200
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
Yes, but even so, the comparable versions of the products are Omnipage pro and Finereader pro.

Finereader pro: 159 euros, download (price for Western Europe)
Omnipage pro: 499 US$ (international shop), or around 316,76 euros (at today's exchange rate of 1,5753).
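
The conversion above can be checked with a few lines of Python. The prices and the exchange rate are the ones quoted in this post; the variable names are only illustrative.

```python
USD_PER_EUR = 1.5753        # exchange rate quoted in the post
omnipage_pro_usd = 499.0    # international shop price quoted in the post
finereader_pro_eur = 159.0  # download price quoted in the post

# Convert the dollar price to euros, then compare the two products.
omnipage_pro_eur = omnipage_pro_usd / USD_PER_EUR
price_ratio = omnipage_pro_eur / finereader_pro_eur

print(f"Omnipage pro: {omnipage_pro_eur:.2f} EUR")
print(f"Omnipage pro / Finereader pro: {price_ratio:.2f}")
```

At these figures Omnipage pro works out at roughly twice the price of Finereader pro.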

Anyway, I know you will be outstandingly well served by either of the two programs.
Old 05-21-2008, 09:49 AM   #11
pepak
Fanatic
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
The only problem is that I've just learned about FineReader's Activation - something I absolutely hate. I guess I'll have to look for another OCR program, one which might be worse in functionality, but won't bother me with activation.
Old 05-21-2008, 11:28 AM   #12
RWood
Technogeezer
Posts: 7,233
Karma: 1596436
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
ABBYY also has a Scan to Office product that runs $49 (US) that takes the images from your scanner and converts them to text. There is a trial version that you might want to check out.
Old 05-21-2008, 07:39 PM   #13
AJ Starr
Guru
Posts: 814
Karma: 1029784
Join Date: May 2008
Location: Nebraska, USA
Device: PEZ, Color Libre, 2@Sony T1, Onyx i62HD
Scanners..

Years and years ago, right after flatbed scanners hit the market, my husband bought me an HP Scanjet 5300C. When he had asked what I wanted I told him just make sure it had OCR. Well it did and it's great.

After trying several ways to scan, i.e., to PDF, or page by page to my word processor (I use Word Perfect), I found the easiest way was to drag and drop the scanned image, making sure it was in text form, not image form, onto my WP. The scanner uses its OCR to convert before dropping onto my WP, and I put the whole book in one file.

However, when I needed a new printer (about every other year), I bought an all-in-one with scanner (Epson Stylus). Though it does multipage PDFs, it does not do OCR.

So be warned to verify that the scanner has OCR.

Alyson
Old 05-21-2008, 07:57 PM   #14
wayrad
Fanatic
Posts: 547
Karma: 1121392
Join Date: May 2008
Location: USA
Device: Galaxy Nexus
Quote:
Originally Posted by AJ Starr
So be warned to verify that the scanner has OCR.
Yes, even though the OCR software bundled with scanners (scanners don't "have OCR" built in) is often a "lite" version (e.g. Finereader Sprint), having any OCR package is enough to qualify you for upgrade pricing on Finereader Pro.

Last edited by wayrad; 05-22-2008 at 07:30 AM.
Old 05-26-2008, 06:39 AM   #15
Nergal
eBuchReisender
Posts: 41
Karma: 208
Join Date: May 2008
Location: Münster
Device: Palm Tungsten-E, iLiad
For the initial question: I recommend having a look at tesseract ocr - it is an open-source command line tool - with an amazing recognition rate (95-99.9 %, mostly at 98-99% for me). It was developed by HP back in the mid 90's and is now hosted at Google.

http://code.google.com/p/tesseract-ocr/

It is a bit rough to use, but English, German and several other languages are supported so far - I do not know whether the result quality differs between languages.

It has NO layout recognition - give it a simple grayscale TIFF image (no compression) and it'll spit out UTF-8 encoded plain text with line ends.
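
As a rough illustration of that workflow, the sketch below wraps the tesseract command line in Python. It assumes the `tesseract` executable is installed and on the PATH, and that it is invoked as `tesseract <image> <output base> -l <language>`, writing its result to `<output base>.txt`. The helper names `build_tesseract_command` and `ocr_page` are made up for this example and are not part of tesseract.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, lang="eng"):
    """Run tesseract on one page image and return the recognised text.

    Assumes tesseract writes UTF-8 plain text to <output_base>.txt.
    """
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # tesseract appends ".txt" itself
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang), check=True
    )
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_page("page001.tif", lang="deu")` would then return the text of a German page, one line of output per recognised text line.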

At the moment I am writing a little Python/Qt tool to create eBooks from my paperbacks, which runs quite well with my Epson flatbed USB scanner.

Have a look at the post in my blog (it's in German, but simply click on the download link in the post if you cannot understand the text). I have not had time to write a manual yet, and some options are still missing in the GUI. Have a look at line 106 of the bookscan.py file: by default it will scan 2 pages from a book (I recommend Reclam or Penguin books for the first tests, since no rotation is implemented yet), so set the maxpages value to half the number of book pages you want.

With a preview the app is horribly slow - if you really want to scan a whole book, have the appropriate values for the area to be scanned at hand in mm. So far the two pages are simply separated from each other by saving the right and the left half of the scanned image into separate files.
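
The page-splitting step described above might look something like the following. This is only a sketch of the idea, not code taken from bookscan.py; the function name and the list-of-rows image representation are assumptions.

```python
def split_spread(scan):
    """Split a two-page scan into (left_page, right_page) at the middle column.

    scan is a list of pixel rows. For an odd width the extra column
    goes to the right-hand page.
    """
    if not scan:
        return [], []
    middle = len(scan[0]) // 2
    left_page = [row[:middle] for row in scan]
    right_page = [row[middle:] for row in scan]
    return left_page, right_page


# Made-up 2x4 "scan": each half becomes its own page.
left_page, right_page = split_spread([[1, 2, 3, 4], [5, 6, 7, 8]])
```

Cutting at the exact middle only works if the book's spine sits in the centre of the scanned area, which is one reason to have the scan region measured out in mm beforehand.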

Huh ... send me an email (nergal[ät]monasteriaobscura[punkt]de) if anything is weird.

The version is, well something below 0.1a

Cheers,
Nergal
All times are GMT -4.