04-27-2011, 11:47 AM | #1 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
Best OCR program
I am considering buying an OCR program for converting my books to epubs, and have come up with two alternatives: OmniPage 17 and Abbyy Fine Reader 10. Does anyone have tips on which one to go for? I'm aiming for accuracy: the less time spent on formatting files before compilation to epub, the more time for proofreading. The fewer errors found during proofreading, the less time spent correcting them.
I am going to try both before I eventually buy one of them, but was wondering which program is used "out there" by the rest of you who are more experienced, and whether anybody has any recommendations or ideas about what I should be looking for to get the best possible result from my scanner and my books. Last edited by Iznogood; 04-27-2011 at 11:53 AM. Reason: Fixing typo |
04-27-2011, 02:34 PM | #2 |
Guru
Posts: 860
Karma: 4380
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
|
Hello
You will have people who say one and people who say the other. I personally think both are really good. Still, there is a very good reason to choose Finereader over Omnipage - Finereader is much cheaper! Also, if you got Finereader with a scanner (even FineReader Sprint 4.0 qualifies) you can upgrade it online for 99 euros (if you are in the EU). This is the reason I’m pointing most of my clients to Finereader. But you have decided well: test both and decide for yourself. About tips… I advise you to first try the forum, as it has plenty of tips and advice. Best regards, |
04-27-2011, 02:42 PM | #3 |
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
FineReader seems to be much more popular with the book scanning community.
On those rare occasions when I need to use OCR I use Readiris at work (at home I run Linux, and Open Source solutions aren't quite there yet). We purchased a cheap scanner/printer/copier/fax combo from HP and the Readiris software was thrown in at no extra charge. A long time ago I used Recognita, which was very good for its time and also came bundled with a Umax scanner. So try to look around. Perhaps you could get a good deal buying some bundle. |
04-27-2011, 06:24 PM | #4 |
Addict
Posts: 300
Karma: 1006538
Join Date: Jul 2008
Device: Kindle Paperwhite (11th Gen)
|
Just as a side note, if you have access to MS Office 2007 you could just load up the OCR portion only and use that to scan tiffs etc.
It works VERY well. I use it a lot; that combined with a program called tiff joiner works wonders. I scan each page as a tiff, use joiner to make one large image (one note though: copy the images to a new directory first, as tiff joiner, for me at least, removes the original images when it creates the new combined file, so work with copies), then load it into the MS Office OCR and presto chango... you're set. |
04-28-2011, 09:55 AM | #5 |
Chocolate Grasshopper ...
Posts: 27,600
Karma: 20821184
Join Date: Mar 2008
Location: Scotland
Device: Muse HD , Cybook Gen3 , Pocketbook 302 (Black) , Nexus 10: wife has PW
|
I use a free version of omnipage that came with printer scanner, years ago - works well within the limitations of the 'freeness' - the paid for version ought to be OK.
|
04-29-2011, 04:02 AM | #6 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
There was an article on Lifehacker a few months ago on which OCR program is the best and ABBYY FineReader had the most votes. I agree, but no OCR program is perfect. Far from it. Especially if the printed material was of poor quality and inconsistent throughout the book, you could end up with entire pages of bold text...
From what I've experienced, FineReader has some issues with Romanian (like not detecting the capital "î", wrong quote marks, etc), but those can be fixed either while proof-reading the whole thing or by doing a batch replace. For instance replacing:
". î" with ". Î"
"? î" with "? Î"
"! î" with "! Î"
But it's not always a good idea. Sometimes it will mess up the indentation. For instance, if a comma (,) was mistaken for a period (.), the whole phrase would be split and could take on a whole different meaning. If you find recurring inconsistencies with certain characters in your language, you should do a manual search and selectively replace them instead of doing a batch replace. |
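A minimal Python sketch of the replacement described above, assuming the OCR output has been exported to plain text. The function names are made up for illustration and this is not a FineReader feature. The second helper only lists the matches with some context, which fits the advice to review candidates by hand rather than replace blindly.

```python
import re

# Sentence-ending punctuation, whitespace, then a lowercase "î".
PATTERN = re.compile(r'([.?!]\s+)î')

def capitalize_after_sentence_end(text: str) -> str:
    """Batch version: turn '. î' into '. Î' (same for '?' and '!')."""
    return PATTERN.sub(lambda m: m.group(1) + 'Î', text)

def list_candidates(text: str, width: int = 20) -> list[str]:
    """Review version: return each match with surrounding context, unchanged."""
    return [text[max(0, m.start() - width): m.end() + width]
            for m in PATTERN.finditer(text)]

sample = "A venit. îmi place? în fine! îți spun, în casă."
print(capitalize_after_sentence_end(sample))
for snippet in list_candidates(sample):
    print(repr(snippet))
```

The batch version still capitalizes after a period that should have been a comma, so the listing helper is probably the safer first pass.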
04-30-2011, 05:43 AM | #7 |
Enthusiast
Posts: 49
Karma: 14
Join Date: Jul 2010
Location: Harrogate, England
Device: iPad
|
I looked at OmniPage and FineReader. I selected FineReader after some informal testing which indicated OmniPage was more likely to end up with pages of incorrect bold.
Since then I've also seen this effect in FineReader so my coarse cut may have been too coarse! I've converted about 450 books with FineReader and I've (re-)read/proofed maybe 70 of them. My impression from this (I've not yet formalised it) is that FineReader has a few systematic errors:
- It often sees tl as d (but this depends on the book).
- It struggles with whether a symbol is I or 1 (eye or one).
- It is (relatively) poor at getting paragraph breaks right.
- It is (relatively) poor at de-hyphenating words on line breaks.
- It struggles more with books where the paper has darkened (poor contrast ratio).
- It can get the formatting confused - I think this mainly happens when the page is scanned at an angle, something which is hard to eliminate.
- It has trouble with italics and especially exclamation marks.
- Punctuation is a bit dodgy. In particular, quote marks are probably often missed out. I can't say that I notice this a huge amount since it is usually very clear from context what is happening. However, when you get a long dialogue where the speaker change is only indicated by the quotation marks, this can be a bit troublesome.
Having said all that, for probably 9 books out of 10 the conversion is sufficiently close to perfect that I need to be in nit-picky mode to find errors - there might be a mistaken character every ten pages or so. (Paragraph and hyphenation errors are more common than this, but I've preprocessed the FineReader output to correct most of these.) There's probably half a dozen books that are a struggle to read. In those I've investigated so far, the problem seems associated with poor contrast ratio. For example, I have the Heris Serrano series. The first few are barely readable. The last few are nearly perfect. They are all from the same house with the same basic layout, font and size. The difference is that the first batch are quite seriously browned.
I'm hoping to do a broader comparison of the main OCR players sometime in the next few months and will post my results! Iain |
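The post does not say how the FineReader output was preprocessed, so the following is only a guess at what such a step could look like. It is a Python sketch assuming plain-text output where each printed line ends in a newline and paragraphs are separated by a blank line. Both function names are hypothetical.

```python
import re

def dehyphenate(text: str) -> str:
    """Join words split across a line break: 'conver-\\nsion' -> 'conversion'."""
    # Only lowercase letters on both sides, to leave dashes and names alone.
    return re.sub(r'([a-z])-\n([a-z])', r'\1\2', text)

def join_wrapped_lines(text: str) -> str:
    """Turn single newlines into spaces; keep blank lines as paragraph breaks."""
    return re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

raw = "The conver-\nsion was close\nto perfect.\n\nNext paragraph."
print(join_wrapped_lines(dehyphenate(raw)))
```

A rule this simple also joins genuine compounds that happen to break at the hyphen ("well-\nknown" becomes "wellknown"), so a dictionary check or a manual pass would still be needed.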
05-02-2011, 12:15 AM | #8 |
Plugh
Posts: 15
Karma: 100000
Join Date: May 2011
Location: Seattle
Device: Pandigital and Kindle3
|
OmniPage 17 does excellent recognition, probably better than FineReader.
But when it comes to scanning and reading books, FineReader 10 Pro is the no-brainer choice. (Well, that's just my opinion.) With an Epson GT-S50 (duplex and ADF) plugged into FR10 I can scan a 200 page book in about ten minutes. |
05-16-2011, 11:28 PM | #9 |
Fanatic
Posts: 527
Karma: 1048576
Join Date: May 2009
Device: bebook; prs-950; nook simple touch; HTC Jetstream tablet
|
radiotales,
200 pages in 10 minutes?? I looked at a web page image of the Epson GT-S50 and it looks like it is not a flatbed scanner. Are you disassembling the book first? |
05-17-2011, 05:51 AM | #10 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Probably. Looks like a sheet feeder, so yeah.
You essentially strip the book spine with a saw and feed each page twice to scan both sides. It ruins the book but you also get straighter scans at a faster speed than using a flatbed scanner. If you're thinking of putting the book back the way it was... it will be very tricky. You could use some kind of resin but it won't be the same. Oh, and it doesn't help OCR-ing as much as you'd think... Well, maybe if you're using a cheap scanner that can't see well between the pages. |
06-07-2011, 06:28 AM | #11 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
Thanks for all tips and advice. I have done some informal testing with both ABBYY Finereader 10 (evaluation program) and Omnipage Pro 17 (purchased, refundable if not satisfied).
Setup: Scanned one chapter in each of three different books: one Norwegian, one English and one English with a lot of ñ, á, ó, italics and so on to test detection of these. I do not have the heart to cut the books free from the spine, and have only a flatbed scanner, so the scanned images are of course far from perfect. Each image file contains two pages of the book to save time scanning.
I loaded the test chapters in ABBYY and Omnipage, and let each program chew on them until they spat out some text files. ABBYY was instructed to split dual pages, preprocess and deskew the image and to detect page orientation. Omnipage did not have these options in load mode. I do not know whether these are defaults in Omnipage, but the images were split correctly, so I believe that both programs performed the same operations. Text recognition was done in automatic mode in both programs. Since a book consists of several hundred scans, I do not want to draw zones and do image processing on single images. I just pressed the "perform OCR" button. Neither program was allowed training. Since I had only one chapter of each book, training would not have resulted in anything.
Results: As has been pointed out, both these programs are very accurate, but I found that ABBYY was the most accurate in this case. It did not detect á, é and ò, but it detected emphasized text pretty well (the same goes for Omnipage btw), and ABBYY had fewer errors overall (I did not count errors, but there were significantly fewer in ABBYY). ABBYY also came out best at flagging possible errors. When it came to proofreading, I find ABBYY's layout more appealing than Omnipage's (but I guess that is a matter of preference, and not part of OCR testing). I also liked ABBYY's automatic scanning feature (i.e. I tell the program to scan a page, it waits X seconds on completion while I turn the page in the book and place it on the scanner, and the program then scans the next page without me telling it to), so I will be asking for a refund for Omnipage and buying a license for ABBYY instead.
Note to report: Since the trial version of ABBYY is limited, not all the features these programs have were tested, and I was also not able to check how well ABBYY performed on page breaks. In my case it doesn't matter, because I want to edit page breaks manually (I use a script to insert <span class="newpage" id="pageXXX"/> where XXX is the page number on all page breaks) |
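The script itself is not shown in the thread. A rough Python sketch of the idea might look like the following, under the assumption that the exported text marks each page break with a form feed character. Both that marker and the function name are illustrative guesses, not the poster's method.

```python
PAGE_BREAK = "\f"  # assumption: the export separates pages with a form feed

def insert_page_markers(text: str, first_page: int = 1) -> str:
    """Replace each page break with a span naming the page that starts there."""
    pages = text.split(PAGE_BREAK)
    out = [pages[0]]
    for offset, page in enumerate(pages[1:], start=1):
        out.append(f'<span class="newpage" id="page{first_page + offset}"/>')
        out.append(page)
    return "".join(out)

print(insert_page_markers("end of page 7\fstart of page 8\fstart of page 9",
                          first_page=7))
```

Here `first_page` is the printed number of the first page in the chunk, so the ids can follow the book's own pagination.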
06-07-2011, 07:06 AM | #12 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
My next post is a little bit off topic, but I will take my chances on hijacking my own thread to ask a few questions about the step after OCR, namely proofreading. If the moderators find this inappropriate, I ask for forgiveness and agree that this may be moved to a new thread.
I like ABBYY's interface for proofreading, where I can put the cursor in the text document, and the program highlights the corresponding character in the scanned image so that I can easily check whether ABBYY recognized that character correctly (image attached). I have a few books that I really, really want to have in digital form, and I am willing to do a good deal of work to get the results as near to perfect as possible for these books. First book out is Death-Watch, which is done as an experimental book. My current workflow:
My goal is to get the results as perfect and as close to the original as possible. Even the typos of the original and the page breaks (span class="newpage" with property display:none) are lovingly preserved. And, as a nitpicker, I also require good markup, and no computer tool is good at applying proper tagging. ABBYY recognizes all headlines as <p>. To get them stored as <hx>, I have to do that manually in the editor, or do it manually in the html file. I prefer the latter.
Nevertheless, I find steps 1 to 9 above very time consuming. Steps 6 to 8 in particular are unnecessarily redundant, as far as I can see. First, I proofread the text, and thereafter I have to go through it all again in the HTML file to ensure proper markup and formatting. I only do step 6 because it is much easier to do it in ABBYY than to have two open windows on my desktop and manually try to compare the HTML code and the source image. More errors escape me that way.
I would like to do this in one step: export directly to HTML, split the html code into chapters, create .ncx and .opf files, cover files and so on, and thereafter proofread the html code and compare it to the source image. Does anybody know of a program that can keep the source image (either as images or as searchable PDF files with recognized text under the image) in sync, so that I don't have to read the html code and forget my position in it before I find the corresponding position in the source and can compare them? The same feature as ABBYY's proofreader would be nice: highlighting of the character or word currently marked in the code. ABBYY would be perfect for proofreading if it allowed changing markup and regex search and replace, but I fear it is a waste of time to proofread before the final version, i.e. before quotations, markup and so on are corrected. Anybody with tips? How do you professionals perform proofreading? Last edited by Iznogood; 06-07-2011 at 07:36 AM. Reason: Typo plus missing attachment |
06-09-2011, 04:12 AM | #13 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Scanning tips:
Pages: 300 dpi (grayscale), 85-90% JPG quality for a much lower filesize (250-450 MB, depending on the book).
Covers: 600 dpi (colour), TIFF or PNG, because you'll probably want to crop, deskew it, increase/decrease saturation, etc., so it's best if you have something good to work with.
Scanning the pages at 300 dpi instead of 600 will go much faster and, believe it or not, it's actually better to scan at a lower resolution, because the higher it is, the more chances there are that the OCR process will see various imperfections as commas (,) or add periods (.) where there aren't any just because the book had a small printing smudge somewhere. No OCR solution has provided a 100% accurate output - and it probably never will as long as there are printing flaws in basically any book (except maybe for the recent prints). You see, statistically, the more pages there are in a book, the more chance there is that there's going to be at least one flaw in one paragraph somewhere in the book. So in order to provide a pleasant reading experience, proofreading is essential.
Do an initial "grunt work" sweep in FineReader and correct any issues as you go along. Batch replace quotation marks (but never batch replace commas and periods), then do a second "pleasure" proofreading for the (semi)final version. If you find anything out of place you can highlight it inside the reader for a future source review.
If you'd like to match it against a scanned image, you could try setting the window transparency and overlapping the windows to spot the difference. Nvidia drivers can make a window transparent (don't know about ATI drivers) using a combination of a hotkey+mouse scroll. Or you could use software such as Actual Window Manager or just Actual Transparent Window: http://www.actualtools.com/products/ (shareware). I once used this method to match line spacing for a PDF document. Don't know if you'll get the same result with an ePub, given its free-flow nature...
Works best for PDF that uses the original print line breaks and the exact same font. Last edited by DSpider; 06-09-2011 at 04:14 AM. |
06-09-2011, 04:15 AM | #14 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
For white balance and pre-processing, you could use Scan Tailor, though I haven't really felt a need for it. Maybe if you output straight to (Exact Layout) PDF. I usually export to .docx and edit the content and margins, etc., in Word 2010. From there you could export to HTML or, my personal favourite, PDF.
Regarding quotation marks, Ctrl+H brings up the Search and Replace menu. You should give it a shot. For example, in Romanian only one quotation mark seems to always be off, no matter the typeface. So I batch replace " (Shift + ') with „ or ” (I always forget which). PS: Sorry for the double post. I figured it looks better this way instead of a single wall of text. |
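For text already exported out of FineReader, a variation on this batch replace can be sketched in Python. Note that this is a different technique from the single Ctrl+H replace described above: it alternates between „ (opening) and ” (closing) for each straight quote in a paragraph, on the assumption that quotes are balanced within the paragraph. The function name is hypothetical.

```python
def romanian_quotes(paragraph: str) -> str:
    """Replace straight double quotes with „ and ” in alternation."""
    out = []
    opening = True  # the first straight quote in a paragraph opens a quotation
    for ch in paragraph:
        if ch == '"':
            out.append('„' if opening else '”')
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)

print(romanian_quotes('El a spus "da" și "nu".'))
```

If OCR dropped one quote mark in a paragraph, every later mark in that paragraph flips, so paragraphs with an odd number of quotes are worth checking by hand.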
06-23-2011, 08:47 PM | #15 | |
Banned
Posts: 242
Karma: 51054
Join Date: Jun 2011
Location: Belleville, IL
Device: Kindle-3
|