#1 | ||
Obsessively Dedicated...
Posts: 3,211
Karma: 34984682
Join Date: May 2011
Location: JAPAN (US expatriate)
Device: Sony PRS-T2, ADE on PC
|
Bad OCR... When spellcheck won't help
For some recent stories I worked on, I had to use a free OCR service, and it gave results like these:
The first was very tightly kerned, and the OCR got all the characters right, but not the word breaks. Quote:
Quote:
ABBYY FineReader is not in my future, sorry to say. Is there any free software that might give better results than I got here? Regex is absolutely my weak spot, but does anyone have any suggestions for the next time I run up against this type of situation?
Last edited by GrannyGrump; 09-19-2015 at 10:12 AM. |
||
#2 |
a toy panda
Posts: 2,568
Karma: 26020474
Join Date: Mar 2014
Location: Onboard the Queen Anne's Revenge
Device: Various Android devices
|
If you have MS Office, the edition that includes OneNote, you can use it for OCR. In my attempts I had about 80-85% success, but it depends on the quality of the source.
|
#3 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
I use tesseract, which gives decent to very good results if the scans are half-decent. ABBYY gives appreciably better results, admittedly, but I haven't found anything better that's free. I use regexps quite a lot for initial OCR cleanup; I'll see if I can find a list of standard expressions somewhere. Another trick I often use is word frequency: words that occur only once or twice are often suspect. But I'd be stumped by output like the samples you show. May I ask what OCR you used?
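The word-frequency trick mentioned above could look something like this in Python. This is only a minimal sketch, not the poster's actual workflow: the function name `rare_words`, the `max_count` threshold, and the sample text are all made up for illustration.

```python
import re
from collections import Counter

def rare_words(text, max_count=2):
    """Return words that occur at most max_count times, rarest first."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones such as æ, ø, å.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    rare = [w for w, c in counts.items() if c <= max_count]
    return sorted(rare, key=lambda w: (counts[w], w))

# Made-up sample: "rnat" is a typical OCR misreading of "mat".
sample = ("the cat sat on the mat and the cat saw the rnat "
          "the dog sat on the mat and the dog saw the cat")
print(rare_words(sample, max_count=1))  # ['rnat']
```

On a whole book the list will also contain plenty of legitimate rare words and names, so it is a list of candidates to eyeball rather than a list of errors.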
|
#4 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
These are the things I normally look for in tesseract output:
'fi' and 'fl' often get mixed up. Still haven't found the best pattern here, but /fi[oaie]/ definitely catches a few dodgy ones.
/[a-zA-Z][0-9]/ : a letter followed by a digit is pretty dodgy, though normally this only happens with 0 and 1.
/[a-z][A-Z]/ : a lower-case followed by an upper-case
/ [,.?!:]/ : Whitespace before punctuation.
/\b[bcdfhjklmnopqrstuvwxyz]\b/ : Single-letter words, excluding a, e.g., i.e.
/\b\(tl\|nr\|rr\)/ : Impossible beginnings to English words. Any linguists care to expand?
Last edited by SBT; 09-19-2015 at 05:23 PM. |
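For anyone who would rather run all of these checks in one pass, here is a rough Python sketch. The patterns are the ones listed above; the only change is that the escaped group `\(tl\|nr\|rr\)` (grep/sed style) is written as `(tl|nr|rr)` for Python's `re` module. The names `SUSPECT_PATTERNS` and `flag_suspects`, and the sample text, are invented for this example.

```python
import re

# Patterns from the post, in Python `re` syntax. The labels are informal descriptions.
SUSPECT_PATTERNS = {
    "fi + vowel (possible fi/fl mix-up)": re.compile(r"fi[oaie]"),
    "letter followed by digit": re.compile(r"[a-zA-Z][0-9]"),
    "lower-case followed by upper-case": re.compile(r"[a-z][A-Z]"),
    "whitespace before punctuation": re.compile(r" [,.?!:]"),
    "single-letter word": re.compile(r"\b[bcdfhjklmnopqrstuvwxyz]\b"),
    "impossible word beginning": re.compile(r"\b(tl|nr|rr)"),
}

def flag_suspects(text):
    """Return (line number, label, matched text) for every suspicious spot."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, label, match.group(0)))
    return hits

# Made-up OCR-style sample text.
sample = "He ran acr0ss the fioor , and tlien stopped.\nThe oldMan had l dog."
for lineno, label, found in flag_suspects(sample):
    print(f"line {lineno}: {label}: {found!r}")
```

The hits are only candidates for manual review: `/fi[oaie]/` will also match perfectly good words, which is presumably why the post calls it a pattern that "catches a few dodgy ones" rather than a fix.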
#5 |
Obsessively Dedicated...
Posts: 3,211
Karma: 34984682
Join Date: May 2011
Location: JAPAN (US expatriate)
Device: Sony PRS-T2, ADE on PC
|
@Panda, I don't have Office, but maybe I can use a friend's computer next time to try the OneNote version.
@SBT -- I used the OCR that is included in the PDF-XChange reader -- since it is freeware, it might even use the Tesseract engine. I will hunt down and try Tesseract the next time I hit one of these. It couldn't give me worse output than this, I think. Thank you so much for the regex patterns. I bet they will be useful when I am dealing with OCR from archive.org. Thanks again to you both. |
#6 |
Imperfect Perfectionist
Posts: 635
Karma: 863576
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
There actually exist tools to split text like the first sample into legible words (!), but they seem to be very specialized and out of the league of ordinary people like you and me (I haven't tried any), e.g. wordsplit or wordsegment.
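To give an idea of what such a word-splitting tool does, here is a toy Python sketch. It is not how wordsplit or wordsegment are implemented (as far as I know, real tools score candidate splits with word frequencies from a large corpus); it just picks the split that uses the fewest dictionary words. The tiny `VOCAB` set and the function name `segment` are made up for the example.

```python
# Hypothetical mini-dictionary; a real run would load a full word list.
VOCAB = {"the", "first", "was", "very", "tightly", "kerned", "a", "an", "on", "in"}

def segment(text, vocab):
    """Split text into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if no complete split exists.
    """
    max_len = max(len(w) for w in vocab)
    # best[i] is the shortest word list covering text[:i], or None if there is none.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

print(segment("thefirstwasverytightlykerned", VOCAB))
# ['the', 'first', 'was', 'very', 'tightly', 'kerned']
```

With a real dictionary there will be ambiguous cases where several splits are valid, which is where frequency-based scoring earns its keep.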
Regarding the second one, you might get something from the PepitoCleaner extension (general cleanup) and the Linguist extension (finding the most common misspellings, so they can be corrected by find-and-replace), if you have LibreOffice. (Don't let the silly logo of PepitoCleaner put you off: it's actually a very good tool, and you can also put in your own regexes if need be.)
Regards Kim |
#7 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
@elibrarian: Thanks for the tip on the LibreOffice extensions.
Meanwhile, I've started looking into a particular problem of mine. "My" ebooks are mostly 19th-century Norwegian books, using spelling and grammar that's somewhere half-way between Danish and modern Norwegian, meaning I can't use spell-checkers, because no pre-1907 Norwegian ispell dictionary exists. However, a lot of properly proof-read digital 19th c. Norwegian texts exist (>10,000 pages). I came across this 21-line(!) spell-checker at norvig.com, the site more famous as the origin of the PowerPoint version of the Gettysburg Address.
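The idea behind that kind of spell-checker is to build the "dictionary" by counting words in trusted text, then propose the most frequent known word within a small edit distance. Below is a cut-down Python sketch in that spirit; it is not the code from norvig.com (which, for one thing, also considers words two edits away). The alphabet with æ, ø and å, the function names, and the one-line stand-in corpus are assumptions for illustration; in practice the counts would come from the proof-read texts.

```python
import re
from collections import Counter

# Assumed alphabet: a-z plus the Norwegian letters.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"

def train(corpus_text):
    """Count word occurrences in trusted (proof-read) text."""
    return Counter(re.findall(r"[^\W\d_]+", corpus_text.lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Tiny made-up stand-in for a proof-read corpus.
counts = train("hvad er det hvad vil du det er godt")
print(correct("hvnd", counts))  # hvad
```

Because the word list is learned from period texts, old spellings are treated as correct simply because they occur in the training material.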
#8 |
Junior Member
Posts: 7
Karma: 380010
Join Date: Sep 2015
Location: New York
Device: none
|
Output quality depends on input quality, i.e. input images should have a resolution of at least 600 ppi and be properly scanned and cleaned. I would recommend Tesseract OCR as an alternative.
|
#9 |
Addict
Posts: 238
Karma: 1500000
Join Date: Nov 2009
Location: Toronto
Device: Pandigital Novel (Black), T-2 and 3, Nexus 7
|
@Granny:
Do you know someone with a scanner? Most scanners come with an OCR program, most likely ABBYY Sprint. My Epson V500 came with Sprint 6.0 and the DS-30 came with Sprint 9.0. Both work well, but in my experience they work better with Windows XP than with Windows 7. They do work well if you have Windows 7 Pro and run them as XP programs. And there is ABBYY version 4 that was on the cover disc for an issue of PC Plus in the early 2000s. |
#10 | |
Fuzzball, the purple cat
Posts: 1,299
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
#11 | |
a toy panda
Posts: 2,568
Karma: 26020474
Join Date: Mar 2014
Location: Onboard the Queen Anne's Revenge
Device: Various Android devices
|
Quote:
|
|
#12 |
Fuzzball, the purple cat
Posts: 1,299
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
|