Bad OCR... When spellcheck won't help

GrannyGrump · 09-19-2015, 11:09 AM

For some recent stories I worked on, I had to use a free OCR service, and it gave results like these:

The first was very tightly kerned, and the OCR got all the characters right, but not the word breaks.

Quote:

Thenthroughanotherkitchen,
where redrustwasmaking itsfull
mealof a comparativelymodernrange.
Then into the greathall wherethe
old armorandthebuff-coatsandround
capshungonthewalls,andwherethe
carvedstonestaircasesran at eachside

Another was at least 75% gibberish.

Quote:

and tho enorrnous vuoee of solid silver, we
heavy for l1i1n to 1il't—r_-.ve1r these were hie-
hrrd lrc not found tlrer|r—he, by his own skill
rr,||r_i tunnirrg P He went about In Llro rooms,
wurrlriirg one after the other the beautiful,
mre things. Hr: oun:ssr:1.l the gold and the
jrrwele. He tlrrerr his nrms round the great
silver vnsos; he wound round lrirnself tho
l'iun\"j rod velvet of the crltlnllr Wlroro tllo
grithrrsgleaured in embossed goldnrnd shi ‘~'IJ1‘Ei.l

I had to pretty much manually go through the first and force word breaks, and manually transcribe the second.

ABBYY FineReader is not in my future, sorry to say. Is there any free software that might give better results than I got here?

Regex is absolutely my weak spot, but does anyone have any suggestions for the next time I run up against this type of situation?

PandathePanda · 09-19-2015, 01:26 PM

If you have MS Office, the version that includes OneNote, you can use it for OCR. In my attempts I had about 80-85% success, but it depends on the quality of the scan.

SBT · 09-19-2015, 06:08 PM

I use Tesseract, which gives decent to very good results if the scans are half-decent. ABBYY gives appreciably better results, admittedly, but I haven't found anything better that's free. I use regexps quite a lot for initial OCR cleanup; I'll see if I can find a list of standard expressions somewhere. Another trick I often use is word frequency: words that occur only once or twice are pretty often suspect. But I'd be stumped by output like the samples you show. May I ask what OCR you used?
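The word-frequency trick can be sketched in a few lines of Python. This is only an illustration: the function name `rare_words`, the `max_count` threshold and the sample string are made up here, and on a real book you would feed in the whole OCR text.

```python
import re
from collections import Counter

def rare_words(text, max_count=2):
    """Return words that occur at most max_count times, sorted alphabetically."""
    # Lowercase, then pick out runs of letters/digits (apostrophes allowed inside).
    words = re.findall(r"[^\W_]+(?:'[^\W_]+)*", text.lower())
    counts = Counter(words)
    return sorted(w for w, n in counts.items() if n <= max_count)

# Hypothetical sample; "tlre" is the kind of one-off the list should surface.
sample = "the cat sat on the mat and the cat saw tlre mat"
print(rare_words(sample, max_count=1))
```

On a short sample most words are rare, so the list is only useful on book-length text, where genuine words tend to repeat and OCR debris mostly does not.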

SBT · 09-19-2015, 06:21 PM

These are the things I normally look for in tesseract output:
'fi' and 'fl' often get mixed up. I still haven't found the best pattern here, but
/fi[oaie]/ definitely catches a few dodgy ones.
/[a-zA-Z][0-9]/ : A letter followed by a digit is pretty dodgy, though normally this only happens with 0 and 1.
/[a-z][A-Z]/ : A lower-case letter followed by an upper-case letter.
/ [,.?!:]/ : Whitespace before punctuation.
/\b[bcdfhjklmnopqrstuvwxyz]\b/ : Single-letter words, excluding a, e, g and i (so "a", "e.g." and "i.e." are not flagged).
/\b\(tl\|nr\|rr\)/ : Impossible beginnings of English words. Any linguists care to expand?
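For anyone who would rather run these checks as a script than type them into an editor, here is a minimal Python sketch. The patterns are the ones listed above, translated to Python's `re` syntax (the last one is written grep-style in the list); the labels, the `flag_suspects` name and the sample text are my own additions.

```python
import re

# The patterns from the list above in Python `re` syntax. Labels are illustrative.
SUSPECT_PATTERNS = [
    ("fi before a vowel (possible fl)", re.compile(r"fi[oaie]")),
    ("letter followed by digit", re.compile(r"[a-zA-Z][0-9]")),
    ("lower-case then upper-case", re.compile(r"[a-z][A-Z]")),
    ("space before punctuation", re.compile(r" [,.?!:]")),
    ("odd single-letter word", re.compile(r"\b[bcdfhjklmnopqrstuvwxyz]\b")),
    ("impossible word start", re.compile(r"\b(?:tl|nr|rr)")),
]

def flag_suspects(text):
    """Yield (line_number, label, matched_text) for every pattern hit."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS:
            for match in pattern.finditer(line):
                yield lineno, label, match.group(0)

# Hypothetical sample loosely based on the gibberish quoted earlier in the thread.
sample = "He tlrerr his nrms round the great\nsilver vnsos , he wound round l1im"
for hit in flag_suspects(sample):
    print(hit)
```

The script only reports hits; deciding what each one should have been is still a manual job.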

GrannyGrump · 09-20-2015, 01:02 AM

@Panda, I don't have Office, but maybe can use a friend's computer next time to try the OneNote version.

@SBT -- I used the OCR that is included in PDF-XChange reader -- since it is freeware, it might even use the Tesseract engine. I will hunt down and try Tesseract for the next time I hit one of these. It couldn't give me worse output than this, I think.
Thank you so much for the Regex patterns. I bet they will be useful when I am dealing with OCR from archive.org.

Thanks again to you both.

elibrarian · 09-20-2015, 05:32 AM

There actually exist tools to split texts like the first sample into legible words (!), but they seem to be very specialized and out of the league of ordinary people like you and me (I haven't tried any), e.g. wordsplit or wordsegment.
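The basic idea behind this kind of tool is less exotic than it sounds. The following is a rough Python sketch of dictionary-based splitting using dynamic programming; it is not the code or API of wordsplit or wordsegment, and the `segment` function and the tiny `VOCAB` set are invented for illustration. A real run would load a full word list, ideally with frequencies.

```python
def segment(text, vocabulary, max_word_len=20):
    """Split text into the fewest vocabulary words; return None if impossible."""
    # best[i] holds the shortest list of words that spells out text[:i].
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

# Tiny illustrative vocabulary; a real one would come from a dictionary file
# or a frequency list built from clean text.
VOCAB = {"then", "the", "through", "another", "an", "other", "kitchen",
         "red", "rust", "was", "making", "its", "full", "a", "no", "he"}

print(segment("thenthroughanotherkitchen", VOCAB))
print(segment("redrustwasmaking", VOCAB))
```

Preferring the fewest words is a crude tie-breaker (it picks "another" over "an other"); tools built on word frequencies make that choice more sensibly.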

Regarding the second one you might get something from the PepitoCleaner extension (general cleanup) and the Linguist extension (finding the most common misspellings, so they can be corrected by find-and-replace) - if you have LibreOffice.

(Don't let the silly logo of Pepito Cleaner put you off. It's actually a very good tool, and you can also put in your own regexes if need be.)

Regards

Kim

SBT · 09-20-2015, 08:40 AM

@elibrarian: Thanks for the tip on the LibreOffice extensions.

Meanwhile, I've started looking into a particular problem of mine.
"My" ebooks are mostly 19th-century Norwegian books, using spelling and grammar that's somewhere half-way between Danish and modern Norwegian, meaning I can't use spell-checkers, because no pre-1907 Norwegian ispell dictionary exists. However, a lot of properly proofread digital 19th-century Norwegian text exists (>10,000 pages).

I came across this 21-line(!) spell-checker at norvig.com, the site more famous as the origin of the PowerPoint version of the Gettysburg Address. Based on a huge reference text, it checks the spelling of a word. Though oriented towards human errors like transposition, it should be possible to tweak it to look for typical OCR mistakes, like 'f00lish' and 'junip'.
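As a rough sketch of what such a tweak might look like in Python: instead of generating human-style edits, generate candidates by undoing typical OCR confusions and keep the one that is most frequent in the reference text. This is not the norvig.com code; the `CONFUSIONS` table, the function names and the one-line reference text are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical OCR confusion pairs (misread -> intended); extend as needed.
CONFUSIONS = [("0", "o"), ("1", "l"), ("1", "i"), ("rn", "m"), ("ni", "m"),
              ("vv", "w"), ("c", "e"), ("h", "b")]

def build_counts(reference_text):
    """Count word frequencies in a proof-read reference text."""
    return Counter(re.findall(r"[^\W\d_]+", reference_text.lower()))

def variants(word):
    """All strings obtained by undoing one confusion at one position."""
    out = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            out.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return out

def correct(word, counts, max_edits=2):
    """Most frequent known word reachable by undoing up to max_edits confusions."""
    if word in counts:
        return word
    frontier = {word}
    for _ in range(max_edits):
        frontier = set().union(*(variants(w) for w in frontier))
        known = [w for w in frontier if w in counts]
        if known:
            return max(known, key=counts.__getitem__)
    return word  # no plausible correction found; leave it for a human

# Stand-in for the large proof-read corpus mentioned above.
counts = build_counts("a foolish man will jump and jump again")
print(correct("f00lish", counts))  # two substitutions: 0 -> o, twice
print(correct("junip", counts))    # one substitution: ni -> m
```

Because the reference text supplies the vocabulary, the same approach should work for any language or period for which clean text exists, which is the point of the idea above.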

Kennth · 10-19-2015, 08:51 AM

Output quality depends on input quality, i.e. the input images should have a resolution of at least 600 ppi and be properly scanned and cleaned. I would recommend Tesseract OCR as an alternative.

grumbles · 10-20-2015, 02:18 AM

@Granny:

Do you know someone with a scanner? Most scanners come with an OCR program, most likely ABBYY Sprint. My Epson V500 came with Sprint 6.0 and the DS-30 came with Sprint 9.0. Both work well, but in my experience they work better with Windows XP than with Windows 7. They do work well if you have W7 Pro and run them as XP programs. And there is ABBYY version 4, which was on the cover disc for an issue of PC Plus in the early 2000s.

willus · 10-21-2015, 11:04 PM

Quote:

Originally Posted by GrannyGrump

@Panda, I don't have Office, but maybe can use a friend's computer next time to try the OneNote version.

@SBT -- I used the OCR that is included in PDF-XChange reader -- since it is freeware, it might even use the Tesseract engine. I will hunt down and try Tesseract for the next time I hit one of these. It couldn't give me worse output than this, I think.
Thank you so much for the Regex patterns. I bet they will be useful when I am dealing with OCR from archive.org.

Thanks again to you both.

FYI, you can process a scanned PDF with k2pdfopt to OCR it with the tesseract engine, which is built into k2pdfopt (free software).

PandathePanda · 10-22-2015, 12:22 AM

Quote:

Originally Posted by willus

FYI, you can process a scanned PDF with k2pdfopt to OCR it with the tesseract engine, which is built into k2pdfopt (free software).

Tried it and it failed to "convert" the one PDF I'm slowly OCRing. It does the first two pages, then it fails.

willus · 10-22-2015, 09:42 AM

Quote:

Originally Posted by PandathePanda

Tried it and it failed to "convert" the one PDF I'm slowly OCRing. It does the first two pages, then it fails.

Are you able to PM me a copy of the source PDF that fails (and the settings / platform you are using)? If so, I will investigate.
