09-19-2015, 10:09 AM   #1
GrannyGrump
Obsessively Dedicated...
Posts: 3,241
Karma: 35158061
Join Date: May 2011
Location: PA {back in the usa!}
Device: Sony PRS-T2, ADE on PC
Bad OCR... When spellcheck won't help

For some recent stories I worked on, I had to use a free OCR service, and it gave results like these:

The first was very tightly kerned; the OCR got all the characters right, but not the word breaks.
Quote:
Thenthroughanotherkitchen,
where redrustwasmaking itsfull
mealof a comparativelymodernrange.
Then into the greathall wherethe
old armorandthebuff-coatsandround
capshungonthewalls,andwherethe
carvedstonestaircasesran at eachside
Another was at least 75% gibberish.
Quote:
and tho enorrnous vuoee of solid silver, we
heavy for l1i1n to 1il't—r_-.ve1r these were hie-
hrrd lrc not found tlrer|r—he, by his own skill
rr,||r_i tunnirrg P He went about In Llro rooms,
wurrlriirg one after the other the beautiful,
mre things. Hr: oun:ssr:1.l the gold and the
jrrwele. He tlrrerr his nrms round the great
silver vnsos; he wound round lrirnself tho
l'iun\"j rod velvet of the crltlnllr Wlroro tllo
grithrrsgleaured in embossed goldnrnd shi ‘~'IJ1‘Ei.l
I had to pretty much manually go through the first and force word breaks, and manually transcribe the second.

ABBYY FineReader is not in my future, sorry to say. Is there any free software that might give better results than I got here?

Regex is absolutely my weak spot, but does anyone have any suggestions for the next time I run up against this type of situation?

Last edited by GrannyGrump; 09-19-2015 at 10:12 AM.
09-19-2015, 12:26 PM   #2
PandathePanda
a toy panda
Posts: 2,568
Karma: 26020474
Join Date: Mar 2014
Location: Onboard the Queen Anne's Revenge
Device: Various Android devices
If you have MS Office (an edition that includes OneNote), you can use it for OCR. In my attempts I had about 80-85% success, but it depends on the quality of the input.
09-19-2015, 05:08 PM   #3
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
I use tesseract, which gives decent to very good results if the scans are half-decent. ABBYY gives appreciably better results, admittedly, but I haven't found anything better that's free. I use regexp quite a lot for initial OCR cleanup; I'll see if I can find a list of standard expressions somewhere. Another trick I often use is word frequency: words that occur only once or twice are pretty often suspect. But I'd be stumped by output like the samples you show. May I ask what OCR you used?
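
The word-frequency trick mentioned above could be sketched roughly like this in Python. This is only an illustration: the function name `rare_words`, the `max_count` threshold, and the tokenizing regex are assumptions made for the example, not something taken from an existing tool.

```python
import re
from collections import Counter

def rare_words(text, max_count=2):
    """Return (word, count) pairs for words seen at most max_count times."""
    # Keep letters and digits together so OCR junk like 'l1i1n' stays one token.
    words = re.findall(r"[^\W_]+(?:'[^\W_]+)*", text.lower())
    counts = Counter(words)
    suspects = [(w, n) for w, n in counts.items() if n <= max_count]
    # Rarest first, then alphabetical, so the list is stable between runs.
    return sorted(suspects, key=lambda item: (item[1], item[0]))

sample = "the gold and the jrrwele and the gold and tho silver"
for word, n in rare_words(sample, max_count=1):
    print(n, word)
```

On a short sample, perfectly good words show up as rare too, so this is mostly useful on a whole book, where genuine words tend to repeat and OCR accidents tend not to.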
09-19-2015, 05:21 PM   #4
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
These are the things I normally look for in tesseract output:
'fi' and 'fl' often get mixed up. Still haven't found the best pattern here, but
/fi[oaie]/ definitely catches a few dodgy ones.
/[a-zA-Z][0-9]/ : a letter followed by a digit is pretty dodgy, though normally this only happens with 0 and 1.
/[a-z][A-Z]/ : a lower-case followed by an upper-case
/ [,.?!:]/ : Whitespace before punctuation.
/\b[bcdfhjklmnopqrstuvwxyz]\b/ : Single-letter words, excluding a and the letters of e.g. and i.e. (e, g, i).
/\b\(tl\|nr\|rr\)/ : Impossible beginnings to English words. Any linguists care to expand?
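
For anyone who would rather run the whole list in one pass, here is a minimal Python sketch. The patterns are the ones listed above; the last one is rewritten from the escaped-parenthesis form into Python's `re` syntax. The names `SUSPECT_PATTERNS` and `flag_suspects` and the labels are made up for this example.

```python
import re

# The patterns from the list above, in Python's re syntax.
SUSPECT_PATTERNS = [
    ("fi before a vowel (possible fi/fl mix-up)", re.compile(r"fi[oaie]")),
    ("letter followed by digit", re.compile(r"[a-zA-Z][0-9]")),
    ("lower-case followed by upper-case", re.compile(r"[a-z][A-Z]")),
    ("space before punctuation", re.compile(r" [,.?!:]")),
    ("odd single-letter word", re.compile(r"\b[bcdfhjklmnopqrstuvwxyz]\b")),
    ("unlikely word start", re.compile(r"\b(?:tl|nr|rr)")),
]

def flag_suspects(text):
    """Yield (line_number, label, matched_text) for every pattern hit."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS:
            for match in pattern.finditer(line):
                yield lineno, label, match.group()

ocr_text = "He tlrrerr his nrms round\nheavy for l1i1n to lift ."
for lineno, label, found in flag_suspects(ocr_text):
    print(f"line {lineno}: {label}: {found!r}")
```

These only flag places to look at; they don't fix anything. Expect false positives: the first pattern also hits ordinary words such as "field", and contractions such as "don't" trip the single-letter pattern because the apostrophe counts as a word boundary.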

Last edited by SBT; 09-19-2015 at 05:23 PM.
09-20-2015, 12:02 AM   #5
GrannyGrump
Obsessively Dedicated...
Posts: 3,241
Karma: 35158061
Join Date: May 2011
Location: PA {back in the usa!}
Device: Sony PRS-T2, ADE on PC
@Panda, I don't have Office, but maybe I can use a friend's computer next time to try the OneNote version.

@SBT -- I used the OCR that is included in PDFXchange reader -- since it is freeware, it might even use the tesseract engine. I will hunt down and try tesseract for the next time I hit one of these. It couldn't give me worse output than this, I think.
Thank you so much for the Regex patterns. I bet they will be useful when I am dealing with OCR from archive.org.

Thanks again to you both.
09-20-2015, 04:32 AM   #6
elibrarian
Imperfect Perfectionist
Posts: 715
Karma: 863576
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
There actually exist tools to split texts like the first sample into legible words (!), but they seem to be very specialized and out of the league of ordinary people like you and me (I haven't tried any) - e.g. wordsplit or wordsegment.
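
To give an idea of what such a tool does, here is a toy dynamic-programming splitter in Python. It is a sketch under assumptions, not how wordsplit or wordsegment are implemented: the function `segment`, the tiny hand-made `vocabulary`, and the cost values are all invented for the example. Real tools generally rank competing splits using word frequencies from a large corpus; this one simply prefers the split with the fewest pieces.

```python
def segment(text, vocabulary, max_word_len=20):
    """Split text into vocabulary words, using as few pieces as possible.

    Characters not covered by any vocabulary word are kept as
    single-character pieces, at a high cost so they are a last resort.
    """
    n = len(text)
    # best[i] holds (cost, pieces) for the cheapest split of text[:i].
    best = [(0, [])] + [None] * n
    for end in range(1, n + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            prev_cost, prev_pieces = best[start]
            if piece.lower() in vocabulary:
                candidates.append((prev_cost + 1, prev_pieces + [piece]))
            elif end - start == 1:
                candidates.append((prev_cost + 10, prev_pieces + [piece]))
        best[end] = min(candidates, key=lambda c: c[0])
    return best[n][1]

# Hypothetical mini word list; a real run would load a full dictionary.
vocabulary = {"then", "through", "another", "kitchen", "the", "an",
              "other", "rough", "kit", "hen", "in"}
print(" ".join(segment("Thenthroughanotherkitchen", vocabulary)))
```

With this word list the first line of the sample comes out as "Then through another kitchen". Punctuation is not handled, so in practice you would feed it one run of letters at a time.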

Regarding the second one, you might get some help from the PepitoCleaner extension (general cleanup) and the Linguist extension (finding the most common misspellings, so they can be corrected by find-and-replace) - if you have LibreOffice.

(Don't let the silly logo of Pepito Cleaner put you off - it's actually a very good tool - and you can also put in your own regexes if need be.)

Regards

Kim
09-20-2015, 07:40 AM   #7
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
@elibrarian: Thanks for the tip on the LibreOffice extensions.

Meanwhile, I've started looking into a particular problem of mine.
"My" ebooks are mostly 19th century Norwegian books, using spelling and grammar that's somewhere half-way between Danish and modern Norwegian, meaning I can't use spell-checkers, because no pre-1907 Norwegian ispell dictionary exists. However, a lot of properly proofread digital 19th c. Norwegian texts exist (>10,000 pages).

I came across this 21-line(!) spell-checker at norvig.com, the site more famous as the origin of the PowerPoint version of the Gettysburg address. Based on a huge reference text, it checks the spelling of a word. Though oriented towards human errors like transposition, it should be possible to tweak it to look for typical OCR mistakes, like 'f00lish' and 'junip'.
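
A rough Python sketch of what that tweak might look like. To be clear, this is not Norvig's code: his corrector generates candidates through generic edits (deletions, transpositions, replacements, insertions), whereas this version swaps those for a small table of OCR confusions. The table `OCR_CONFUSIONS`, the function names, and the `max_fixes` limit are all assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical (misread, intended) pairs; extend these from your own scans.
OCR_CONFUSIONS = [("0", "o"), ("1", "l"), ("1", "i"), ("rn", "m"),
                  ("ni", "m"), ("tl", "th"), ("vv", "w")]

def train(reference_text):
    """Count word frequencies in a proofread reference text."""
    return Counter(re.findall(r"[^\W\d_]+", reference_text.lower()))

def variants(word):
    """All strings obtained by undoing one confusion at one position."""
    out = set()
    for wrong, right in OCR_CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            out.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return out

def correct(word, counts, max_fixes=2):
    """Return the most frequent known word within max_fixes substitutions."""
    if word in counts:
        return word
    frontier = {word}
    for _ in range(max_fixes):
        frontier = {v for w in frontier for v in variants(w)}
        known = [w for w in frontier if w in counts]
        if known:
            return max(known, key=counts.get)
    return word  # nothing plausible found; leave it for manual review

counts = train("the foolish man made the horse jump and jump again")
print(correct("f00lish", counts), correct("junip", counts))
```

Because the word counts come from whatever text is passed to `train`, the same approach should work with a proofread corpus in older Norwegian spelling in place of a dictionary.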
10-19-2015, 07:51 AM   #8
Kennth
Junior Member
Posts: 7
Karma: 380010
Join Date: Sep 2015
Location: New York
Device: none
Output quality depends on input quality, i.e. the input images should be properly scanned and cleaned, with a resolution of at least 600 ppi. I would recommend Tesseract OCR as an alternative.
10-20-2015, 01:18 AM   #9
grumbles
Addict
Posts: 238
Karma: 1500000
Join Date: Nov 2009
Location: Toronto
Device: Pandigital Novel (Black), T-2 and 3, Nexus 7
@Granny:

Do you know someone with a scanner? Most scanners come with an OCR program, most likely ABBYY Sprint. My Epson V500 came with Sprint 6.0 and the DS-30 came with Sprint 9.0. Both work well, but in my experience they work better with Windows XP than 7. They do work well if you have W7 Pro and run them as XP programs. And there is ABBYY version 4, which was on the cover disc for an issue of PC Plus in the early 2000s.
10-21-2015, 10:04 PM   #10
willus
Fuzzball, the purple cat
Posts: 1,312
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by GrannyGrump
@Panda, I don't have Office, but maybe I can use a friend's computer next time to try the OneNote version.

@SBT -- I used the OCR that is included in PDFXchange reader -- since it is freeware, it might even use the tesseract engine. I will hunt down and try tesseract for the next time I hit one of these. It couldn't give me worse output than this, I think.
Thank you so much for the Regex patterns. I bet they will be useful when I am dealing with OCR from archive.org.

Thanks again to you both.
FYI, you can process a scanned PDF with k2pdfopt to OCR it with the tesseract engine, which is built into k2pdfopt (free software).
10-21-2015, 11:22 PM   #11
PandathePanda
a toy panda
Posts: 2,568
Karma: 26020474
Join Date: Mar 2014
Location: Onboard the Queen Anne's Revenge
Device: Various Android devices
Quote:
Originally Posted by willus
FYI, you can process a scanned PDF with k2pdfopt to OCR it with the tesseract engine, which is built into k2pdfopt (free software).
Tried it, and it failed to "convert" the one PDF I'm slowly OCRing: it does the first two pages, then it fails.
10-22-2015, 08:42 AM   #12
willus
Fuzzball, the purple cat
Posts: 1,312
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by PandathePanda
Tried it, and it failed to "convert" the one PDF I'm slowly OCRing: it does the first two pages, then it fails.
Are you able to PM me a copy of the source PDF that fails (and the settings / platform you are using)? If so, I will investigate.
All times are GMT -4. The time now is 12:38 PM.

