Old 06-20-2014, 02:54 PM   #1
u238110
Connoisseur
u238110 began at the beginning.
 
Posts: 53
Karma: 10
Join Date: Feb 2014
Location: Long Island, NY
Device: none
Book has a lot of unusual characters. Possible to OCR?

Without the result being filled with errors. See image.

I would imagine that if it is possible, I'd have to make some significant adjustments to the OCR properties of ABBYY FineReader.
[Attached image: SAM_0159.JPG]

Last edited by u238110; 06-20-2014 at 03:06 PM.
Old 06-20-2014, 03:37 PM   #2
Toxaris
Wizard
Posts: 3,097
Karma: 5658305
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-300, PRS-T1
Of course it is possible to do OCR, though I cannot predict the results. What you can do is add a document language: choose manual and add a new one. Base it on a copy of English and add the additional characters seen in the example. Save it and make sure it is actually selected for the document.
No guarantees, but it should work, since these characters are then part of the language's possible characters as far as ABBYY is concerned.
Old 06-20-2014, 03:45 PM   #3
u238110
Connoisseur
u238110 began at the beginning.
 
Posts: 53
Karma: 10
Join Date: Feb 2014
Location: Long Island, NY
Device: none
It seems to boil down to these types of characters: ā á ã

So a line, an accent, and a tilde. So I just add those three things for every single letter and I'm good to go?
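
To get a feel for what "those three things for every single letter" amounts to, here is a minimal Python sketch (not part of FineReader, and the function name is made up for illustration). It uses the standard `unicodedata` module to list which Latin letters have a single-character form with a macron, an acute accent, or a tilde. Only a fraction of the letters do, so the list to paste in is much shorter than 26 x 3.

```python
import unicodedata

# Combining marks for the three diacritics mentioned above.
MARKS = {
    "macron": "\u0304",
    "acute": "\u0301",
    "tilde": "\u0303",
}

def precomposed_variants(letters="abcdefghijklmnopqrstuvwxyz"):
    """Return {mark name: string of letters that exist as one code point}."""
    result = {}
    for name, mark in MARKS.items():
        found = []
        for base in letters + letters.upper():
            composed = unicodedata.normalize("NFC", base + mark)
            # NFC gives a single code point only when Unicode
            # defines a precomposed form for this letter + mark.
            if len(composed) == 1:
                found.append(composed)
        result[name] = "".join(found)
    return result

if __name__ == "__main__":
    for name, chars in precomposed_variants().items():
        print(name, chars)
```

Whether every one of these belongs in the custom language is a separate question; it is probably safer to add only the ones that actually occur in the book.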

Last edited by u238110; 06-30-2014 at 11:30 AM.
Old 06-20-2014, 05:20 PM   #4
Tex2002ans
Fanatic
 
Posts: 530
Karma: 562971
Join Date: Jul 2012
Device: Nook
What you want to do is go into "Tools - Language Editor":

[Screenshot: Step1ToolsLanguageEditor.png]

Select "New...":

[Screenshot: Step2LanguageEditor.png]

Create a new language based on an existing language:

[Screenshot: Step3NewLanguage.png]

And under Alphabet, you want to toss in a bunch of the accented characters you see throughout your book at the very end of the list:

Śśāīṛṣ

[Screenshot: Step4LanguageProperties.png]

I typically just copy/paste characters off of these Wiki pages (they are highly organized and very easy to visualize there):

https://en.wikipedia.org/wiki/Macron
https://en.wikipedia.org/wiki/Grave_accent
https://en.wikipedia.org/wiki/Acute_accent
https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29
https://en.wikipedia.org/wiki/Circumflex
https://en.wikipedia.org/wiki/Caron
https://en.wikipedia.org/wiki/Dot_%28diacritic%29
https://en.wikipedia.org/wiki/Tilde
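
As an alternative to picking characters off those pages by eye, here is a rough Python sketch. It assumes you have typed (or already have) a short, correct sample of the book's text, which is an assumption and not something FineReader provides. The function name `extra_alphabet` is hypothetical. It collects every letter in the sample that is not a plain English letter, giving you a string to paste at the end of the Alphabet field.

```python
import unicodedata

ENGLISH = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def extra_alphabet(sample, base_alphabet=ENGLISH):
    """Return the letters in `sample` that are not in `base_alphabet`,
    sorted by code point, without duplicates."""
    # Normalise to NFC so 'a' + combining macron counts as the single letter 'ā'.
    text = unicodedata.normalize("NFC", sample)
    extras = {ch for ch in text if ch.isalpha() and ch not in base_alphabet}
    return "".join(sorted(extras))

if __name__ == "__main__":
    # Sample text is made up for illustration.
    print(extra_alphabet("Śiva and Kṛṣṇa; the ātman"))
```

The result is only as complete as the sample, so it is worth typing in a few lines from different parts of the book.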

You definitely want to iron out any language/alphabet choices BEFORE you start mass OCRing the book. If you get halfway through the book and only then notice FineReader is missing every single ā, then depending on how many times that character occurs, it might be extremely painful to go back and fix all of those manually.

If you swap languages halfway through, FineReader will complain and want to re-OCR the entire thing under its new settings.
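
One way to catch this early is to OCR a handful of test pages first and check whether the special characters show up at all. The sketch below is Python, not a FineReader feature. It assumes you have saved those test pages as plain UTF-8 text, and the function names are made up. It counts each expected character and reports the ones that never appear.

```python
import unicodedata
from collections import Counter

def count_expected(ocr_text, expected_chars):
    """Count how often each expected character occurs in the OCR output."""
    # Normalise both sides to NFC so composed and decomposed forms compare equal.
    counts = Counter(unicodedata.normalize("NFC", ocr_text))
    expected = unicodedata.normalize("NFC", expected_chars)
    return {ch: counts[ch] for ch in expected}

def missing_chars(ocr_text, expected_chars):
    """Return the expected characters that never appear in the OCR output."""
    return [ch for ch, n in count_expected(ocr_text, expected_chars).items() if n == 0]

if __name__ == "__main__":
    # Made-up OCR output where the macron and the dot below were lost.
    sample = "Krsna and the atman, Śiva"
    print(missing_chars(sample, "Śāṛ"))
```

A character that is expected on the test pages but has a count of zero is a sign the custom language or alphabet is not being applied.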

Side Note: I actually never ran across a book with so many (odd) accents, so I never actually tackled an OCR using this method. The books I convert just have the usual common English, German, French, Spanish accents.

I would probably err on the side of caution and insert AS FEW of these odd characters as possible. The OCR might become highly inaccurate if you start adding in too many. (For example, the bottom of the letter 'g' quite often swings close to the letter on the line below. It MAY mistake that as a different character with a caron/macron above it, etc. etc.).

As to the accuracy of characters with dots above/below, I don't know, I have never run across it in a book I had to OCR. The only one I can recall is one person's name with a capital I with a dot above it 'İ' (I believe it is used in Turkish?). I just manually inserted those whenever his name was mentioned.
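
If a rare character only turns up in a few known words, such as that one name, the manual fix can also be done as a search-and-replace on the exported text. A minimal Python sketch follows. The name "Ismet" and the function name are hypothetical examples, not from the book in question.

```python
import re

def restore_names(text, corrections):
    """Replace each misrecognised whole word with its corrected spelling."""
    for wrong, right in corrections.items():
        # \b limits the match to whole words, so longer words
        # that merely start with the same letters are left alone.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text)
    return text

if __name__ == "__main__":
    fixes = {"Ismet": "İsmet"}  # hypothetical name, for illustration only
    print(restore_names("Ismet arrived. Ismets were not changed.", fixes))
```

This only helps when the OCR output is consistently wrong in the same way; anything else still needs proofreading by eye.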
Old 06-21-2014, 03:30 AM   #5
Toxaris
Wizard
Posts: 3,097
Karma: 5658305
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-300, PRS-T1
Tex, thanks for the screenshots. I couldn't make them myself.
Old 06-21-2014, 07:37 AM   #6
mrmikel
Color me gone
 
Posts: 2,086
Karma: 1444487
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
Books with Romaji, particularly earlier forms, are like this too, but I have only run across one in the last 5 years or so. It will require some thought as to which font to use, too. There are free fonts which provide these characters, thank goodness.
Old 06-30-2014, 12:17 PM   #7
u238110
Connoisseur
u238110 began at the beginning.
 
Posts: 53
Karma: 10
Join Date: Feb 2014
Location: Long Island, NY
Device: none
Thank you for taking the time to make that excellent post, Tex2002ans.
Old 07-11-2014, 09:11 PM   #8
martienne
.~^пиратка^~.
Posts: 218
Karma: 14000
Join Date: Sep 2009
Location: Ask NSA...
Device: Onyx Boox M92
I once OCR'd a book that had half the text in Swedish and half in Russian.
Swedish has 3 extra characters and Russian is written in Cyrillic. To add to the challenge, the book was full of difficult graphics and pictures.

It needed a lot of hands-on corrections. But it DID work.

Abbyy is a Russian company actually - and they are very international in their outlook. I was surprised at how good Abbyy was at Swedish.

First of all you tell it what languages the text is in, then you have to manually "teach" it to recognise characters it's unfamiliar with. Italics make recognition harder, as do any fancy/pretty fonts. OCR likes Arial and Times New Roman, non-bold and non-italic.

Last edited by martienne; 07-11-2014 at 09:14 PM.
All times are GMT -4.