Book has a lot of unusual characters. Possible to OCR?

u238110 · 06-20-2014, 01:54 PM

Without the result being filled with errors. See image.

I would imagine that if it is possible, I'd have to make some significant adjustments to the OCR properties of ABBYY FineReader.

Toxaris · 06-20-2014, 02:37 PM

Of course it is possible to do OCR; I just cannot predict the results. What you can do is add a document language. Choose manual and add a new one. Base it on a copy of English and add the additional characters seen in the example. Save it and make sure it is actually selected for the document.
No guarantees, but it should work, since ABBYY will now treat these characters as part of the language.

u238110 · 06-20-2014, 02:45 PM

It seems to boil down to these types of characters: ā á ã

So a line, an accent, and a tilde. So just add those three things for every single letter and I'm good to go?

Tex2002ans · 06-20-2014, 04:20 PM

What you want to do is go into "Tools - Language Editor":

[Screenshot: Step1ToolsLanguageEditor.png]

Select "New...":

[Screenshot: Step2LanguageEditor.png]

Create a new language based on an existing language:

[Screenshot: Step3NewLanguage.png]

And under Alphabet, you want to toss in a bunch of the accented characters you see throughout your book at the very end of the list:

Śśāīṛṣ

[Screenshot: Step4LanguageProperties.png]

I typically just copy/paste characters off of these Wiki pages (they are highly organized and very easy to visualize there):

https://en.wikipedia.org/wiki/Macron
https://en.wikipedia.org/wiki/Grave_accent
https://en.wikipedia.org/wiki/Acute_accent
https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29
https://en.wikipedia.org/wiki/Circumflex
https://en.wikipedia.org/wiki/Caron
https://en.wikipedia.org/wiki/Dot_%28diacritic%29
https://en.wikipedia.org/wiki/Tilde
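If copy/pasting gets tedious, a short script can build the candidate list instead. This is only a rough sketch, not anything FineReader itself provides: it assumes Python 3 is available, and the function name `accented_letters` and the choice of vowels are made up for illustration. It combines each base letter with a macron, acute, and tilde and keeps only the combinations that exist as a single precomposed Unicode character, which you could then paste at the end of the Alphabet field.

```python
import unicodedata

# Combining marks for the three diacritics mentioned in this thread.
MARKS = {
    "macron": "\u0304",
    "acute": "\u0301",
    "tilde": "\u0303",
}


def accented_letters(base_letters, marks=MARKS):
    """Return precomposed accented characters that exist in Unicode.

    Hypothetical helper for illustration only.
    """
    found = []
    for letter in base_letters:
        for mark in marks.values():
            composed = unicodedata.normalize("NFC", letter + mark)
            # Keep only combinations that collapse to a single code point;
            # anything longer has no precomposed form.
            if len(composed) == 1:
                found.append(composed)
    return found


# Example: candidate characters for the lowercase vowels.
alphabet_extra = "".join(accented_letters("aeiou"))
print(alphabet_extra)
```

Trim the result down to the characters that really appear in the book before pasting it in.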

You definitely want to iron out any sort of language/alphabet choices BEFORE you start mass OCRing the book. If you get halfway through the book and finally notice FineReader is missing every single ā, then depending on how many times that character occurs in the book, it might be extremely painful to go back and fix all of those manually.

If you swap languages halfway through, FineReader will complain and want to reOCR the entire thing under its new settings.

Side Note: I actually never ran across a book with so many (odd) accents, so I never actually tackled an OCR using this method. The books I convert just have the usual common English, German, French, Spanish accents.

I would probably err on the side of caution and insert AS FEW of these odd characters as possible. The OCR might become highly inaccurate if you start adding in too many. (For example, the bottom of the letter 'g' quite often swings close to the letter on the line below. It MAY mistake that as a different character with a caron/macron above it, etc. etc.).
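One way to keep the list short is to check which unusual letters actually occur in a page or two of text you have typed or proofread by hand. A minimal sketch, again assuming Python 3; the function name `unusual_letters` and the sample string are hypothetical and not taken from the book in question:

```python
import unicodedata
from collections import Counter


def unusual_letters(text):
    """Count the non-ASCII alphabetic characters in a piece of text.

    Hypothetical helper for illustration only.
    """
    # Normalize first so "a" + combining macron counts as the single letter "ā".
    text = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in text if ch.isalpha() and not ch.isascii())


# Made-up sample text; replace it with a hand-checked page from your own book.
sample = "Śiva and Kṛṣṇa in the Gītā"
for ch, count in unusual_letters(sample).most_common():
    print(ch, count, unicodedata.name(ch))
```

Only the characters that show up here would need to go into the custom language's Alphabet.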

As to the accuracy of characters with dots above/below, I don't know, I have never run across it in a book I had to OCR. The only one I can recall is one person's name with a capital I with a dot above it 'İ' (I believe it is used in Turkish?). I just manually inserted those whenever his name was mentioned.

Toxaris · 06-21-2014, 02:30 AM

Tex, thanks for the screenshots. I couldn't make them myself.

mrmikel · 06-21-2014, 06:37 AM

Books with Romaji, particularly earlier forms, are like this too. But I have only run across one in the last 5 years or so. It will require some thought as to which font to use, too. There are free fonts which provide these characters, thank goodness.

u238110 · 06-30-2014, 11:17 AM

Thank you for taking the time to make that excellent post, Tex2002ans.

martienne · 07-11-2014, 08:11 PM

I once OCR'd a book that had half the text in Swedish, and half the text in Russian.
Swedish has 3 extra characters and Russian is written in Cyrillic. To add to the challenge, the book was full of challenging graphics and pictures.

It needed a lot of hands-on corrections. But it DID work.

ABBYY is a Russian company actually, and they are very international in their outlook. I was surprised at how good ABBYY was at Swedish.

First of all you tell it what languages the text is in, then you have to manually "teach" it to recognise characters it's unfamiliar with. Italics make recognition harder, as do any fancy/pretty fonts. OCR likes Arial and Times New Roman, non-bold, non-italic.
