06-20-2014, 01:54 PM | #1 |
Connoisseur
Posts: 91
Karma: 10
Join Date: Feb 2014
Location: Long Island, NY
Device: Aura, N514KUBKKEP, 4.7.10.413
|
Book has a lot of unusual characters. Possible to OCR?
Without the result being filled with errors. See image.
I would imagine that if it is possible, I'd have to make some significant adjustments to the OCR properties of ABBYY FineReader. Last edited by u238110; 06-20-2014 at 02:06 PM. |
06-20-2014, 02:37 PM | #2 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Of course it is possible to do OCR. The results I cannot predict. What you can do is add a document language. Choose manual and add a new one. Base it on a copy of English and add the additional characters seen in the example. Save it and make sure it is actually selected for the document.
No guarantees, but it should work, as these characters are now part of the language possibilities according to ABBYY. |
06-20-2014, 02:45 PM | #3 |
Connoisseur
Posts: 91
Karma: 10
Join Date: Feb 2014
Location: Long Island, NY
Device: Aura, N514KUBKKEP, 4.7.10.413
|
It seems to boil down to these types of characters: ā á ã
So a line, an accent, and a tilde. So just add those three things for every single letter and I'm good to go? Last edited by u238110; 06-30-2014 at 10:30 AM. |
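For anyone wanting to see what "those three things for every letter" amounts to, here is a minimal Python sketch. It assumes the three marks are the macron, acute, and tilde, and the helper name `precomposed` is made up for illustration. It only lists the combinations that Unicode provides as a single precomposed character; it says nothing about what FineReader will recognise.

```python
import unicodedata

# Combining marks for the three diacritics mentioned above (an assumption
# based on the examples ā á ã).
COMBINING = {"macron": "\u0304", "acute": "\u0301", "tilde": "\u0303"}

def precomposed(base_letters, marks=COMBINING):
    """Return the single-character forms that exist for base letter + mark."""
    found = []
    for base in base_letters:
        for mark in marks.values():
            # NFC merges base + combining mark when a precomposed form exists.
            composed = unicodedata.normalize("NFC", base + mark)
            if len(composed) == 1:
                found.append(composed)
    return found

vowels = "aeiou"
print("".join(precomposed(vowels + vowels.upper())))
```

The printed string could be pasted into a custom alphabet, though trimming it down to the characters that actually occur in the book is probably the safer choice.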
06-20-2014, 04:20 PM | #4 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
What you want to do is go into "Tools - Language Editor":
Select "New...":
Create a new language based on an existing language:
And under Alphabet, you want to toss in a bunch of the accented characters you see throughout your book at the very end of the list: Śśāīṛṣ
I typically just copy/paste characters off of these Wiki pages (they are highly organized and very easy to visualize there):
https://en.wikipedia.org/wiki/Macron
https://en.wikipedia.org/wiki/Grave_accent
https://en.wikipedia.org/wiki/Acute_accent
https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29
https://en.wikipedia.org/wiki/Circumflex
https://en.wikipedia.org/wiki/Caron
https://en.wikipedia.org/wiki/Dot_%28diacritic%29
https://en.wikipedia.org/wiki/Tilde
You definitely want to iron out any sort of language/alphabet choices BEFORE you start mass OCRing the book. Because if you get halfway through the book, and finally notice FineReader is missing every single ā, depending on how many times that character occurs in the book, it might be extremely painful to go back and fix all of those manually. If you swap languages halfway through, FineReader will complain and want to reOCR the entire thing under its new settings.
Side Note: I actually never ran across a book with so many (odd) accents, so I never actually tackled an OCR using this method. The books I convert just have the usual common English, German, French, Spanish accents.
I would probably err on the side of caution and insert AS FEW of these odd characters as possible. The OCR might become highly inaccurate if you start adding in too many. (For example, the bottom of the letter 'g' quite often swings close to the letter on the line below. It MAY mistake that as a different character with a caron/macron above it, etc. etc.).
As to the accuracy of characters with dots above/below, I don't know, I have never run across it in a book I had to OCR. The only one I can recall is one person's name with a capital I with a dot above it 'İ' (I believe it is used in Turkish?).
I just manually inserted those whenever his name was mentioned. |
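One way to keep the added alphabet as small as possible is to count which accented letters actually occur in a hand-corrected sample page before setting up the language. The Python sketch below is only an illustration: the function name `unusual_letters` and the sample sentence are made up, and it assumes you already have a page or two of correct text to feed it.

```python
from collections import Counter
import unicodedata

def unusual_letters(text):
    """Count non-ASCII letters in text, most frequent first."""
    counts = Counter(
        ch
        for ch in unicodedata.normalize("NFC", text)  # merge base + combining mark
        if ord(ch) > 127 and unicodedata.category(ch).startswith("L")
    )
    return counts.most_common()

# Hypothetical sample text; replace with a corrected page from the book.
sample = "Śiva and Kṛṣṇa in the Mahābhārata"
for ch, n in unusual_letters(sample):
    print(ch, n, unicodedata.name(ch))
```

Characters that show up only once or twice may be cheaper to insert by hand than to add to the alphabet.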
06-21-2014, 02:30 AM | #5 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Tex, thanks for the screenshots. I couldn't make them myself.
|
06-21-2014, 06:37 AM | #6 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Books with Romaji, particularly earlier forms, are like this too. But I have only run across one in the last 5 years or so. It will take some thought as to which font to use, too. There are free fonts which provide these characters, thank goodness.
|
06-30-2014, 11:17 AM | #7 |
Connoisseur
Posts: 91
Karma: 10
Join Date: Feb 2014
Location: Long Island, NY
Device: Aura, N514KUBKKEP, 4.7.10.413
|
Thank you for taking the time to make that excellent post, Tex2002ans.
|
07-11-2014, 08:11 PM | #8 |
.~^пиратка^~.
Posts: 238
Karma: 14000
Join Date: Sep 2009
Location: Ask NSA...
Device: Onyx Boox M92
|
I once OCR'd a book that had half the text in Swedish, and half the text in Russian.
Swedish has 3 extra characters and Russian is written in the Cyrillic alphabet. To add to the challenge, the book was full of difficult graphics and pictures. It needed a lot of hands-on corrections. But it DID work. ABBYY is a Russian company actually, and they are very international in their outlook. I was surprised at how good ABBYY was at Swedish. First of all you tell it what languages the text is in, then you have to manually "teach" it to recognise characters it's unfamiliar with. Italics make recognition harder, as do any fancy/pretty fonts. OCR likes Arial and Times New Roman, non-bold, non-italic. Last edited by martienne; 07-11-2014 at 08:14 PM. |
|