Quote:
Originally Posted by ittiandro
As you can see from the attached, the non-text items ( drawings, diagrams, etc) are (almost) O.K. with the exception of page 2 where the drawing at the top of the original page is squarely missing.
|
Ok, so on the left half, do you see how the "Image Box" (red) is not covering the entire image?
For example, on Page 3, the future/past figure is partially recognized as "Text", while the rest of the diagram is not recognized at all. You have to manually adjust this so that the entire figure is in ONE Image Box (red). You do this by hitting the picture button up top, and dragging a red square around the entire thing:
Then you have to manually go through the entire book and do a similar thing for the rest of the pages. Any images you find that are NOT in a box, you have to put the correct Text/Image/Table box around it:
Quote:
Originally Posted by ittiandro
All in all, the conversion is severely flawed because all of the special math characters and symbols of physics are misread and converted into other characters.
|
Yep yep, and things like hats, vectors, dots, overline, integrals, inline equations, Greek symbols... those are all going to give you an extremely hard time.
Here is a large discussion we had when talking about digitizing math texts to ebooks. You would probably run into all the same exact problems:
https://www.mobileread.com/forums/sho...d.php?t=228413
There is a reason why many of these non-fiction books are not in EPUB yet. If you don't have the original source files, it would just take way too much manpower to digitize the entire thing. It is just not worth it for most books (extremely high cost to digitize, and very low sales).
Quote:
Originally Posted by ittiandro
I am embarking now on another conversion job which I thought would be easier, but I am having second thoughts..
I have some PDF ( scanned) books containing Greek texts with the English translation side by side.
|
Sounds like another one that is on the "very hard" side of things. Multi-column texts are a pain in the butt. (I am currently digitizing a monthly newsletter, 2 and 3 column text. It is QUITE annoying and painstakingly slow.)
This is especially true for two separate, parallel columns of text. Finereader is designed to tackle multi-column text such as journals, where the text flows from the bottom of the left column -> the top of the right column. Finereader will auto-merge those paragraphs/sentences for you because it assumes the right column is a continuation of the same text.
In your case, you would want left column -> left column on next page, right column -> right column on next page.
I would not recommend tackling this conversion either, unless you are MUCH more familiar with the tools.
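To make the reading-order difference concrete, here is a minimal Python sketch. It assumes you have already managed to get each page out as a separate (left column, right column) pair of strings, which is the hard part and is not something Finereader hands you directly. The function names are made up for illustration.

```python
from typing import List, Tuple

# Each page is a (left_column_text, right_column_text) pair.
Page = Tuple[str, str]


def journal_order(pages: List[Page]) -> str:
    """Journal-style flow: left column, then right column, page by page.

    This is the order a multi-column OCR layout assumes by default.
    """
    return "\n".join(column for page in pages for column in page)


def parallel_order(pages: List[Page]) -> Tuple[str, str]:
    """Side-by-side translation flow: all left columns form one text,
    all right columns form another.
    """
    left_stream = "\n".join(left for left, _ in pages)
    right_stream = "\n".join(right for _, right in pages)
    return left_stream, right_stream


if __name__ == "__main__":
    pages = [("Greek p1", "English p1"), ("Greek p2", "English p2")]
    print(journal_order(pages))
    greek, english = parallel_order(pages)
    print(greek)
    print(english)
```

With journal order the Greek and English end up interleaved, which is exactly the mess you want to avoid in a side-by-side edition.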
Quote:
Originally Posted by ittiandro
Even though it is classic Greek, I thought that ABBYY could read it because the ancient Greek characters are exactly the same as those of modern Greek (with the exception of a number of accents and diacritic signs which have been dropped in modern Greek) and the language option of ABBYY lists Greek as one of the languages it can read.
|
Ouch again... Hopefully your scan is much higher quality as well, those accent signs are brutal. It takes me forever just to transcribe a sentence of Greek (heck, even single words take a while in some cases).
Doitsu pointed me to this resource, which might make it easier to do words with Greek Symbols:
http://www.lexilogos.com/keyboard/greek_ancient.htm
I also enjoy the organization of this Wikipedia article in order to visualize some of those harder accented characters:
https://en.wikipedia.org/wiki/Greek_diacritics
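If you want to see why those accented characters are so much harder than modern Greek, here is a small Python sketch using only the standard `unicodedata` module. It splits one polytonic letter into its base letter plus the stacked diacritics, and shows how stripping them gives you the plain letters. The helper names are my own, not from any OCR tool.

```python
import unicodedata


def describe(char: str) -> list:
    """Return the Unicode names of the base letter and each diacritic
    that make up a single (possibly precomposed) character."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", char)]


def strip_diacritics(text: str) -> str:
    """Remove all combining marks (accents, breathings, iota subscript),
    leaving only the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if unicodedata.combining(c) == 0)
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    # U+1F04: alpha with smooth breathing and acute accent
    alpha = "\u1f04"
    for name in describe(alpha):
        print(name)
    # The word "anthropos" with its polytonic first letter
    word = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"
    print(word, "->", strip_diacritics(word))
```

A single letter can carry a breathing mark and an accent at once, so the OCR has to tell apart many near-identical variants of the same base shape.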
Quote:
Originally Posted by ittiandro
Unfortunately, it is not the case: Greek characters ( or something like them!) appear in the conversion, but many of them are missing or distorted beyond recognition, words are jumbled together, etc. All in all, the text is readable with difficulty or plainly unreadable. Perhaps Greek readers have something to say.
|
Again, this is probably going to be an even MORE painstaking undertaking, but you might have to Train Finereader to make this case slightly more accurate. You do this by going into Tools - Options - Read, and under "Training", you will want to select "Use built-in and user patterns" or "Use only user pattern".
- Use built-in and user patterns
  - Finereader will do its best to OCR, but it will ask you whenever it runs across something it is "unsure" about.
- Use only user pattern
  - You will have to build the OCR from scratch, character by character.
  - This is more useful if you have a font/language that is just absolutely abysmal in Finereader, or the scan is quite poor (but still human readable).
Then you want to open the Pattern Editor, and create a new Pattern (probably called "Ancient Greek"). Now, you will also want to make sure that "Read with training" has a checkmark in it. If you press Read on your book now, the "Pattern Training" window will pop up:
Now you will have to go through character by character, and tell it exactly what Greek + diacritic character that is. Be warned, this is PAINFULLY slow, absolutely brutal, and most likely will only work in THAT SPECIFIC FONT (I doubt you will be working on books with that exact font again).
Side Note: In my opinion, huge waste of time, better to spend your limited manpower elsewhere.
Side Note: While looking up information on this Greek stuff, I stumbled upon:
http://wiki.digitalclassicist.org/OCR_for_ancient_Greek
Which led to this:
http://ancientgreekocr.org/
Perhaps that might work better than Finereader's default Greek recognition.
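If you try both tools on the same page, one very rough way to compare them is to check how much of the output is actually Greek letters. This is only a sanity check I am sketching here, not a real accuracy measure (it says nothing about whether the Greek letters are the correct ones), and `greek_ratio` is a made-up helper name.

```python
import unicodedata


def greek_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name marks them
    as Greek. Returns 0.0 for text with no letters at all."""
    letters = [c for c in unicodedata.normalize("NFC", text) if c.isalpha()]
    if not letters:
        return 0.0
    greek = sum(1 for c in letters if "GREEK" in unicodedata.name(c, ""))
    return greek / len(letters)


if __name__ == "__main__":
    # Paste the OCR output of the Greek column from each tool here.
    sample = "\u03bb\u03cc\u03b3\u03bf\u03c2 l0gos"
    print(round(greek_ratio(sample), 2))
```

A Greek column that comes back with a low ratio has clearly been misread into Latin letters or symbols, so it is a quick way to spot the worst pages before proofreading.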