Quote:
Originally Posted by Greg Anos
Is there a hex guide for defining these character sets (as hex strings)? (Like em dashes being defined as a 3 hex character string.)
|
Why, exactly, are you trying to use hex codes instead of just using the actual character?
In EPUB, the only special entity you have to worry about is the Non-Breaking Space (&nbsp; or &#160;).
Everything else can use the actual Unicode characters:
— = Em Dash
There's no need to clutter your code with entities like &mdash; or &#8212;.
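If your files are already full of entities, here is a rough Python sketch of how you could swap them for the real characters in bulk. The function name entities_to_unicode is my own invention, and the choice to write the non-breaking space back as &#160; is an assumption based on the advice above. The XML-special characters (&, <, >, and quotes) stay escaped so the XHTML remains well-formed. Test on a copy of your files first.

```python
import re
from html.entities import html5

# Characters that must stay escaped so the XHTML remains well-formed.
KEEP_ESCAPED = {"&", "<", ">", '"', "'"}
NBSP = "\u00a0"

# Matches named (&mdash;), decimal (&#8212;) and hex (&#x2014;) references.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def entities_to_unicode(text):
    """Replace character references with the characters themselves."""

    def replace(match):
        original = match.group(0)
        body = match.group(1)
        if body.startswith("#"):
            digits = body[1:]
            try:
                if digits[0] in "xX":
                    char = chr(int(digits[1:], 16))
                else:
                    char = chr(int(digits))
            except (ValueError, OverflowError):
                return original  # not a valid code point, leave untouched
        else:
            char = html5.get(body + ";")
            if char is None:
                return original  # unknown entity name, leave untouched
        if KEEP_ESCAPED.intersection(char):
            return original  # &amp;, &lt;, etc. must remain escaped
        if char == NBSP:
            return "&#160;"  # keep the non-breaking space visible as an entity
        return char

    return ENTITY.sub(replace, text)
```

Calling entities_to_unicode("ma&ntilde;ana&nbsp;&amp;") would give back "mañana&#160;&amp;": the accented letter becomes a real character, while the non-breaking space and the ampersand stay as entities.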
Quote:
Originally Posted by Greg Anos
I am doing a long project of scanning and converting to ePub 30 years of a specialty hobby journal (pro bono publico).
|
Great. What tools are you using to scan + OCR?
Quote:
Originally Posted by Greg Anos
I need to use the occasional non-English character (letter with tilde, umlaut, and French letter characters).
|
Only because the OCR isn't recognizing these characters?
OCR outputs:
but your actual article says:
Usually, if you enable the proper OCR languages, these accented characters will be recognized.
Side Note: I wrote a bit about OCR + German/Spanish/French accents in
"Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman" (Post #5).
For example, English only recognizes A-Z while Spanish will recognize A-Z + a few more:
So if you have a 98% English book with 2% Spanish names/words, you'd tell OCR this is an English AND Spanish book. This would catch all the little accents on the ñ and á and é.
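In FineReader you pick the languages in the program itself, but the same idea can be scripted. As an illustration only, here is a minimal sketch that assumes the open-source Tesseract command-line tool (not something you said you're using) with the English and Spanish language data installed. Tesseract takes several languages joined by "+" in its -l option. The function names and file names below are hypothetical.

```python
import subprocess


def tesseract_command(image_path, output_base, languages):
    """Build a Tesseract CLI call that loads several languages at once."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract joins multiple language codes with "+", e.g. "eng+spa".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run OCR; Tesseract writes the recognized text to <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base, languages), check=True)
```

For a mostly English page with a few Spanish names, you would call run_ocr("page_001.png", "page_001", ["eng", "spa"]), which runs the equivalent of "tesseract page_001.png page_001 -l eng+spa".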