MobileRead Forums - View Single Post - defining non english characters in an english epub

Tex2002ans · 06-10-2021, 02:53 PM

Quote:

Originally Posted by Greg Anos

Is there a hex guide for defining these character sets (as hex strings)? (Like em dashes being defined as a 3 hex character string.)

Why, exactly, are you trying to use hex codes instead of just using the actual character?
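If the hex forms are still wanted for reference, they can be derived from the character itself rather than looked up in a guide. The sketch below is only an illustration, and the helper name `describe` is made up for this example. It shows the code point, the two numeric entity spellings, and the UTF-8 bytes, which for an em dash are presumably the "3 hex character string" the question refers to.

```python
def describe(char: str) -> dict:
    """Return the common 'hex' spellings of a single character."""
    code_point = ord(char)
    return {
        "code_point": f"U+{code_point:04X}",
        "decimal_entity": f"&#{code_point};",
        "hex_entity": f"&#x{code_point:X};",
        # The bytes actually stored in a UTF-8 encoded file (Python 3.8+).
        "utf8_bytes": char.encode("utf-8").hex(" ").upper(),
    }


if __name__ == "__main__":
    # "\u2014" is the em dash, "\u00f1" is n with tilde.
    for ch in ("\u2014", "\u00f1"):
        print(describe(ch))
```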

In EPUB, the only special entity you have to worry about is the Non-Breaking Space (&nbsp; or &#160;).

Everything else can use the actual Unicode characters:

— = Em Dash

There's no need to clutter your code with an entity such as &#8212;.

Quote:

Originally Posted by Greg Anos

I am doing a long project of scanning and converting to ePub 30 years of a specialty hobby journal (pro bono publico).

Great. What tools are you using to scan + OCR?

Quote:

Originally Posted by Greg Anos

I need to use the occasional non-English character (letter with tilde, umlaut, and French letter characters).

Only because the OCR isn't recognizing these characters?

OCR outputs:

facade
ninos

but your actual article says:

façade
niños

Usually, if you enable the proper OCR languages, these accented characters will be recognized.

Side Note: I wrote a bit about OCR + German/Spanish/French accents in "Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman" (Post #5).

For example, the English OCR language only recognizes A-Z, while Spanish recognizes A-Z plus a few more:

ÁÉÍÑÓÚÜáéíñóúü

So if you have a 98% English book with 2% Spanish names/words, you'd tell OCR this is an English AND Spanish book. This would catch all the little accents on the ñ and á and é.
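One way to sanity-check the OCR output afterwards is to count which non-ASCII letters actually made it into the text. The sketch below is only an illustration: the function names are made up, and the "expected" set is simply the Spanish letters listed above. If a journal full of Spanish names produces an empty count, the extra OCR language was probably not enabled.

```python
import unicodedata
from collections import Counter

# The extra Spanish letters listed above; extend for French, German, etc.
SPANISH_EXTRA = set("ÁÉÍÑÓÚÜáéíñóúü")


def non_ascii_letters(text: str) -> Counter:
    """Count every letter outside plain ASCII in the OCR output."""
    # NFC merges "n" + combining tilde into a single precomposed character.
    text = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in text if ch.isalpha() and ord(ch) > 127)


def unexpected_letters(text: str, expected=SPANISH_EXTRA) -> dict:
    """Return accented letters that are not in the expected set."""
    counts = non_ascii_letters(text)
    return {ch: n for ch, n in counts.items() if ch not in expected}


if __name__ == "__main__":
    sample = "The fa\u00e7ade was painted by the ni\u00f1os."
    print(non_ascii_letters(sample))
    print(unexpected_letters(sample))
```

Letters reported as unexpected (the French ç in the sample) are a hint that another OCR language may need to be enabled as well.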