MobileRead Forums - View Single Post - defining non english characters in an english epub

Greg Anos · 06-10-2021, 07:30 PM

Quote:

Originally Posted by Tex2002ans

Why, exactly, are you trying to use hex codes instead of just using the actual character?

In EPUB, the only special entity you have to worry about is the Non-Breaking Space (&#160; or &nbsp;).

Everything else can use the actual Unicode characters:

— = Em Dash

There's no need to clutter your code with &#8212;.
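As a rough illustration of the point above, here is a minimal Python sketch that swaps character references for the real Unicode characters. Python is an assumption here; nobody in the thread says they use it. The function name `decode_entities` is hypothetical. The sketch leaves the markup-significant references (`&amp;`, `&lt;` and so on) alone and writes the non-breaking space as `&#160;` so it stays visible in an editor.

```python
import html
import re

# Matches decimal, hex, and named character references such as
# &#8212; or &#x2014; or &mdash;
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Characters that must stay escaped so the XHTML remains well-formed.
KEEP_ESCAPED = {"&", "<", ">", '"', "'"}


def decode_entities(markup: str) -> str:
    """Replace character references with the actual characters.

    Assumption: markup-significant characters stay as written, and the
    non-breaking space is normalised to &#160; instead of being decoded.
    """
    def replace(match: re.Match) -> str:
        char = html.unescape(match.group(0))
        if char in KEEP_ESCAPED:
            return match.group(0)   # leave &amp;, &lt;, ... untouched
        if char == "\u00a0":
            return "&#160;"         # keep the non-breaking space visible
        return char                 # unknown references come back unchanged

    return ENTITY.sub(replace, markup)


print(decode_entities("fa&#231;ade, ni&ntilde;os&nbsp;&amp; co"))
```

Run on the sample string, this should print the text with ç and ñ as real characters, while `&#160;` and `&amp;` remain as references.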

Great. What tools are you using to scan + OCR?

Only because the OCR isn't recognizing these characters?

OCR outputs:

facade
ninos

but your actual article says:

façade
niños

Usually, if you enable the proper OCR languages, these accented characters will be recognized.

Side Note: I wrote a bit about OCR + German/Spanish/French accents in "Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman" (Post #5).

For example, English only recognizes A-Z while Spanish will recognize A-Z + a few more:

ÁÉÍÑÓÚÜáéíñóúü

So if you have a 98% English book with 2% Spanish names/words, you'd tell OCR this is an English AND Spanish book. This would catch all the little accents on the ñ and á and é.

It's a case of "you can't get there from here" for me.

Let me fill you in on the background. I used to have an Optibook (scan?) dedicated scanner + ABBYY FineReader 9. It did reasonably well, until the scanner died. I refuse to spend more money on another dedicated scanner.

I currently have a Brother multi-function scanner/printer/copier hooked up. It has its own OCR software. I did some test shots, and it OCR'ed better than FineReader 9.

However -

I want fully reflowable print streams. I want the e-book reader to be able to change font size with no problems. That's where it gets real sticky.

To do fully reflowable text, there can be no hard line breaks (line feeds) other than paragraph breaks. Otherwise you get the "long line plus short line" effect when the enlarged font overflows the line length.

aaaaaaaaaaaa
bbbbbbbbbbbb

Becomes

aaaaaaaaaa
aa
bbbbbbbbbb
bb

Which is not acceptable. (to me, anyways - and I'M doing the work!)
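The effect can be reproduced with a small Python sketch (Python and the function names are only for illustration, not part of my actual workflow). It wraps the same two OCR lines to a narrower width, once keeping the hard line breaks and once after joining the lines first.

```python
import textwrap


def reflow_keeping_breaks(lines, width):
    """Wrap every hard-broken line separately, so each break survives."""
    out = []
    for line in lines:
        out.extend(textwrap.wrap(line, width))
    return out


def reflow_joined(lines, width):
    """Join the lines into one paragraph first, then wrap it."""
    return textwrap.wrap(" ".join(lines), width)


ocr_lines = ["the quick brown fox", "jumps over the lazy"]

print(reflow_keeping_breaks(ocr_lines, 15))
# ['the quick brown', 'fox', 'jumps over the', 'lazy']

print(reflow_joined(ocr_lines, 15))
# ['the quick brown', 'fox jumps over', 'the lazy']
```

The first result shows the long/short/long/short pattern; the second fills each line as far as it will go.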

My OCR output choices are

TXT
RTF
HTML
XML

Which one would you rather use - to glue together 40 separate pages of text, and make them reflowable?

If you use TXT, goodbye to all your bold, italics, etc. BUT removing the line feed character is a piece of cake with a hex editor (which I have in a VirtualBox XP machine, which I run under Linux). There are no control characters at all, so LibreOffice has no problems with expanding and contracting the font size (or type).

RTF is FULL of control characters, of ALL sorts. Yes, you can blow up the font size, but the text steps on itself, because there are other control characters that don't change, that control the line spacing. Play with it yourself to see what I mean. Ripping out the RTF control data is a royal pain in the patootie - I used to do it when I used FineReader 9. And yes, those line size definitions carry over into ODT and DOC files, so converting does not solve the problem.
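For anyone who would rather script this step than do it in a hex editor, here is a minimal Python sketch. It is not what I actually run, and it rests on two assumptions that may not hold for every OCR program: paragraphs in the TXT output are separated by a blank line, and a page whose last paragraph does not end in sentence punctuation continues on the next page. `unwrap` and `glue_pages` are made-up names. Words hyphenated across a line break are not handled.

```python
import re

# Assumption: a paragraph that ends with one of these is complete.
SENTENCE_END = (".", "!", "?", '"', "\u201d")


def unwrap(text: str) -> str:
    """Remove hard line breaks inside paragraphs.

    Blank lines are treated as paragraph breaks; every other line feed
    becomes a single space.
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


def glue_pages(pages: list[str]) -> str:
    """Join OCR pages, merging a paragraph that runs over a page break."""
    glued = ""
    for page in pages:
        page = unwrap(page)
        if not page:
            continue                      # skip empty pages
        if not glued:
            glued = page
        elif glued.endswith(SENTENCE_END):
            glued += "\n\n" + page        # previous page ended a paragraph
        else:
            glued += " " + page           # paragraph continues on this page
    return glued


pages = [
    "First line of\ntext that runs\n\nSecond para starts",
    "and ends here.\n",
    "New page para.\n",
]
print(glue_pages(pages))
```

The punctuation heuristic will sometimes guess wrong (a page that happens to end on a full stop mid-paragraph, for instance), so the result still needs a read-through.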

HTML/XML has its own set of glue together problems. I am not fluent in HTML. I know just enough to limp along with it.

So, given my choices, I picked TXT, and I add the italics, etc. back in afterwards. Since I am already in the hex editor anyway, adding Unicode characters wouldn't be that big of a bother. OTOH, I can add those other-language characters once the text is glued together. Yes, I'm using LibreOffice Writer. These characters are only used for place names, and for citations of scientific journals that are not in English. They are usually in Portuguese or French. I needed the umlauts, because some of the authors are German, with umlauts in their last names.
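Since the accented words are mostly a known set of names and journal titles, one scripted alternative to fixing them by hand is a small correction table applied to the glued text. This is only a sketch in Python under that assumption; `CORRECTIONS` and `restore_accents` are hypothetical names, and the two entries are the examples from the quoted post. It matches whole words only and is case-sensitive.

```python
import re

# Hypothetical correction table: OCR spelling -> intended spelling.
CORRECTIONS = {
    "facade": "façade",
    "ninos": "niños",
}


def restore_accents(text: str, corrections: dict[str, str] = CORRECTIONS) -> str:
    """Replace whole-word OCR misreadings with their accented forms."""
    if not corrections:
        return text
    # \b keeps "facade" from matching inside a longer word like "facades".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in corrections) + r")\b"
    )
    return pattern.sub(lambda m: corrections[m.group(1)], text)


print(restore_accents("The facade of the ninos school"))
# The façade of the niños school
```

A table like this only helps for words you already know are wrong; it will not find new ones.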

I'm still missing 2 characters: a "c" with a cap on it (think the letter V upside down, with the point of the V at the top, sitting on the "c"), and an "a" with a tilde; you know, the squiggle line over the n in Spanish. (It's a Portuguese feature.)

This is a long haul. If you have better ways to do it, please let me know. (I've done 8 out of 180, so far.)
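For looking up the code points and the bytes to type into a hex editor, a short Python sketch using the standard `unicodedata` module might help. It assumes the "cap" I described is a circumflex; if the mark is really a cedilla (ç, which is common in Portuguese and French) or a caron, swap in the name "LATIN SMALL LETTER C WITH CEDILLA" or "LATIN SMALL LETTER C WITH CARON". The helper name `describe` is made up.

```python
import unicodedata


def describe(name: str) -> tuple[str, str, str, str]:
    """Return (character, code point, UTF-8 bytes in hex, XML reference)."""
    char = unicodedata.lookup(name)  # raises KeyError for an unknown name
    return (
        char,
        f"U+{ord(char):04X}",
        char.encode("utf-8").hex(" "),
        f"&#x{ord(char):X};",
    )


for name in (
    "LATIN SMALL LETTER A WITH TILDE",
    "LATIN SMALL LETTER C WITH CIRCUMFLEX",  # assumption: the mark is a circumflex
):
    print(*describe(name))
# ã U+00E3 c3 a3 &#xE3;
# ĉ U+0109 c4 89 &#x109;

# A plain letter plus a combining mark can also be composed into one character.
a_tilde = unicodedata.normalize("NFC", "a\u0303")  # a + COMBINING TILDE
print(a_tilde, len(a_tilde))
# ã 1
```

The hex bytes are the UTF-8 encoding, so they only apply if the TXT file is saved as UTF-8.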