defining non english characters in an english epub

Greg Anos · 06-10-2021, 11:04 AM

I am doing a long project of scanning and converting to ePub 30 years of a specialty hobby journal (pro bono publico). I need to use the occasional non-English character (letter with tilde, umlaut, and French letter characters).

Is there a hex guide for defining these character sets (as hex strings)? (Like em dashes being defined as a 3 hex character string.)

Thanks.

Quoth · 06-10-2021, 11:52 AM

I use LibreOffice Writer. I've had no problem with any epub reader, or conversion to mobi or azw with any accented letter used in Europe-Scandinavia Iceland.
I just type them!
Some use the Linux Compose key.
The ONLY ones I can't type are the prime and double prime: ′ ″ so I use “Insert special character” or a charmap tool.

hyphen -
en –
em —
ellipsis …
áéíóú ÁÉÍÓÚ
àèìòù ÀÈÌÒÙ
üöëäï
ł Ł þ Þ ŧ Ŧ Ω ẃ Ẃ ý Ý ß § «»ç Ç “” ‘ ’ ñ Ñ ð Ð đ ª ŋ Ŋ ħ Ħ ĸ

You can copy & paste if you are on Windows or Mac. Try Spanish Keyboard layout on Windows. Above covers Scandinavian/Icelandic, Polish, French, German, Spanish, Irish etc. I may have missed a few.

Some greek letters will work. But not all, nor Cyrillic.
αβδ etc

Or are you looking for HTML entities?

I always edit using odt format with LO Writer, do an extra save As docx and use Calibre to make the epub.

I MIGHT occasionally edit CSS, but not needed if Styles in LO Writer used correctly. I never edit HTML now, not for over 15 years.

Jellby · 06-10-2021, 12:11 PM

You can always use Unicode codepoints. For example, suppose you want to input ÿ (y with diaeresis, a rare French letter not present in Quoth's list). First thing, you search any of a zillion resources for Unicode characters, e.g. https://www.compart.com/en/unicode/U+00FF. There are several ways to get the code/entity.

1. From the "U+00FF", this means the code is 00FF, which is a hexadecimal number. As with decimal numbers, leading zeros can be removed, so it's "FF". Just prepend &#x and append a semicolon and you're done: &#xFF; (case insensitive).

2. Convert the code to decimal, if it's not given in the page. In this case, since F = 15, we get 15*16+15 = 255. Now just prepend &# (note, no x here) and append a semicolon: &#255;.

3. Most pages will directly give you the HTML codes: &#255; &#xFF; &yuml;. Just avoid the "mnemonic" one (&yuml;), which is for true HTML files and might not be supported in ebook readers, and pick any of the numeric ones.
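The recipe in steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the function name `numeric_references` is made up here, and nothing in the thread suggests anyone used Python for this.

```python
def numeric_references(char: str) -> tuple[str, str]:
    """Return the hexadecimal and decimal character references for one character."""
    code_point = ord(char)  # 255 for U+00FF
    hex_ref = f"&#x{code_point:X};"  # prepend &#x, append a semicolon
    dec_ref = f"&#{code_point};"     # prepend &# (no x), append a semicolon
    return hex_ref, dec_ref


hex_ref, dec_ref = numeric_references("\u00ff")  # y with diaeresis
print(hex_ref, dec_ref)  # prints: &#xFF; &#255;
```

The same function works for any single character, for example the em dash at U+2014.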

Tex2002ans · 06-10-2021, 01:53 PM

Quote:

Originally Posted by Greg Anos

Is there a hex guide for defining these character sets (as hex strings)? (Like em dashes being defined as a 3 hex character string.)

Why, exactly, are you trying to use hex codes instead of just using the actual character?

In EPUB, the only special entity you have to worry about is the Non-Breaking Space (&nbsp; or &#160;).

Everything else can use the actual Unicode characters:

— = Em Dash

There's no need to clutter your code with an entity such as &#8212;.

Quote:

Originally Posted by Greg Anos

I am doing a long project of scanning and converting to ePub 30 years of a specialty hobby journal (pro bono publico).

Great. What tools are you using to scan + OCR?

Quote:

Originally Posted by Greg Anos

I need to use the occasional non-English character (letter with tilde, umlaut, and French letter characters).

Only because the OCR isn't recognizing these characters?

OCR outputs:

facade
ninos

but your actual article says:

façade
niños

Usually, if you enable the proper OCR languages, these accented characters will be recognized.

Side Note: I wrote a bit about OCR + German/Spanish/French accents in "Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman" (Post #5).

For example, English only recognizes A-Z while Spanish will recognize A-Z + a few more:

ÁÉÍÑÓÚÜáéíñóúü

So if you have a 98% English book with 2% Spanish names/words, you'd tell OCR this is an English AND Spanish book. This would catch all the little accents on the ñ and á and é.

Greg Anos · 06-10-2021, 06:30 PM

Quote:

Originally Posted by Tex2002ans

Why, exactly, are you trying to use hex codes instead of just using the actual character?

In EPUB, the only special entity you have to worry about is the Non-Breaking Space (&nbsp; or &#160;).

Everything else can use the actual Unicode characters:

— = Em Dash

There's no need to clutter your code with an entity such as &#8212;.

Great. What tools are you using to scan + OCR?

Only because the OCR isn't recognizing these characters?

OCR outputs:

facade
ninos

but your actual article says:

façade
niños

Usually, if you enable the proper OCR languages, these accented characters will be recognized.

Side Note: I wrote a bit about OCR + German/Spanish/French accents in "Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman" (Post #5).

For example, English only recognizes A-Z while Spanish will recognize A-Z + a few more:

ÁÉÍÑÓÚÜáéíñóúü

So if you have a 98% English book with 2% Spanish names/words, you'd tell OCR this is an English AND Spanish book. This would catch all the little accents on the ñ and á and é.

It's a case of "you can't get there from here" for me.

Let me fill you in on the background. I used to have an Optibook (scan?) dedicated scanner + ABBYY Finereader 9. It did reasonably well, until the scanner died. I refuse to spend more money on another dedicated scanner.

I currently have a Brother multi-function scanner/printer/copier hooked up. It has its own OCR software. I did some test shots, and it OCR'ed better than Finereader 9.

However -

I want fully reflowable print streams. I want the e-book reader to be able to change font size with no problems. That's where it gets real sticky.

To do fully reflowable text, there can be no hard line breaks (line feeds) other than paragraph breaks. Otherwise you get the "long line plus short line" effect when the text overflows the line length.

aaaaaaaaaaaa
bbbbbbbbbbbb

Becomes

aaaaaaaaaa
aa
bbbbbbbbbb
bb

Which is not acceptable. (to me, anyways - and I'M doing the work!)
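For plain-text OCR output, the line-feed cleanup described above does not need a hex editor. Here is a minimal Python sketch, assuming (and this is an assumption about the OCR output, not something stated in the thread) that paragraphs are separated by at least one blank line. The function name `unwrap_paragraphs` is hypothetical.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines inside each paragraph; keep paragraph breaks."""
    # Normalise Windows (CR LF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines mark a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    unwrapped = [
        " ".join(line.strip() for line in paragraph.split("\n") if line.strip())
        for paragraph in paragraphs
    ]
    return "\n\n".join(unwrapped)


sample = "aaaaaaaaaaaa\nbbbbbbbbbbbb\n\ncccccccccccc\ndddddddddddd\n"
print(unwrap_paragraphs(sample))
```

Words hyphenated across a line end are not rejoined by this sketch, so those would still need a manual pass.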

My OCR output choices are

TXT
RTF
HTML
XML

Which one would you rather use - to glue together 40 separate pages of text, and make them reflowable?

If you use TXT, goodbye to all your bold, italics, etc. BUT removing the line feed character is a piece of cake for a hex editor (which I have in a VirtualBox XP machine, which I run under Linux). No control characters at all, so LibreOffice has no problems with expanding and contracting font size (or type).

RTF is FULL of control characters, of ALL sorts. Yes, you can blow up the font size, but the text steps on itself, because there are other control characters that don't change, that control the line spacing. Play with it yourself to see what I mean. Ripping out the RTF control data is a royal pain in the patootie. I used to do it when I used Finereader 9. And yes, those line size definitions are carried over into ODT and DOC files; converting does not solve the problem.

HTML/XML has its own set of glue together problems. I am not fluent in HTML. I know just enough to limp along with it.

So, given my choices, I picked TXT and add back in the italics, etc. Since I am already in the hex editor anyway, adding Unicode characters wouldn't be that big of a bother. OTOH, I can add those other-language characters once the text is glued together. Yes, I'm using LibreOffice Writer. These characters are only used for place names, and for scientific journal citations, in journals that are not in English. They are usually in Portuguese or French. I needed the umlauts because some of the authors are German, with umlauts in their last names.

I'm still missing 2 characters: a "c" with a cap on it (think the letter V upside down, with the point of the v at the top, on the "c"), and an "a" with a tilde; you know, the squiggle line over the n in Spanish. (It's a Portuguese feature.)

This is a long haul, if you have better ways to do it, please let me know. (I've done 8 out of 180, so far.)

Tex2002ans · 06-10-2021, 09:36 PM

Quote:

Originally Posted by Greg Anos

Let me fill you in on the background.

Great, thanks for more background information.

It sounds like the core issue is your very first OCR step:

Your Brother OCR is outputting crappy text, so now you're trying to come up with a whole complicated workflow to try to correct THAT mess.

But it's like the solid foundation of a building.

If we adjust that very first step + do it properly, every step after will be easier.

* * *

To come up with a better workflow...

First, a few questions:

1. Do you have access to Microsoft Word?

2. Do you still have access to your old copy of Finereader 9?

Notes: If you have Microsoft Word, Toxaris's "EPUB Tools" makes DOCX/RTF OCR cleanup infinitely easier.

If you have Finereader, I can give Finereader-specific instructions.

If no Finereader, then I'd recommend Tesseract (Open Source OCR) instead of your Brother OCR program. (They probably rebranded/based theirs off an outdated version of Tesseract.)

Quote:

Originally Posted by Greg Anos

If you use TXT, goodbye to all your bold, italics, etc. BUT removing the line feed character is a piece of cake for a hex editor.

No, no, no. DO NOT go with TXT.

The underlying formatting (bold/italics/superscript) is just as important as the text itself.

Why? I've written about this in detail back in:

Side Note: Linefeeds are also very easy to remove in RTF, DOCX, TXT, etc. You can use "Advanced Search" or Regular Expressions.

Usually:

\r = Carriage Return
\n = Line Feed

I also use Regular Expressions like:

Search: </p>\s+<p>([a-z])

to search for paragraph breaks that start with a lowercase letter.

No need to go crazy with hex editing files in order to locate/eliminate this stuff.
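As a rough illustration, the search pattern above can be used from Python as well as from an editor's regex search. The pattern is the one from the post; the helper names and the automatic merge are additions made here for the example, and a merge like this should be reviewed by eye, since not every paragraph that starts with a lowercase letter is a false break.

```python
import re

# Pattern from the post: a paragraph break followed by a lowercase letter.
SUSPECT_BREAK = re.compile(r"</p>\s+<p>([a-z])")


def find_suspect_breaks(html: str) -> list[str]:
    """Return a short context snippet for each suspicious paragraph break."""
    return [html[max(0, m.start() - 20):m.end() + 20] for m in SUSPECT_BREAK.finditer(html)]


def merge_suspect_breaks(html: str) -> str:
    """Replace each suspicious break with a single space (hypothetical helper)."""
    return SUSPECT_BREAK.sub(r" \1", html)


page = "<p>The quick brown</p>\n<p>fox jumps.</p>\n<p>New paragraph.</p>"
print(find_suspect_breaks(page))
print(merge_suspect_breaks(page))
```

Listing the matches first and merging second keeps a human in the loop, which matters for OCR cleanup.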

Quote:

Originally Posted by Greg Anos

My OCR output choices are

TXT
RTF
HTML
XML

Which one would you rather use - to glue together 40 separate pages of text, and make them reflowable?

Out of that bad selection, RTF would most likely work better.

If you have access to Finereader 9, DOCX will be better.

Finereader 10 introduced EPUB output (which is what I used for many years, but now I swear by DOCX -> Toxaris's EPUB Tools).

If you don't have Finereader, then like I said above, probably best to use Tesseract. From there, you'd be able to output better/cleaner files.

Quote:

Originally Posted by Greg Anos

I'm still missing 2 characters: a "c" with a cap on it (think the letter V upside down, with the point of the v at the top, on the "c"), and an "a" with a tilde; you know, the squiggle line over the n in Spanish. (It's a Portuguese feature.)

The "upside down v" is called a Circumflex.

The "squiggle" is called a Tilde.

The "two dots" above is called an Umlaut (or Diaeresis).

(I linked to the fantastic Wikipedia articles on them, they give you nicely organized lists of the letters with accents!)

And here's those 3 characters you mentioned in Unicode:

ã = U+00E3 = LATIN SMALL LETTER A WITH TILDE
ĉ = U+0109 = LATIN SMALL LETTER C WITH CIRCUMFLEX
ö = U+00F6 = LATIN SMALL LETTER O WITH DIAERESIS
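If you would rather look these up locally than on a website, Python's standard `unicodedata` module knows the official names. A small sketch (the `describe` helper is just an illustrative name):

```python
import unicodedata


def describe(char: str) -> str:
    """Format a character as 'char = U+XXXX = OFFICIAL UNICODE NAME'."""
    return f"{char} = U+{ord(char):04X} = {unicodedata.name(char)}"


# a with tilde, c with circumflex, o with diaeresis
for char in "\u00e3\u0109\u00f6":
    print(describe(char))

# The reverse direction: from the official name back to the character.
print(unicodedata.lookup("LATIN SMALL LETTER C WITH CIRCUMFLEX"))
```

The loop prints the same three lines as the list above.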

Quote:

Originally Posted by Greg Anos

These characters are only used for place names, and for scientific journal citations, in journals that are not in English. They are usually in Portuguese or French. I needed the umlauts because some of the authors are German, with umlauts in their last names.

If you have Finereader, set the Document Language to:

Code:

English; French; German; Portuguese

If you're using Tesseract, do a similar thing.

DO NOT tell the documents they are "only English". When you enable these other languages, it allows the expanded Alphabets to be used. (As explained in that Fraktur thread I linked to in Post #4.)

One con of enabling other languages is:

- You may get slightly more OCR errors introduced. For example, an "o + two specks of dust" may be confused for 'ö'.

But the time saved from OCR getting accented characters right will easily outweigh the time spent manually retyping/correcting.
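For Tesseract, the equivalent of that language setting is the `-l` option, with language codes joined by `+`. The sketch below only builds the command line in Python; the file names are placeholders, and it assumes the `eng`, `fra`, `deu` and `por` language data are installed alongside Tesseract.

```python
def tesseract_command(image_path: str, output_base: str, languages: list[str]) -> list[str]:
    """Build a Tesseract command line that enables several OCR languages."""
    # Usage form: tesseract <image> <output base name> -l lang1+lang2+...
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Placeholder file names; English plus French, German and Portuguese.
command = tesseract_command("page_001.png", "page_001", ["eng", "fra", "deu", "por"])
print(" ".join(command))
# To actually run it: subprocess.run(command, check=True)
```

Printing the command first makes it easy to check the language list before processing a whole issue.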

Quote:

Originally Posted by Greg Anos

This is a long haul, if you have better ways to do it, please let me know. (I've done 8 out of 180, so far.)

You may also be interested in this recent topic posted by anonlivros:

Tutorial-from Paper Book to Ebook PDF - 400 pages in 4 hours

While his tutorial was dealing with how to take pictures to "scan" + clean a book...

I summarized A TON of my "cleanup images and get them OCRed into ebooks" knowledge in there too.

Lots of reading/learning, but I guarantee you'll save way more time in the long-run.

Sarmat89 · 06-10-2021, 10:45 PM

It is always better to use Finereader, as it allows interactive checking and dictionary control (albeit severely nerfed in the latest versions). Never use Tesseract, simple free bundled software, or other such batch crap.
Use RTF which can be edited in any editor, or HTML, which can be edited by regexes and easily styled, as the export format.

Tex2002ans · 06-10-2021, 10:55 PM

Quote:

Originally Posted by Sarmat89

[...] Finereader [...] allows interactive checking and dictionary control (albeit severely nerfed in the latest versions).

Interesting. What's happening with the dictionary?

The same stuff mentioned in "Import dictionary in Finereader" a few months ago?

I'm still back on Finereader 12.

Quote:

Originally Posted by Sarmat89

Never use Tesseract, simple free bundled software, or other such batch crap.

Tesseract is the best free/open-source OCR there is.

If you are on Linux (which OP is) and/or can't afford the expensive/proprietary (Finereader)... then it would have to do.

There are a few GUI frontends for Tesseract too. And while it won't be as nice as Finereader's side-by-side, you can get some comparison between original<->OCR.

But yeah, I mostly agree with avoiding bundled software. Like I mentioned, I bet "Brother's OCR program" is just a heavily outdated version of Tesseract.

Sarmat89 · 06-10-2021, 11:28 PM

Quote:

Originally Posted by Tex2002ans

Interesting. What's happening with the dictionary?

Somewhere between 10 and 12 they dropped the morphology support for user dictionaries. So if your language has 18 forms for a word like "Numenorean", you have to add them all manually (instead of 1 word only).

Quote:

Tesseract is the best free/open-source OCR there is.

And that is sad. It may have been commercial-grade software in 1995, but it is hopelessly outdated now.
In particular, the newest tesseract version does not support italics; the older versions do, but very very badly.

Quoth · 06-11-2021, 08:32 AM

I agree, RTF is always better than plain text. Then edit in Word or LO Writer.

And Finereader is good, some version of Tesseract is next best.

But it's years since I've done OCR. I still have the same Epson Perfection 1200 as maybe since 2002 or 2003 (on SCSI) but on Linux now instead of XP.
I've also a newish massive Brother MF duplex colour laser-copier-scanner, and I don't think there is any advantage with it for OCR. I only have the Brother drivers on the Linux Mint. All the applications are from Mint Distro, Mate version.

I gave up installing "bundled SW" about 15 years ago on Windows. Often outdated, cut down or simply poor versions of the real thing.

Quoth · 06-11-2021, 08:42 AM

Quote:

Originally Posted by Jellby

You can always use Unicode codepoints. For example, suppose you want to input ÿ (y with diaeresis, a rare French letter not present in Quoth's list). First thing, you search any of a zillion resources for Unicode characters, e.g. https://www.compart.com/en/unicode/U+00FF. There are several ways to get the code/entity.

Compose " y = ÿ
Compose ^ c = ĉ
etc.
I have Compose mapped to the Caps Lock key. Caps Lock itself is now both Shift keys pressed together; either Shift key cancels it.
The only non-standard thing I added was a table of Compose g <lowercase> and Compose G <uppercase> for the complete Greek alphabet. Obviously Cyrillic could be added simply. There is a standard layout, or you can use the transliterated keys.
I keep meaning to remap AltGr Z and Alt Gr X as they are < and > and would be more usefully prime and double prime. The AltGr z and Alt Gr x is « and ».

Greg Anos · 06-11-2021, 09:05 AM

One more time.

RTF - How do you remove the hard-coded vertical line spacing? I have had the same problem with optiscan and Finereader 9 as I have with my current "junk" setup.

You can't do it by converting to other formats in LibreOffice Writer. Without a fix for this problem, RTF is a non-starter for my use. I have to fall back on .txt and re-add the italics, etc., by hand.

I require the font size flexibility of going from 6 point to 24 point, with full legibility.

Greg Anos · 06-11-2021, 10:22 AM

As to my original question, here is the clear, simple answer.

https://www.lifewire.com/typing-char...-marks-1074113

It shows you how to go into the Character Map in Windows (I have virtual machines for both Win 7 and XP, both based on legitimate retail copies I own). The map has a very useful help screen. Just select the character you want, copy it into the buffer, and then insert the buffer into the document. I'm keeping a second .doc document with the characters I need (and commonly used words) and do a scrape and paste.

Jellby · 06-11-2021, 10:38 AM

Quote:

Originally Posted by Tex2002ans

In EPUB, the only special entity you have to worry about is the Non-Breaking Space (&nbsp; or &#160;).

And not even that. You can input that character directly too (the problem is in your input method[*] and code-editing software, not in the EPUB/XML format).

Actually, the only entities one should really need are &lt; &amp; &gt;, because their characters (< & >) would be XML code otherwise.

[*] In vim, I have set two shortcuts (F7 and Shift+F7) to convert from and to entities using recode. That way I can enter &hellip;, press Shift+F7 and have it converted to …, no need to copy-paste, learn codes (other than the mnemonic entities), or change the system configuration. I can also copy-paste some fancy text (e.g. "Tiếng Việt") and get its character "decomposition" with F7 (Ti&#7871;ng Vi&#7879;t).
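The same round trip can be approximated without vim or recode. This is a sketch using only Python's standard library, not a description of Jellby's setup; the three helper names are invented for the example.

```python
import html
from xml.sax.saxutils import escape


def to_characters(text: str) -> str:
    """Turn named and numeric references into the actual characters."""
    return html.unescape(text)


def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def escape_markup(text: str) -> str:
    """Escape only the three characters XML treats as markup: & < >."""
    return escape(text)


print(to_characters("&hellip;"))                         # the ellipsis character
print(to_numeric_references("Ti\u1ebfng Vi\u1ec7t"))     # Ti&#7871;ng Vi&#7879;t
print(escape_markup("a < b & c > d"))                    # a &lt; b &amp; c &gt; d
```

Note that `to_characters` also turns &lt; and &amp; back into raw markup characters, so it is meant for snippets of text, not for a whole XHTML file.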

Quoth · 06-11-2021, 11:23 AM

Quote:

Originally Posted by Greg Anos

One more time.

RTF - How do you remove the hard-coded vertical line spacing? I have had the same problem with optiscan and Finereader 9 as I have with my current "junk" setup.

Edit the style!
There is no hardcoded RTF linespacing that can't easily be overridden in Word or Writer.
Import or open RTF
Save As (doc, docx or odt depending on if older Word, newer Word or LO Writer).
Re-open saved document.
Edit style used for the default body text.
Find headings and give them suitable styles.

Also removing hard line breaks and only having paragraph breaks is trivial.

If using writer, do an EXTRA saveAS for Calibre to convert Docx to epub.

I use RTF any time I need to reformat evil formatted sources.


