06-10-2021, 11:04 AM | #1 |
Grand Sorcerer
Posts: 11,248
Karma: 35000000
Join Date: Jan 2008
Device: Pocketbook
|
defining non english characters in an english epub
I am doing a long project of scanning and converting to ePub 30 years of a specialty hobby journal (pro bono publico). I need to use the occasional non-English character (letters with a tilde or umlaut, and French accented letters).
Is there a hex guide for defining these characters (as hex strings)? (Like em dashes being defined as a 3 hex character string.) Thanks. |
06-10-2021, 11:52 AM | #2 |
the rook, bossing Never.
Posts: 11,158
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
I use LibreOffice Writer. I've had no problem with any epub reader, or conversion to mobi or azw with any accented letter used in Europe-Scandinavia Iceland.
I just type them! Some use the Linux Compose key. The ONLY ones I can't type are the prime and double prime: ′ ″ so I use “Insert special character” or a charmap tool.
hyphen - en – em — ellipsis …
áéíóú ÁÉÍÓÚ àèìòù ÀÈÌÒÙ üöëäï ł Ł þ Þ ŧ Ŧ Ω ẃ Ẃ ý Ý ß § «»ç Ç “” ‘ ’ ñ Ñ ð Ð đ ª ŋ Ŋ ħ Ħ ĸ
You can copy & paste if you are on Windows or Mac. Try the Spanish keyboard layout on Windows. The above covers Scandinavian/Icelandic, Polish, French, German, Spanish, Irish etc. I may have missed a few. Some Greek letters will work, but not all, nor Cyrillic. αβδ etc.
Or are you looking for HTML entities?
I always edit in odt format with LO Writer, do an extra Save As docx, and use Calibre to make the epub. I MIGHT occasionally edit CSS, but that's not needed if Styles in LO Writer are used correctly. I never edit HTML now, and haven't for over 15 years. Last edited by Quoth; 06-10-2021 at 11:54 AM. |
06-10-2021, 12:11 PM | #3 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
You can always use Unicode codepoints. For example, suppose you want to input ÿ (y with diaeresis, a rare French letter not present in Quoth's list). First thing, you search any of a zillion resources for Unicode characters, e.g. https://www.compart.com/en/unicode/U+00FF. There are several ways to get the code/entity.
1. From the "U+00FF", this means the code is 00FF, which is a hexadecimal number. As with decimal numbers, leading zeros can be removed, so it's "FF"; just prepend &#x and append a semicolon and you're done: &#xFF; (case insensitive).
2. Convert the code to decimal, if it's not given in the page. In this case, since F = 15, we get 15*16+15 = 255. Now just prepend &# (note, no x here) and append a semicolon: &#255;.
3. Most pages will directly give you the HTML codes: &#255; &#xFF; &yuml;. Just avoid the "mnemonic" one (&yuml;), which is for true HTML files and might not be supported in ebook readers, and pick either of the numeric ones. |
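The arithmetic in steps 1 and 2 can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the function name `describe_character` is made up for illustration and is not part of any tool mentioned in this thread.

```python
import unicodedata


def describe_character(char: str) -> dict:
    """Return the code point, official Unicode name, and both numeric
    character references (hex and decimal) for a single character."""
    code_point = ord(char)  # e.g. 255 for U+00FF
    return {
        "code_point": f"U+{code_point:04X}",
        "name": unicodedata.name(char),
        "hex_reference": f"&#x{code_point:X};",   # step 1: prepend &#x, append ;
        "decimal_reference": f"&#{code_point};",  # step 2: prepend &#, append ;
    }


# A few of the accented letters discussed in this thread.
for char in "\u00ff\u00e3\u0109\u00f6":
    info = describe_character(char)
    print(char, info["code_point"], info["name"],
          info["hex_reference"], info["decimal_reference"])
```

For ÿ this gives U+00FF, &#xFF; and &#255;, matching the hand calculation above.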
06-10-2021, 01:53 PM | #4 | |||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
In EPUB, the only special entity you have to worry about is the Non-Breaking Space (&#160; or &#xA0;). Everything else can use the actual Unicode characters: — = Em Dash. There's no need to clutter your code with &#8212;. Quote:
Quote:
OCR outputs:
but your actual article says:
Usually, if you enable the proper OCR languages, these accented characters will be recognized. Side Note: I wrote a bit about OCR + German/Spanish/French accents in "Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman" (Post #5). For example, English only recognizes A-Z while Spanish will recognize A-Z + a few more:
So if you have a 98% English book with 2% Spanish names/words, you'd tell OCR this is an English AND Spanish book. This would catch all the little accents on the ñ and á and é. Last edited by Tex2002ans; 06-10-2021 at 01:59 PM. |
|||
06-10-2021, 06:30 PM | #5 | |
Grand Sorcerer
Posts: 11,248
Karma: 35000000
Join Date: Jan 2008
Device: Pocketbook
|
Quote:
Let me fill you in on the background. I used to have an Optibook (scan?) dedicated scanner + ABBYY Finereader 9. It did reasonably well, until the scanner died. I refuse to spend more money on another dedicated scanner. I currently have a Brother multi-function scanner/printer/copier hooked up. It has its own OCR software. I did some test shots, and it OCR'ed better than Finereader 9.

However, I want fully reflowable print streams. I want the e-book reader to be able to change font size with no problems. That's where it gets real sticky. To do fully reflowable text, there can be no hard line breaks (line feeds) other than paragraph breaks. Otherwise you get the "long line plus short line" pattern when the larger font overflows the line length.

aaaaaaaaaaaa
bbbbbbbbbbbb

becomes

aaaaaaaaaa
aa
bbbbbbbbbb
bb

which is not acceptable (to me, anyways, and I'M doing the work!).

My OCR output choices are:
TXT
RTF
HTML
XML

Which one would you rather use to glue together 40 separate pages of text and make them reflowable?

If you use TXT, goodbye to all your bold, italics, etc. BUT removing the line feed character is a piece of cake with a hex editor (which I have in a VirtualBox XP machine, which I run under Linux). No control characters at all, so LibreOffice has no problems with expanding and contracting font size (or type).

RTF is FULL of control characters, of ALL sorts. Yes, you can blow up the font size, but the text steps on itself, because there are other control characters that don't change, which control the line spacing. Play with it yourself to see what I mean. Ripping out the RTF control data is a royal pain in the patootie; I used to do it when I used Finereader 9. And yes, those line size definitions are carried over into ODT and DOC files, so converting does not solve the problem.

HTML/XML has its own set of glue-together problems. I am not fluent in HTML. I know just enough to limp along with it.

So, given my choices, I picked TXT and add back in the italics, etc.
Since I am already in the hex editor anyways, adding Unicode characters wouldn't be that big of a bother. OTOH, I can add those other-language characters once the text is glued together. Yes, I'm using LibreOffice Writer. These characters are only used for place names and for citations of scientific journals that are not in English, usually Portuguese or French. I needed the umlauts because some of the authors are German, with umlauts in their last names. I'm still missing 2 characters: a "c" with a cap on it (think of an upside-down letter V, with the point at the top, on the "c"), and an "a" with a tilde; you know, the squiggly line over the n in Spanish (it's a Portuguese feature). This is a long haul; if you have better ways to do it, please let me know. (I've done 8 out of 180, so far.) |
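For the TXT route described in this post, removing the hard line feeds does not strictly need a hex editor. The following is a rough Python sketch, assuming the OCR text separates paragraphs with at least one blank line (that assumption may not hold for every OCR program); the function name `unwrap_hard_breaks` is hypothetical. It does not rejoin words hyphenated across line ends.

```python
import re


def unwrap_hard_breaks(text: str) -> str:
    """Join the lines inside each paragraph with a single space.

    Assumption: one or more blank lines mark a paragraph break;
    every other line break is a hard break to be removed.
    """
    # Normalise Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = [
        " ".join(line.strip() for line in paragraph.split("\n") if line.strip())
        for paragraph in paragraphs
    ]
    # Keep exactly one blank line between paragraphs.
    return "\n\n".join(joined)


sample = "aaaaaaaaaaaa\nbbbbbbbbbbbb\n\nNext paragraph,\r\nalso hard-wrapped.\n"
print(unwrap_hard_breaks(sample))
```

Each paragraph comes out as one long line, so the reader software is free to reflow it at any font size.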
|
06-10-2021, 09:36 PM | #6 | |||||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Great, thanks for more background information.
It sounds like the core issue is your very first OCR step: your Brother OCR is outputting crappy text, so now you're trying to come up with a whole complicated workflow to correct THAT mess. But it's like the foundation of a building: if we adjust that very first step and do it properly, every step after will be easier.
* * *
To come up with a better workflow, first, a few questions:
1. Do you have access to Microsoft Word?
2. Do you still have access to your old copy of Finereader 9?
Notes:
If you have Microsoft Word, Toxaris's "EPUB Tools" makes DOCX/RTF OCR cleanup infinitely easier.
If you have Finereader, I can give Finereader-specific instructions.
If no Finereader, then I'd recommend Tesseract (Open Source OCR) instead of your Brother OCR program. (They probably rebranded/based theirs off an outdated version of Tesseract.)
Quote:
The underlying formatting (bold/italics/superscript) is just as important as the text itself. Why? I've written about this in detail back in:
Side Note: Linefeeds are also very easy to remove in RTF, DOCX, TXT, etc. You can use "Advanced Search" or Regular Expressions. Usually:
\r = Carriage Return
\n = Line Feed
I also use Regular Expressions like:
Search: </p>\s+<p>([a-z])
to search for paragraph breaks where the next paragraph starts with a lowercase letter. No need to go crazy with hex editing files in order to locate/eliminate this stuff.
Quote:
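As an illustration of the search pattern above, here is a small Python sketch. The search expression is the one from the post; the replacement (a space followed by the captured letter) is an assumption about what one would usually want, and the function name is made up. In practice each match should be reviewed rather than replaced blindly, since some paragraphs legitimately start with a lowercase letter.

```python
import re

# Search pattern from the post: a paragraph end, whitespace, then a new
# paragraph that starts with a lowercase letter.
SPLIT_PARAGRAPH = re.compile(r"</p>\s+<p>([a-z])")


def join_split_paragraphs(html: str) -> str:
    """Merge a paragraph into the previous one when it starts lowercase.

    The replacement string is an assumed choice: it drops the tags and
    keeps the captured letter after a single space.
    """
    return SPLIT_PARAGRAPH.sub(r" \1", html)


page = "<p>The specimen was</p>\n<p>found in Brazil.</p>\n<p>Next paragraph.</p>"
print(join_split_paragraphs(page))
```

The first two paragraphs are merged because the second starts with "f"; the third is left alone because it starts with a capital letter.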
If you have access to Finereader 9, DOCX will be better. Finereader 10 introduced EPUB output (which is what I used for many years, but now I swear by DOCX -> Toxaris's EPUB Tools). If you don't have Finereader, then like I said above, probably best to use Tesseract. From there, you'd be able to output better/cleaner files. Quote:
The "squiggle" is called a Tilde. The "two dots" above is called an Umlaut (or Diaeresis). (I linked to the fantastic Wikipedia articles on them, they give you nicely organized lists of the letters with accents!) And here's those 3 characters you mentioned in Unicode: ã = U+00E3 = LATIN SMALL LETTER A WITH TILDE ĉ = U+0109 = LATIN SMALL LETTER C WITH CIRCUMFLEX ö = U+00F6 = LATIN SMALL LETTER O WITH DIAERESIS Quote:
Code:
English; French; German; Portuguese DO NOT tell the documents they are "only English". When you enable these other languages, it allows the expanded Alphabets to be used. (As explained in that Fraktur thread I linked to in Post #4.) One con from enabling other languages is:
but the time saved from OCR getting accented characters right will easily outweigh the time spent manually retyping/correcting. Quote:
While his tutorial was dealing with how to take pictures to "scan" + clean a book... I summarized A TON of my "cleanup images and get them OCRed into ebooks" knowledge in there too. Lots of reading/learning, but I guarantee you'll save way more time in the long-run. Last edited by Tex2002ans; 06-10-2021 at 10:42 PM. |
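If the Tesseract route mentioned earlier in this post is taken, enabling several recognition languages at once is done with its `-l` option, joining language codes with `+`. The sketch below only builds the command line and does not run it. It assumes Tesseract and the eng, fra, deu and por language data are installed, and the file names are placeholders.

```python
import shlex


def tesseract_command(image_path: str, output_base: str, languages: list) -> list:
    """Build a Tesseract command line that enables several languages.

    Tesseract's usage is: tesseract IMAGE OUTPUTBASE -l LANG[+LANG...]
    The result is written to OUTPUTBASE.txt when the command is run.
    """
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Placeholder file names; English, French, German and Portuguese enabled.
cmd = tesseract_command("page_001.png", "page_001", ["eng", "fra", "deu", "por"])
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Listing the main language first is a common habit, but how much the order matters is something to test on your own scans.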
|||||
06-10-2021, 10:45 PM | #7 |
Evangelist
Posts: 482
Karma: 2267928
Join Date: Nov 2015
Device: none
|
It is always better to use Finereader, as it allows interactive checking and dictionary control (albeit severely nerfed in the latest versions). Never use Tesseract, simple free bundled software, or other such batch crap.
Use RTF which can be edited in any editor, or HTML, which can be edited by regexes and easily styled, as the export format. |
06-10-2021, 10:55 PM | #8 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
The same stuff mentioned in "Import dictionary in Finereader" a few months ago? I'm still back on Finereader 12. Quote:
If you are on Linux (which OP is) and/or can't afford the expensive/proprietary (Finereader)... then it would have to do. There are a few GUI frontends for Tesseract too. And while it won't be as nice as Finereader's side-by-side, you can get some comparison between original<->OCR. But yeah, I mostly agree with avoiding bundled software. Like I mentioned, I bet "Brother's OCR program" is just a heavily outdated version of Tesseract. Last edited by Tex2002ans; 06-10-2021 at 10:57 PM. |
||
06-10-2021, 11:28 PM | #9 | |
Evangelist
Posts: 482
Karma: 2267928
Join Date: Nov 2015
Device: none
|
Somewhere between 10 and 12 they dropped the morphology support for user dictionaries. So if your language has 18 forms for a word like "Numenorean", you have to add them all manually (instead of 1 word only).
Quote:
In particular, the newest tesseract version does not support italics; the older versions do, but very very badly. |
|
06-11-2021, 08:32 AM | #10 |
the rook, bossing Never.
Posts: 11,158
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
I agree, RTF is always better than plain text. Then edit in Word or LO Writer.
And Finereader is good, some version of Tesseract is next best. But it's years since I've done OCR. I still have the same Epson Perfection 1200 as maybe since 2002 or 2003 (on SCSI) but on Linux now instead of XP. I've also a newish massive Brother MF duplex colour laser-copier-scanner, and I don't think there is any advantage with it for OCR. I only have the Brother drivers on the Linux Mint. All the applications are from Mint Distro, Mate version. I gave up installing "bundled SW" about 15 years ago on Windows. Often outdated, cut down or simply poor versions of the real thing. |
06-11-2021, 08:42 AM | #11 | |
the rook, bossing Never.
Posts: 11,158
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Quote:
Compose ^ c = ĉ etc. I have Compose mapped to the Caps Lock, CAPS LOCK is both shift, either cancels. The only non-standard thing I added was a table of Compose g <lowercase> and Compose G <uppercase> for the complete Greek alphabet. Obviously Cyrillic could be added simply. There is a standard layout, or you can use the transliterated keys. I keep meaning to remap AltGr Z and Alt Gr X as they are < and > and would be more usefully prime and double prime. The AltGr z and Alt Gr x is « and ». |
|
06-11-2021, 09:05 AM | #12 |
Grand Sorcerer
Posts: 11,248
Karma: 35000000
Join Date: Jan 2008
Device: Pocketbook
|
One more time.
RTF - How do you remove the hard-coded vertical line spacing? I had the same problem with optiscan and Finereader 9 as I have with my current "junk" setup. You can't do it by converting to other formats in LibreOffice Writer. Without a fix for this problem, RTF is a non-starter for my use. I have to fall back on .txt and re-add the italics, etc., by hand. I require the font size flexibility of going from 6 point to 24 point, with full legibility. |
06-11-2021, 10:22 AM | #13 |
Grand Sorcerer
Posts: 11,248
Karma: 35000000
Join Date: Jan 2008
Device: Pocketbook
|
As to my original question, here is the clear, simple answer.
https://www.lifewire.com/typing-char...-marks-1074113 It shows you how to go into the character map in Windows (I have virtual machines for both Win 7 and XP, both based on legitimate retail copies I own). The map has a very useful help screen. Just select the character you want, copy it into the buffer, and then insert the buffer into the document. I'm keeping a second .doc document with the characters I need (and commonly used words) and doing a scrape and paste. |
06-11-2021, 10:38 AM | #14 | |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Quote:
Actually, the only entities one should really need are &lt; &amp; &gt;, because their characters (< & >) would be XML code otherwise. [*] In vim, I have set two shortcuts (F7 and Shift+F7) to convert from and to entities using recode. That way I can enter &hellip;, press Shift+F7 and have it converted to …, no need to copy-paste, learn codes (other than the mnemonic entities), or change the system configuration. I can also copy-paste some fancy text (e.g. "Tiếng Việt") and get its character "decomposition" with F7 (Ti&#7871;ng Vi&#7879;t). |
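The same round trip can be approximated without vim or recode. This is a Python standard-library stand-in, not the setup described above, and the function names are hypothetical. Note that `html.unescape` also turns &lt; and &amp; back into raw characters, so it should not be run over markup where those must stay escaped.

```python
import html


def entities_to_chars(text: str) -> str:
    """Replace named and numeric references with the actual characters."""
    return html.unescape(text)


def chars_to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def escape_markup_chars(text: str) -> str:
    """Escape only the characters that would otherwise be read as XML."""
    return html.escape(text, quote=False)


print(entities_to_chars("&hellip;"))                     # the single ellipsis character
print(chars_to_numeric_refs("Ti\u1ebfng Vi\u1ec7t"))     # Ti&#7871;ng Vi&#7879;t
print(escape_markup_chars("a < b & c > d"))              # a &lt; b &amp; c &gt; d
```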
|
06-11-2021, 11:23 AM | #15 | |
the rook, bossing Never.
Posts: 11,158
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Quote:
There is no hardcoded RTF linespacing that can't easily be overridden in Word or Writer.
1. Import or open the RTF.
2. Save As (doc, docx or odt, depending on whether you use older Word, newer Word or LO Writer).
3. Re-open the saved document.
4. Edit the style used for the default body text.
5. Find headings and give them suitable styles.
Also, removing hard line breaks and only having paragraph breaks is trivial. If using Writer, do an EXTRA Save As docx for Calibre to convert to epub. I use RTF any time I need to reformat evil formatted sources. Last edited by Quoth; 06-11-2021 at 11:26 AM. |
|
|