, or for italics ? - Page 17

Tex2002ans · 07-20-2022, 10:54 PM

Quote:

Originally Posted by bookman156

Fantastic, that's useful.

No problem.

Quote:

Originally Posted by bookman156

By the way, if you set Chinese with English in books with InDesign here's a fantastic technique that's good to know about. It automatically finds the Chinese and puts it in your desired Chinese font using GREP (I prefer 'Noto Sans CJK TC' for traditional Chinese):

https://www.scammell.co.uk/2012/04/2...dobe-indesign/

Thanks.

In Sigil/Calibre, you can use regex to look for "Unicode properties":

Code:

\p{Han}
\p{Greek}

1st would look for Chinese characters, 2nd matches Greek characters.

That's a super advanced regex thing though.

* * *

Finding All Chinese Characters in Sigil/Calibre

You could use this regex:

Search: (\p{Han}+)
Replace: <span lang="zh">\1</span>

and that would tag all Chinese words in a single shot.

zh-Hans = Chinese (Simplified)
zh-Hant = Chinese (Traditional)
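
If you ever need the same thing outside Sigil/Calibre, here is a rough Python sketch. Python's built-in `re` module does not understand `\p{Han}`, so this version approximates it with a few explicit code point ranges. That list is my assumption for illustration and is not complete. The function name `tag_han_runs` and the default `zh-Hant` are hypothetical too.

```python
import re

# Approximation of \p{Han}: a few main CJK ideograph blocks.
# This list is an assumption for illustration and is NOT complete.
HAN_RUN = re.compile(
    "["
    "\u3400-\u4DBF"          # CJK Unified Ideographs Extension A
    "\u4E00-\u9FFF"          # CJK Unified Ideographs
    "\uF900-\uFAFF"          # CJK Compatibility Ideographs
    "\U00020000-\U0002A6DF"  # CJK Unified Ideographs Extension B
    "]+"
)

def tag_han_runs(text, lang="zh-Hant"):
    """Wrap each run of Han characters in a span carrying a language tag."""
    return HAN_RUN.sub(
        lambda m: f'<span lang="{lang}" xml:lang="{lang}">{m.group(0)}</span>',
        text,
    )

print(tag_han_runs("was born in 1857 at Liuhe 六合 county"))
```

You still have to pick the right tag yourself (zh-Hans, zh-Hant, or ja for Japanese names). The regex only finds the characters, it cannot tell which language they belong to.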

- - -

Side Note: In any ebook, I mostly just use Sigil's Tools > Reports > Characters in HTML Files and look for anything suspicious.

Definitely a good habit to get into, because so many times authors sneak garbage in, like:

... and all sorts of other crap, especially when they paste from other documents or online sources.

That report allows you to just see every single character used in the ebook, at a glance.
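
Sigil builds that report for you, but the idea is simple enough to sketch in Python for a plain string. This is only an illustration: `character_report` is a hypothetical name, the sample text is made up, and it works on raw text rather than parsing the HTML.

```python
from collections import Counter
import unicodedata

def character_report(text):
    """Return (char, code point, Unicode name, count) rows, sorted by code point."""
    counts = Counter(text)
    rows = []
    for ch in sorted(counts):
        name = unicodedata.name(ch, "<unnamed>")  # control chars have no name
        rows.append((ch, f"U+{ord(ch):04X}", name, counts[ch]))
    return rows

# Made-up sample hiding a no-break space and a zero-width space.
sample = "Liu\u00a0E \u2013 劉鐵雲\u200b"
for ch, code, name, count in character_report(sample):
    print(f"{code}  {name:<28} x{count}  {ch!r}")
```

Invisible characters like the no-break space stand out immediately once they are listed by name.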

- - -

* * *

Fonts are one of the biggest reasons why I began tagging languages.

A few books I worked on had a handful of Polytonic Greek words, so it was easy to spot/tag them + they're one of the rare use-cases where you may want to embed a font.

(Most fonts have basic Greek letters, but not all the ones with the crazy accents on them.)

That's what started the snowball rolling.

And here we are all these years later. Now I'm a fiend!

I must admit, tagging languages is still a pain, and not THAT beneficial yet. (Sucks up lots of time with minimal benefit.) But, I do it on the easy stuff:

Chapters
- British article in an otherwise American journal.
Entire paragraph/blockquotes
Poems/Lyrics
Handful of words in completely different alphabets
- Greek, Chinese/Japanese, etc.
- (Very easy to apply fonts + subset them.)

Trying to tag everything down to the word-level is... yeah... I described the real-life problems with that in the Reddit thread.

Someone was complaining about LibreOffice "should just automatically detect/tag words... like Google Translate!!!". I believe I put him in his place lol.

The way LibreOffice handles it though is pretty genius, and I never even thought of it:

If your keyboard changes between languages, the cursor will swap languages too.

For those who swap between multiple languages while typing their documents, that sounds like a pretty decent compromise to me.

* * *

Do you speak or read Chinese?

If so, I may still need your help in that 2020 "Should Chinese Fonts be Embedded?" thread.

(That project has been sitting dormant for a few years, and since I don't read/write Chinese...)

Did I tag the Simplified/Traditional Chinese characters properly?

Does my sample EPUB characters match the PDF? (Or did the person who created the direct formatting in the DOC potentially botch it up? Because they did have the foreign characters marked as "French", so I don't trust them!!! lol.)

bookman156 · 07-20-2022, 11:23 PM

Quote:

and that would tag all Chinese words in a single shot.

Wow, I was wondering whether there was a good way to do that. But presumably it would take each character of a phrase rather than the whole phrase.

Quote:

Do you speak or read Chinese?

I read traditional Chinese, with some help from a dictionary or the Wenlin program, which I use for typesetting Chinese.

Quote:

Did I tag the Simplified/Traditional Chinese characters properly?

Here, I did notice this:

Code:

Liu E, also known as Liu Tieyun <span class="japanese" lang="ja" xml:lang="ja">劉鐵雲</span>, was born in 1857 at Liuhe <span class="japanese" lang="ja" xml:lang="ja">六合</span> county in what is today Nanjing <span class="japanese" lang="ja" xml:lang="ja">南京</span>.

you're tagging traditional Chinese as Japanese.

As a general rule, an English paper about ancient China or history with Chinese characters will be traditional Chinese.

Quote:

Does my sample EPUB characters match the PDF?

I looked at the EPUB, you're tagging it as Hans when it should be Hant. Sometimes simple and traditional characters are the same, but I didn't check them. It all looks like traditional Chinese to me. For instance take li 禮, which you have in there, that's traditional, the simple is 礼. So you're getting traditional Chinese even though you've tagged it as simple. That's because you're using the right Unicode character. Certainly the Chinese that I looked at was correct. Oddly enough the paper was on a subject I have written on myself, so that was kinda interesting.
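
A quick way to see that the two forms of li really are different characters, sketched in Python (just an illustration, the variable names are mine):

```python
traditional = "禮"
simplified = "礼"

# Print each form with its Unicode code point.
for label, ch in (("traditional", traditional), ("simplified", simplified)):
    print(f"{label}: {ch} = U+{ord(ch):04X}")

# Different code points, so a lang attribute alone cannot turn one into the other.
print(traditional == simplified)
```

The lang tag mainly affects things like font selection. It does not change which character is stored in the file.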

bookman156 · 07-20-2022, 11:35 PM

Actually you've tagged a lot of traditional Chinese as Japanese. Is there actually any Japanese in there? Of course Japanese shares some Chinese characters, kanji.

bookman156 · 07-20-2022, 11:37 PM

Okay, I looked all the way through. There isn't any Japanese, it's all traditional Chinese. So that should facilitate finishing your project off.

bookman156 · 07-20-2022, 11:49 PM

No, there is one little bit of Japanese:

Quote:

“No” to ieru Ajia 「No」と言えるアジア [The Asia That Says ‘No’]

I didn't go into detail, but noticed:

Quote:

bodyguard (baopiao 保鏢)

The pinyin should be baobiao.

bookman156 · 07-21-2022, 12:08 AM

In this:

Quote:

Japanese intellectual Okakura Kakuzō 岡倉覚三 (1863–1913)

This is kanji, traditional Chinese characters in Japanese, so Japanese is right. Japanese names would be the only exception then to it being traditional Chinese.

Tex2002ans · 07-21-2022, 01:20 AM

Quote:

Originally Posted by bookman156

Wow, I was wondering whether there was a good way to do that. But presumably it would take each character of a phrase rather than the whole phrase.

The '+' sign in regex means "ONE OR MORE of the previous thing".

So the regex I gave would:

Find "ONE OR MORE Chinese characters in a row."
When you replace it, "Wrap the entire chunk in a <span>".
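
Here is a small Python sketch of the difference the '+' makes. Python's built-in `re` has no `\p{Han}`, so I'm assuming the basic CJK Unified Ideographs block as a stand-in. The sample sentence is adapted from the one quoted earlier.

```python
import re

text = "Liu Tieyun 劉鐵雲 was born at Liuhe 六合 county."

# Basic CJK Unified Ideographs block only; a stand-in for \p{Han}.
one_char = re.compile("[\u4E00-\u9FFF]")
one_or_more = re.compile("[\u4E00-\u9FFF]+")

print(one_char.findall(text))     # one match per character
print(one_or_more.findall(text))  # one match per run of characters
```

Without the '+', every character would get its own span. With it, each phrase is wrapped once.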

Very similar to what was written in that Adobe InDesign GREP method in that article of yours:

Code:

[\x{2E80}-\x{9FBB}\x{3000}-\x{303F}\x{FF01}-\x{FF60}]+

There are 3 major parts. I'll explain the easy ones first.

The Easy Parts

Brackets and the plus sign are special symbols in regular expressions!

[]
- = "Look for a single character that matches what's in between the open/close brackets."
+
- = "Look for ONE OR MORE of the previous thing."

The "Hard" Parts

All of that \x{} stuff is a (scary-looking) way to search for specific Unicode characters.

For example, if you wrote a simple:

[a-z]
- In English, that means "Look for any character BETWEEN 'a' and 'z'."
  - So it would match 'a' OR 'b' OR 'c' OR ... OR 'y' OR 'z'.

Each letter in Unicode gets assigned numbers:

a = 0061
b = 0062
c = 0063
d = 0064
[...]
z = 007A
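
You can check those numbers yourself. A minimal Python sketch (the helper name `in_range` is hypothetical) that prints the code points and shows that a range like [a-z] is really just a comparison of numbers:

```python
# Print the code point of a few letters in the same hex form used above.
for ch in "abcdz":
    print(f"{ch} = {ord(ch):04X}")

def in_range(ch, low, high):
    """True if ch's code point lies between low and high, inclusive."""
    return ord(low) <= ord(ch) <= ord(high)

print(in_range("m", "a", "z"))  # 'm' sits between 'a' and 'z'
print(in_range("M", "a", "z"))  # uppercase 'M' has a smaller number
```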

So when you write:

[\x{2E80}-\x{9FBB}]

it's saying:

"Look for any character between:

⺀ = 2E80
[...]
龻 = 9FBB

So everything in that GREP, sandwiched in between those brackets, is just a big long list of:

"Everything between these characters/numbers to those characters/numbers."

Sticking It All Together

[ = "Look for any of the characters that are between the open/close brackets."
- \x{2E80}-\x{9FBB}
- = "any character between 2E80 and 9FBB."
- \x{3000}-\x{303F}
- = "any character between 3000 and 303F."
- \x{FF01}-\x{FF60}
- = "any character between FF01 and FF60."
] = "That's the end of my giant list of characters."
+ = "Okay, now keep looking for ONE OR MORE of the characters in that giant list."
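
For comparison, a sketch of the same three ranges in Python, where `\x{2E80}` is written as `\u2E80`. The sample string is made up. It includes a fullwidth comma and corner brackets so you can see the 2nd and 3rd ranges doing their job.

```python
import re

# Same three ranges as the InDesign GREP; Python writes \x{2E80} as \u2E80.
CJK_GREP = re.compile("[\u2E80-\u9FBB\u3000-\u303F\uFF01-\uFF60]+")

# \uFF0C = fullwidth comma, \u300C and \u300D = corner brackets.
sample = "Liu Tieyun 劉鐵雲\uFF0Cborn at \u300C六合\u300D."
print(CJK_GREP.findall(sample))
```

The CJK punctuation gets swept up together with the characters next to it, which is what you want when the goal is to apply a font.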

(I assume that's just a lot of codes for Chinese characters. Unsure how up-to-date or accurate it is though. Unicode, each year, is always getting updates/additions/revisions. Since that article was written in 2012, we've gone from Unicode 6.1 -> Unicode 14.0.)

Quote:

Originally Posted by bookman156

Here, I did notice this:

[...]

Thanks. I'll definitely have to go back and adjust my latest EPUB based on your notes.

I might be able to get to redoing that EPUB this weekend. (If not, then in a few weeks.)

Let's continue that discussion in a Private Message if needed.

Quote:

Originally Posted by bookman156

[...] you're tagging traditional Chinese as Japanese.

Yeah, I guessed the language based on the original font information from the DOC.

(In Post #3 of that thread, I explained which CJK Microsoft fonts seemed to be assigned to which language in legacy DOC documents.)

Quote:

Originally Posted by bookman156

Oddly enough the paper was on a subject I have written on myself, so that was kinda interesting.

Cool!

Yeah, I haven't read through those documents yet. I finished about 99%+ of the conversion, and was doing finishing touches before tossing it on my device to proofread.

Side Note: And, my friend, you have got to combine your posts instead of 100 little mini-ones. lol.

bookman156 · 07-21-2022, 01:50 AM

Quote:

(I assume that's just a lot of codes for Chinese characters. Unsure how up-to-date or accurate it is though. Unicode, each year, is always getting updates/additions/revisions. Since that article was written in 2012, we've gone from Unicode 6.1 -> Unicode 14.0.)

Yes, the Chinese characters plus spaces. Unicode revisions don't matter here because the characters themselves are already in InDesign and correct, they just aren't showing because they haven't yet had a font supplied because I've selected all at first and made it the paragraph style for the English. Then the GREP search styles characters in those ranges with a Chinese font. The Chinese typesetting is done outside of InDesign and is already checked, it only needs a font.

I haven't yet exported an InDesign file with Chinese as EPUB, but the Chinese would have a span name so I guess that could be replaced easily in Atom or something with the span containing the language tags. Not sure if InDesign would put the span around individual characters or phrases.

Tex2002ans · 07-21-2022, 03:11 PM

Quote:

Originally Posted by bookman156

Yes, the Chinese characters plus spaces. Unicode revisions don't matter here [...]. Then the GREP search styles characters in those ranges with a Chinese font.

Looking a bit closer, it looks like there's been ~10,000 new CJK characters added to Unicode since then.

(~5,700 in Unicode 8.0 + ~5,000 in Unicode 13.0.)

(And ~4,200 more CJK characters are going to be added in Unicode 15.0, which will be coming out later this year.)

That \x{} numbers method would fail, if it doesn't cover all those new cases.

Whereas \p{Han} would detect all of those characters, as long as the program understands the latest Unicode.
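
A rough Python sketch of that failure, using the ranges from the article. The two test characters are my own picks: one from the original CJK block and the first code point of CJK Extension G.

```python
import re

OLD_RANGES = re.compile("[\u2E80-\u9FBB\u3000-\u303F\uFF01-\uFF60]+")

old_char = "\u6C34"      # 水, in the basic CJK Unified Ideographs block
new_char = "\U00030000"  # first code point of CJK Extension G (Unicode 13.0)

print(bool(OLD_RANGES.fullmatch(old_char)))  # covered by the fixed ranges
print(bool(OLD_RANGES.fullmatch(new_char)))  # falls outside every range
```

Python's built-in `re` cannot do `\p{Han}` at all. As far as I know the third-party `regex` module can, and its coverage depends on the Unicode version it ships with.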

Quote:

Originally Posted by bookman156

I haven't yet exported an InDesign file with Chinese as EPUB, but the Chinese would have a span name [...]. Not sure if InDesign would put the span around individual characters or phrases.

I'm unsure too, but if it's anything like what I've seen, it'll be ugly! lol.

But if you want to make your life easier...

Make sure you create a Character Style.

You can:

Give it an easy name, like "chinese".
Assign it a CJK font.
Mark as "Chinese" language.
- If exporting to PDF (or HTML), this is very important.

That will make it much easier to convert to clean HTML + classes.

(InDesign also has this great thing called "Style Mapping" which is an enormous help too... if you use your Styles properly!)

- - - - -

Complete Side Note: One really ugly thing I just learned in Microsoft Word.

If you type a link, like:

Code:

http://www.example.com/

Then come back to it at a later date and add text between:

Code:

http://www.exa123mple.com/

in the internals, Word splits it into 3 chunks:

http://www.exa
123
mple.com/

and points all 3 pieces to the same exact URL.

So instead of this in your HTML:

Code:

<a href="http://www.exa123mple.com/">http://www.exa123mple.com/</a>

you would have this:

Code:

<a href="http://www.exa123mple.com/">http://www.exa</a><a href="http://www.exa123mple.com/">123</a><a href="http://www.exa123mple.com/">mple.com/</a>

(LibreOffice recently fixed this for 7.5.)

I'm betting InDesign has all sorts of mess like that too.

And this could explain some of the real disastrous documents I've gotten, where there are millions of overlapping <span>s which seem to all be the same code.
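
If you get stuck with split links like the Word example above, a cleanup pass can glue them back together. This is a minimal Python sketch with a hypothetical `merge_split_links` helper. It assumes plain anchors with only an href attribute and text-only content, so for real-world HTML a proper parser would be safer.

```python
import re

# Two back-to-back <a> elements with the same href and nothing between them.
SPLIT_LINK = re.compile(r'<a href="([^"]*)">([^<]*)</a><a href="\1">')

def merge_split_links(html):
    """Repeatedly fuse adjacent links that point at the same URL."""
    while True:
        merged = SPLIT_LINK.sub(r'<a href="\1">\2', html)
        if merged == html:
            return merged
        html = merged

broken = (
    '<a href="http://www.exa123mple.com/">http://www.exa</a>'
    '<a href="http://www.exa123mple.com/">123</a>'
    '<a href="http://www.exa123mple.com/">mple.com/</a>'
)
print(merge_split_links(broken))
```

The loop is there because each pass only fuses neighbouring pairs, so a link split into 3 chunks needs 2 passes.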

bookman156 · 07-21-2022, 07:01 PM

Quote:

Looking a bit closer, it looks like there's been ~10,000 new CJK characters added to Unicode since then.

(~5,700 in Unicode 8.0 + ~5,000 in Unicode 13.0.)

(And ~4,200 more CJK characters are going to be added in Unicode 15.0, which will be coming out later this year.)

That \x{} numbers method would fail, if it doesn't cover all those new cases.

All the characters I am so far using were already in Unicode ages ago, so effectively I'm still a decade and a half ago. Wenlin gives the Unicode for each character I use and when I take an interest I see they're in the range already specified. But certainly it would be useful on a major new book project to look at what the current ranges are (but only for traditional, I don't need simple and I guess much of the new ones are simple). The fact that my copy of Wenlin has the character means it's in the old range. I'm not sure what these new characters are. In Chinese, believe it or not, you can actually get a dictionary several inches thick of ancient characters no-one knows the meaning of! What a great idea for a dictionary.

Yes, certainly I make a character style for Chinese. Then that will become the style name of the exported span class.

Tex2002ans · 07-21-2022, 10:16 PM

Quote:

Originally Posted by bookman156

Yes, certainly I make a character style for Chinese. Then that will become the style name of the exported span class.

Most people don't use Styles, so just wanted to make sure.

Quote:

Originally Posted by bookman156

I'm not sure what these new characters are.

Here's all the characters coming down the pipeline in Unicode 15.0:

https://www.unicode.org/charts/PDF/Unicode-15.0/

Looks like it's all in the "CJK Unified Ideographs Extension H" section (click for PDF).

Similar can be found for:

Unicode 13.0
- CJK Extension G
Unicode 10.0
- CJK Extension F
Unicode 8.0
- CJK Extension E

If you visit that page + click on the categories, you can get PDFs showing you every newly accepted character highlighted in yellow.

They also have the page:

UAX #38: Unicode Han Database (Unihan)

which lists/describes all the CJK sources in detail.

(Looks like 11.0 + 13.0 added lots of characters from a 2013 document published by the Chinese government.)

Quote:

Originally Posted by bookman156

In Chinese, believe it or not, you can actually get a dictionary several inches thick of ancient characters no-one knows the meaning of! What a great idea for a dictionary.

Well, the mission is to digitize and preserve all the languages in the world.

And the characters must exist in some authoritative source somewhere, no matter how rare.

Unicode has quite a high bar to get new ones accepted.

Heck, it was only in 2009 (Unicode 5.2) when Egyptian Hieroglyphics were added!

bookman156 · 07-21-2022, 11:19 PM

Fascinating. Traditional Chinese, some of which probably no-one on the planet knows how to pronounce. Also some unusual characters that are quite amazing designs. Most of the characters from the classics were done ages ago, probably we're getting on to obscure place-names and one-pig villages now.

Quoth · 07-22-2022, 08:13 AM

People at railway stations using notepads (or the temporary LCD like erasable wax) because they understand the written language, but they talk in their own language, so I'd imagine written Chinese isn't usually pronounced at all unless it's some deliberate phonetic transliteration. Maybe nearly 300 actually used spoken languages in China.

bookman156 · 07-22-2022, 10:16 AM

The written language is spoken, international pinyin represents its pronunciation (Beijing, rather than Peking in the old Wade-Giles system, for example), though Cantonese is spoken differently. But there are plenty who speak Chinese who can't read or write it.

There's a saying in China that if you travel a hundred miles no-one will understand you. That's why people get out their notebooks to draw the characters. Same as me saying 'Spell it' when I can't understand someone's dialect.
