07-20-2022, 10:54 PM   #241
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by bookman156
Fantastic, that's useful.

No problem.

Quote:
Originally Posted by bookman156
By the way, if you set Chinese with English in books with InDesign here's a fantastic technique that's good to know about. It automatically finds the Chinese and puts it in your desired Chinese font using GREP (I prefer 'Noto Sans CJK TC' for traditional Chinese):

https://www.scammell.co.uk/2012/04/2...dobe-indesign/

Thanks.

In Sigil/Calibre, you can use regex to look for "Unicode properties":

Code:
\p{Han}
\p{Greek}
The 1st matches Han characters (Chinese, plus the Japanese kanji and Korean hanja that share the same script), the 2nd matches Greek characters.

That's a super advanced regex thing though.
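
If you want to see what those properties are picking out, here's a rough Python sketch. Python's built-in re module doesn't understand \p{...} (the third-party regex module does), so this only approximates the idea by looking at each character's Unicode name. The script_of helper and the name-prefix trick are my own illustration, not what Sigil/Calibre do internally.

```python
import unicodedata

def script_of(ch):
    """Rough script guess from the character's Unicode name (an approximation)."""
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")):
        return "Han"      # roughly what \p{Han} would match
    if name.startswith("GREEK"):
        return "Greek"    # roughly what \p{Greek} would match
    return "Other"

sample = "abc 中文 λόγος"
# Pair every non-space character with its guessed script.
tags = [(ch, script_of(ch)) for ch in sample if not ch.isspace()]
for ch, script in tags:
    print(ch, script)
```

A real regex engine uses the Unicode Script property rather than names, so treat this as a way to poke at individual characters, not a replacement for the search.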

* * *

Finding All Chinese Characters in Sigil/Calibre

You could use this regex:

Search: (\p{Han}+)
Replace: <span class="chinese" lang="zh-Hant" xml:lang="zh-Hant">\1</span>

and that would tag every run of Chinese characters in a single shot. Use whichever lang value matches the text:
  • zh-Hans = Chinese (Simplified)
  • zh-Hant = Chinese (Traditional)
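
The same search/replace can be sketched outside Sigil in Python. Since the built-in re module has no \p{Han}, this version assumes the main CJK ideograph blocks are a close enough stand-in; rarer ideographs outside those ranges would be missed. The tag_chinese name and the lang parameter are made up for the example.

```python
import re

# Approximation of \p{Han}: Extension A, the main Unified Ideographs block,
# and the Compatibility Ideographs block. (Assumption: nothing rarer is needed.)
HAN_RUN = re.compile(r"([\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]+)")

def tag_chinese(html, lang="zh-Hant"):
    """Wrap each run of Han characters in a language-tagged span."""
    replacement = r'<span class="chinese" lang="{0}" xml:lang="{0}">\1</span>'.format(lang)
    return HAN_RUN.sub(replacement, html)

print(tag_chinese("<p>Hello 世界 and 你好!</p>"))
```

Like the Sigil version, this is a one-shot pass: running it a second time on already-tagged text would nest a new span inside the old one.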

- - -

Side Note: In any ebook, I mostly just use Sigil's Tools > Reports > Characters in HTML Files and look for anything suspicious.

Definitely a good habit to get into, because so many times authors sneak garbage in, like:

... and all sorts of other crap, especially when they paste from other documents or online sources.

That report allows you to just see every single character used in the ebook, at a glance.
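
If you ever need the same kind of overview outside Sigil, a rough equivalent is only a few lines of Python. This is just a sketch of the idea, not how Sigil builds its report; character_report is a hypothetical helper, and the no-break space and zero-width space in the sample are my own examples of characters that are easy to miss by eye.

```python
from collections import Counter
import unicodedata

def character_report(text):
    """Return (char, code point, Unicode name, count) rows sorted by code point."""
    counts = Counter(text)
    return [
        (ch, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in sorted(counts.items())
    ]

# Sample text with a no-break space and a zero-width space hidden in it.
sample = "caf\u00e9\u00a0\u200bbar"
for row in character_report(sample):
    print(row)
```

Seeing the code point and name next to each character is what makes the invisible ones stand out.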

- - -

* * *

Fonts are one of the biggest reasons why I began tagging languages.

A few books I worked on had a handful of Polytonic Greek words, so it was easy to spot/tag them + they're one of the rare use-cases where you may want to embed a font.

(Most fonts have basic Greek letters, but not all the ones with the crazy accents on them.)

That's what started the snowball rolling.

And here we are all these years later. Now I'm a fiend!

I must admit, tagging languages is still a pain, and not THAT beneficial yet. (Sucks up lots of time with minimal benefit.) But, I do it on the easy stuff:
  • Chapters
    • British article in an otherwise American journal.
  • Entire paragraph/blockquotes
  • Poems/Lyrics
  • Handful of words in completely different alphabets
    • Greek, Chinese/Japanese, etc.
    • (Very easy to apply fonts + subset them.)

Trying to tag everything down to the word-level is... yeah... I described the real-life problems with that in the Reddit thread.

Someone was complaining that LibreOffice "should just automatically detect/tag words... like Google Translate!!!". I believe I put him in his place lol.

The way LibreOffice handles it though is pretty genius, and I never even thought of it:

If your keyboard changes between languages, the cursor will swap languages too.

For those who swap between multiple languages while typing their documents, that sounds like a pretty decent compromise to me.

* * *

Do you speak or read Chinese?

If so, I may still need your help in that 2020 "Should Chinese Fonts be Embedded?" thread.

(That project has been sitting dormant for a few years, and since I don't read/write Chinese...)

Did I tag the Simplified/Traditional Chinese characters properly?

Do the characters in my sample EPUB match the PDF? (Or did the person who created the direct formatting in the DOC potentially botch it up? Because they did have the foreign characters marked as "French", so I don't trust them!!! lol.)

Last edited by Tex2002ans; 07-20-2022 at 11:20 PM.