07-20-2022, 10:54 PM | #241 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
No problem.
Quote:
In Sigil/Calibre, you can use regex to look for "Unicode properties":
Code:
\p{Han}
\p{Greek}
That's a super advanced regex thing though.
* * *
Finding All Chinese Characters in Sigil/Calibre
You could use this regex:
Search: (\p{Han}+)
Replace: <span class="chinese" lang="zh-Hant" xml:lang="zh-Hant">\1</span>
and that would tag all Chinese words in a single shot.
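For anyone who wants to try the same search/replace outside Sigil/Calibre, here is a minimal Python sketch. Python's built-in re module does not understand \p{Han} (the third-party regex module does), so this sketch assumes a hand-picked list of CJK ideograph ranges as a rough stand-in. The function name and the sample sentence are illustrative only.

```python
import re

# Stand-in for \p{Han}: a hand-picked set of CJK ideograph blocks. This is an
# approximation; the real Unicode property covers more than these ranges.
HAN_RUN = re.compile(
    r"([\u3400-\u4DBF"            # CJK Unified Ideographs Extension A
    r"\u4E00-\u9FFF"              # CJK Unified Ideographs
    r"\uF900-\uFAFF"              # CJK Compatibility Ideographs
    r"\U00020000-\U0002FA1F]+)"   # Extension B through the compatibility supplement
)

# Same replacement as the Sigil/Calibre example: \1 is the matched run.
SPAN = r'<span class="chinese" lang="zh-Hant" xml:lang="zh-Hant">\1</span>'


def tag_han_runs(text: str) -> str:
    """Wrap each run of Han characters in the language-tagged span."""
    return HAN_RUN.sub(SPAN, text)


sample = "born at Liuhe 六合 county in what is today Nanjing 南京."
tagged = tag_han_runs(sample)
print(tagged)
```

Note that running this twice on the same text would wrap the characters twice, so it is only meant for untagged text.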
- - - Side Note: In any ebook, I mostly just use Sigil's Tools > Reports > Characters in HTML Files and look for anything suspicious. Definitely a good habit to get into, because so many times authors sneak garbage in, like:
... and all sorts of other crap, especially when they paste from other documents or online sources. That report allows you to just see every single character used in the ebook, at a glance.
- - -
* * *
Fonts are one of the biggest reasons why I began tagging languages. A few books I worked on had a handful of Polytonic Greek words, so it was easy to spot/tag them + they're one of the rare use-cases where you may want to embed a font. (Most fonts have basic Greek letters, but not all the ones with the crazy accents on them.)
That's what started the snowball rolling. And here we are all these years later. Now I'm a fiend!
I must admit, tagging languages is still a pain, and not THAT beneficial yet. (Sucks up lots of time with minimal benefit.) But, I do it on the easy stuff:
Trying to tag everything down to the word-level is... yeah... I described the real-life problems with that in the Reddit thread. Someone was complaining that LibreOffice "should just automatically detect/tag words... like Google Translate!!!". I believe I put him in his place lol.
The way LibreOffice handles it though is pretty genius, and I never even thought of it: If your keyboard changes between languages, the cursor will swap languages too. For those who swap between multiple languages while typing their documents, that sounds like a pretty decent compromise to me.
* * *
Do you speak or read Chinese? If so, I may still need your help in that 2020 "Should Chinese Fonts be Embedded?" thread. (That project has been sitting dormant for a few years, and since I don't read/write Chinese...)
Did I tag the Simplified/Traditional Chinese characters properly?
Do my sample EPUB's characters match the PDF? (Or did the person who created the direct formatting in the DOC potentially botch it up? Because they did have the foreign characters marked as "French", so I don't trust them!!! lol.)
Last edited by Tex2002ans; 07-20-2022 at 11:20 PM. |
|
07-20-2022, 11:23 PM | #242 | ||||
Addict
Posts: 368
Karma: 1000000
Join Date: Mar 2016
Device: none
|
Quote:
Quote:
Quote:
Code:
Liu E, also known as Liu Tieyun <span class="japanese" lang="ja" xml:lang="ja">劉鐵雲</span>, was born in 1857 at Liuhe <span class="japanese" lang="ja" xml:lang="ja">六合</span> county in what is today Nanjing <span class="japanese" lang="ja" xml:lang="ja">南京</span>.
As a general rule, an English paper about ancient China or its history that includes Chinese characters will use traditional Chinese.
Quote:
|
||||
07-20-2022, 11:35 PM | #243 |
Addict
|
Actually, you've tagged a lot of traditional Chinese as Japanese. Is there actually any Japanese in there? Of course, Japanese shares some Chinese characters (kanji).
|
07-20-2022, 11:37 PM | #244 |
Addict
|
Okay, I looked all the way through. There isn't any Japanese, it's all traditional Chinese. So that should facilitate finishing your project off.
|
07-20-2022, 11:49 PM | #245 | ||
Addict
|
No, there is one little bit of Japanese:
Quote:
Quote:
Last edited by bookman156; 07-20-2022 at 11:53 PM. |
||
07-21-2022, 12:08 AM | #246 | |
Addict
|
In this:
Quote:
|
|
07-21-2022, 01:20 AM | #247 | ||
Wizard
|
Quote:
So the regex I gave would:
Very similar to what was written in that Adobe InDesign GREP method in that article of yours:
Code:
[\x{2E80}-\x{9FBB}\x{3000}-\x{303F}\x{FF01}-\x{FF60}]+
The Easy Parts
Brackets and the plus sign are special symbols in regular expressions!
The "Hard" Parts
All of that \x{} stuff is a (scary-looking) way to search for specific Unicode characters. For example, if you wrote a simple:
Each letter in Unicode gets assigned numbers:
So when you write:
it's saying: "Look for any character between:
So everything in that GREP, sandwiched in between those brackets, is just a big long list of:
Sticking It All Together
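As a rough sketch of how those pieces fit together, the same character class can be rebuilt in Python. Two assumptions: Python's re module does not accept the \x{2E80} spelling, so the sketch uses the equivalent \u2E80 form, and the sample text and variable names are made up for illustration. Brackets mean "any one character from this set", each X-Y pair inside them means "every code point from X to Y", and the trailing + means "one or more in a row".

```python
import re

# The three ranges from the GREP expression above, as (first, last) code points.
RANGES = [(0x2E80, 0x9FBB), (0x3000, 0x303F), (0xFF01, 0xFF60)]

# Build the bracketed set; Python spells \x{2E80} as \u2E80.
char_class = "".join(f"\\u{first:04X}-\\u{last:04X}" for first, last in RANGES)
pattern = re.compile(f"[{char_class}]+")

# Every character has a number (its code point); 南 is U+5357.
code_point = ord("南")
in_first_range = 0x2E80 <= code_point <= 0x9FBB

# Each run of characters that falls inside the ranges comes back as one match.
matches = pattern.findall("Nanjing 南京, Liuhe 六合")

# U+20000 (CJK Extension B) lies outside all three ranges, so it is missed.
misses_extension_b = pattern.search("\U00020000") is None

print(hex(code_point), in_first_range, matches, misses_extension_b)
```

The last check shows the weakness of hard-coded ranges: any character assigned outside them is silently skipped.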
(I assume that's just a lot of codes for Chinese characters. Unsure how up-to-date or accurate it is though. Unicode, each year, is always getting updates/additions/revisions. Since that article was written in 2012, we've gone from Unicode 6.1 -> Unicode 14.0.)
Thanks. I'll definitely have to go back and adjust my latest EPUB based on your notes. I might be able to get to redoing that EPUB this weekend. (If not, then in a few weeks.) Let's continue that discussion in a Private Message if needed.
Yeah, I guessed the language based on the original font information from the DOC. (In Post #3 of that thread, I explained which CJK Microsoft fonts seemed to be assigned to which language in legacy DOC documents.)
Quote:
Yeah, I haven't read through those documents yet. I finished about 99%+ of the conversion, and was doing finishing touches before tossing it on my device to proofread.
Side Note: And, my friend, you have got to combine your posts instead of 100 little mini-ones. lol.
Last edited by Tex2002ans; 07-22-2022 at 03:02 PM. |
||
07-21-2022, 01:50 AM | #248 | |
Addict
|
Quote:
I haven't yet exported an InDesign file with Chinese as EPUB, but the Chinese would get a span with a class name, so I guess that could be replaced easily, in Atom or something, with a span containing the language tags. Not sure if InDesign would put the span around individual characters or phrases. |
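A minimal sketch of that kind of find-and-replace, written in Python rather than done by hand in an editor. The class name "chinese" is hypothetical (InDesign would export whatever the Character Style is called), and the zh-Hant tag assumes the text is Traditional Chinese.

```python
# Hypothetical class name; the real one depends on the Character Style's name.
STYLE_CLASS = "chinese"

BARE = f'<span class="{STYLE_CLASS}">'
TAGGED = f'<span class="{STYLE_CLASS}" lang="zh-Hant" xml:lang="zh-Hant">'


def add_language_tags(html: str) -> str:
    """Swap the bare exported span for one that also carries language attributes."""
    return html.replace(BARE, TAGGED)


exported = (
    'Liu Tieyun <span class="chinese">劉鐵雲</span>, '
    'born at Liuhe <span class="chinese">六合</span>'
)
tagged = add_language_tags(exported)
print(tagged)
```

Because the search string is the exact bare span, spans that already carry the language attributes are left alone if this is run a second time.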
|
07-21-2022, 03:11 PM | #249 | ||
Wizard
|
Quote:
(~5,700 in Unicode 8.0 + ~5,000 in Unicode 13.0.) (And ~4,200 more CJK characters are going to be added in Unicode 15.0, which will be coming out later this year.)
That \x{} numbers method would fail if its ranges don't cover all those new cases, whereas \p{Han} would detect all of the characters, as long as the program understands the latest Unicode.
Quote:
But if you want to make your life easier... Make sure you create a Character Style. You can:
That will make it much easier to convert to clean HTML <span>+classes. (InDesign also has this great thing called "Style Mapping" which is an enormous help too... if you use your Styles properly!)
- - - - -
Complete Side Note: One really ugly thing I just learned in Microsoft Word. If you type a link, like:
Code:
http://www.example.com/
Code:
http://www.exa123mple.com/
and points all 3 pieces to the same exact URL. So instead of this in your HTML:
Code:
<a href="http://www.exa123mple.com/">http://www.exa123mple.com/</a>
Code:
<a href="http://www.exa123mple.com/">http://www.exa</a><a href="http://www.exa123mple.com/">123</a><a href="http://www.exa123mple.com/">mple.com/</a>
I'm betting InDesign has all sorts of mess like that too. And this could explain some of the really disastrous documents I've gotten, where there are millions of overlapping <span>s which seem to all be the same code.
Last edited by Tex2002ans; 07-21-2022 at 03:18 PM. |
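A minimal sketch of how split links like the Word example above could be stitched back together after export. It assumes the anchors carry only an href attribute and contain plain text with no nested tags; the function name is made up for illustration, and real-world exports would likely need an HTML parser instead of a regex.

```python
import re

# A link's opening tag and text, then </a> immediately followed by a new <a>
# that opens the very same href (\2 must repeat the captured URL exactly).
SPLIT_LINK = re.compile(r'(<a href="([^"]*)">[^<]*)</a><a href="\2">')


def merge_split_links(html: str) -> str:
    """Repeatedly join back-to-back links that point at the same URL."""
    previous = None
    while previous != html:
        previous = html
        # Keep the first link's opening tag and text, drop the seam.
        html = SPLIT_LINK.sub(r"\1", html)
    return html


messy = (
    '<a href="http://www.exa123mple.com/">http://www.exa</a>'
    '<a href="http://www.exa123mple.com/">123</a>'
    '<a href="http://www.exa123mple.com/">mple.com/</a>'
)
clean = merge_split_links(messy)
print(clean)
```

The loop is there because a single pass only removes every other seam when three or more pieces sit in a row.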
||
07-21-2022, 07:01 PM | #250 | |
Addict
|
Quote:
Yes, certainly I make a character style for Chinese. That style's name then becomes the class name of the exported span.
Last edited by bookman156; 07-21-2022 at 07:11 PM. |
|
07-21-2022, 10:16 PM | #251 | ||
Wizard
|
Quote:
Most people don't use Styles, so just wanted to make sure.
Here's all the characters coming down the pipeline in Unicode 15.0:
https://www.unicode.org/charts/PDF/Unicode-15.0/
Looks like it's all in the "CJK Unified Ideographs Extension H" section. Similar can be found for:
If you visit that page + click on the categories, you can get PDFs showing you every newly accepted character highlighted in yellow. They also have the page: which lists/describes all the CJK sources in detail. (Looks like 11.0 + 13.0 added lots of characters from a 2013 document published by the Chinese government.) Quote:
And the characters must exist in some authoritative source somewhere, no matter how rare. Unicode has quite a high bar to get new ones accepted. Heck, it was only in 2009 (Unicode 5.2) when Egyptian Hieroglyphics were added! |
||
07-21-2022, 11:19 PM | #252 |
Addict
|
Fascinating. Traditional Chinese, some of which probably no-one on the planet knows how to pronounce. Also some unusual characters that are quite amazing designs. Most of the characters from the classics were done ages ago, probably we're getting on to obscure place-names and one-pig villages now.
|
07-22-2022, 08:13 AM | #253 |
the rook, bossing Never.
Posts: 11,166
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
People at railway stations use notepads (or those temporary LCD tablets that work like erasable wax) because they understand the written language but talk in their own language, so I'd imagine written Chinese isn't usually pronounced at all unless it's some deliberate phonetic transliteration. There are maybe nearly 300 spoken languages actually in use in China.
|
07-22-2022, 10:16 AM | #254 |
Addict
|
The written language is spoken; international pinyin represents its pronunciation (Beijing, rather than the older romanization Peking, for example), though Cantonese is spoken differently. But there are plenty who speak Chinese who can't read or write it.
There's a saying in China that if you travel a hundred miles no-one will understand you. That's why people get out their notebooks to draw the characters. Same as me saying 'Spell it' when I can't understand someone's dialect.
Last edited by bookman156; 07-22-2022 at 10:26 AM. |
Tags |
semantic markup |
|