Quote:
Originally Posted by bookman156
Wow, I was wondering whether there was a good way to do that. But presumably it would take each character of a phrase rather than the whole phrase.
The '+' sign in regex means "ONE OR MORE of the previous thing".
So the regex I gave would:
- Find "ONE OR MORE Chinese characters in a row."
- When you replace it, "Wrap the entire chunk in a <span>".
It's very similar to the Adobe InDesign GREP method in that article of yours:
Code:
[\x{2E80}-\x{9FBB}\x{3000}-\x{303F}\x{FF01}-\x{FF60}]+
There are 3 major parts. I'll explain the easy ones first.
The Easy Parts
Brackets and the plus sign are special symbols in regular expressions!
- []
- = "Look for a single character that matches what's in between the open/close brackets."
- +
- = "Look for ONE OR MORE of the previous thing."
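To see the difference those two symbols make, here's a minimal sketch in Python (my choice for illustration; the sample text and the [abc] list are made up, not from the article):

```python
import re

text = "abc xyz cab"

# [abc] matches exactly ONE character that is 'a', 'b', or 'c'.
singles = re.findall(r"[abc]", text)

# [abc]+ matches ONE OR MORE of those characters in a row, as a single chunk.
runs = re.findall(r"[abc]+", text)

print(singles)  # one entry per matching character
print(runs)     # one entry per unbroken run
```

Without the '+', you get each character separately. With it, you get whole chunks, which is why the replacement wraps an entire phrase instead of each character.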
The "Hard" Parts
All of that \x{} stuff is a (scary-looking) way to search for specific Unicode characters.
For example, if you wrote a simple:
- [a-z]
- In English, that means "Look for any character BETWEEN 'a' and 'z'."
- So it would match 'a' OR 'b' OR 'c' OR ... OR 'y' OR 'z'.
Each character in Unicode is assigned a number (written here in hexadecimal):
- a = 0061
- b = 0062
- c = 0063
- d = 0064
- [...]
- z = 007A
So when you write:
- [a-z]
it's saying:
"Look for any character between 0061 and 007A."
So everything in that GREP, sandwiched in between those brackets, is just a big long list of:
- "Everything between these characters/numbers to those characters/numbers."
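If you want to check the letter/number relationship yourself, here's a rough sketch in Python. One assumption to flag: Python's re module writes a code point as \uXXXX rather than the \x{XXXX} form that InDesign GREP and PCRE-style engines use, so the pattern is spelled a little differently. The sample string is made up.

```python
import re

# Print the Unicode number of a few letters as 4-digit hex.
for ch in "abcdz":
    print(ch, "=", format(ord(ch), "04X"))

# The same range written two ways: by letter and by number.
# (\uXXXX is Python's spelling; other engines use \x{XXXX}.)
by_letter = re.compile(r"[a-z]+")
by_number = re.compile(r"[\u0061-\u007A]+")

sample = "Hello, World 123"
print(by_letter.findall(sample))
print(by_number.findall(sample))
```

Both patterns find the same chunks, because 'a' through 'z' and 0061 through 007A describe the same set of characters.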
Sticking It All Together
- [ = "Look for any of the characters that are between the open/close brackets."
- \x{2E80}-\x{9FBB}
- = "any character between 2E80 and 9FBB."
- \x{3000}-\x{303F}
- = "any character between 3000 and 303F."
- \x{FF01}-\x{FF60}
- = "any character between FF01 and FF60."
- ] = "That's the end of my giant list of characters."
- + = "Okay, now keep looking for ONE OR MORE of the characters in that giant list."
(I assume that's just a lot of codes for Chinese characters. I'm unsure how up-to-date or accurate it is, though. Unicode gets updates/additions/revisions nearly every year. Since that article was written in 2012, we've gone from Unicode 6.1 -> Unicode 14.0.)
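Here's how the whole find-and-wrap could look as a small Python sketch. Assumptions: the ranges are copied from the GREP above but rewritten with Python's \uXXXX escapes, and the function name wrap_cjk and the lang="zh" attribute are hypothetical placeholders, not what the article or my earlier regex used.

```python
import re

# Same three ranges as the InDesign GREP, in Python's \uXXXX spelling.
CJK_RUN = re.compile(r"[\u2E80-\u9FBB\u3000-\u303F\uFF01-\uFF60]+")

def wrap_cjk(text, lang="zh"):
    """Wrap each unbroken run of matching characters in one <span>.

    The lang value is a placeholder; pick whatever your book needs.
    """
    return CJK_RUN.sub(
        lambda m: f'<span lang="{lang}">{m.group(0)}</span>',
        text,
    )

print(wrap_cjk("The word 你好 means hello."))
```

Two things worth knowing before trusting those ranges: 3000-303F already sits inside 2E80-9FBB, so the second range is redundant, and the first range also covers hiragana and katakana, so Japanese text would get wrapped too.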
Quote:
Originally Posted by bookman156
Here, I did notice this:
[...]
Thanks. I'll definitely have to go back and adjust my latest EPUB based on your notes.
I might be able to get to redoing that EPUB this weekend. (If not, then in a few weeks.)
Let's continue that discussion in a Private Message if needed.
Quote:
Originally Posted by bookman156
[...] you're tagging traditional Chinese as Japanese.
Yeah, I guessed the language based on the original font information from the DOC.
(In Post #3 of that thread, I explained which CJK Microsoft fonts seemed to be assigned to which language in legacy DOC documents.)
Quote:
Originally Posted by bookman156
Oddly enough the paper was on a subject I have written on myself, so that was kinda interesting.
Cool!
Yeah, I haven't read through those documents yet. I finished about 99%+ of the conversion, and was doing finishing touches before tossing it on my device to proofread.
Side Note: And, my friend, you have got to combine your posts instead of 100 little mini-ones. lol.