Old 07-21-2022, 01:20 AM   #247
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by bookman156
Wow, I was wondering whether there was a good way to do that. But presumably it would take each character of a phrase rather than the whole phrase.
The '+' sign in regex means "ONE OR MORE of the previous thing".

So the regex I gave would:
  • Find "ONE OR MORE Chinese characters in a row."
  • When you replace it, "Wrap the entire chunk in a <span>".

That's very similar to the Adobe InDesign GREP method in that article of yours:

Code:
[\x{2E80}-\x{9FBB}\x{3000}-\x{303F}\x{FF01}-\x{FF60}]+
There are 3 major parts. I'll explain the easy ones first.

The Easy Parts

Brackets and the plus sign are special symbols in regular expressions!
  • []
    • = "Look for a single character that matches any ONE of the characters listed in between the open/close brackets."
  • +
    • = "Look for ONE OR MORE of the previous thing."
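
To see the difference the plus sign makes, here is a small sketch using Python's built-in re module (Python and the sample word are just my picks for illustration, not something from the article):

```python
import re

text = "queue"

# Brackets alone: each match is exactly ONE character from the list.
singles = re.findall(r"[aeiou]", text)

# Brackets plus '+': each match is a whole run of listed characters.
runs = re.findall(r"[aeiou]+", text)

print(singles)  # ['u', 'e', 'u', 'e']
print(runs)     # ['ueue']
```

Without the '+', you get four separate one-letter matches. With it, the four vowels in a row come back as a single chunk.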

The "Hard" Parts

All of that \x{} stuff is a (scary-looking) way to search for specific Unicode characters by their (hexadecimal) number.

For example, if you wrote a simple:
  • [a-z]
    • In English, that means "Look for any character BETWEEN 'a' and 'z'."
      • So it would match 'a' OR 'b' OR 'c' OR ... OR 'y' OR 'z'.

Each character in Unicode is assigned a number (its "code point"), usually written in hexadecimal:
  • a = 0061
  • b = 0062
  • c = 0063
  • d = 0064
  • [...]
  • z = 007A
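
If you want to check those numbers yourself, here is a quick sketch in Python (the helper name in_a_to_z is made up for this example). It prints the code point of a few letters and shows that a range like [a-z] boils down to a simple "is this number between those two numbers?" test:

```python
# ord() returns a character's Unicode code point as an integer;
# format(..., "04X") writes it as 4-digit uppercase hex.
code_points = {ch: format(ord(ch), "04X") for ch in "abcdz"}

for ch, hex_code in code_points.items():
    print(ch, "=", hex_code)

def in_a_to_z(ch):
    """Hypothetical helper: True if ch falls between 'a' and 'z'."""
    return ord("a") <= ord(ch) <= ord("z")

print(in_a_to_z("m"))  # True
print(in_a_to_z("A"))  # False (uppercase 'A' is 0041, outside 0061-007A)
```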

So when you write:
  • [\x{2E80}-\x{9FBB}]

it's saying:

"Look for any character between these two:"
  • ⺀ = 2E80
  • [...]
  • 龻 = 9FBB
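
The same check can be sketched in Python. One assumption to flag: Python's re module does not use the \x{2E80} spelling from the GREP. It writes the same code point as \u2E80, so I've translated the range. The in_range helper is a made-up name for this example:

```python
import re

# The two endpoints of the range, written as Python escapes.
low, high = "\u2E80", "\u9FBB"
print(low, "=", format(ord(low), "04X"))
print(high, "=", format(ord(high), "04X"))

# Same range as [\x{2E80}-\x{9FBB}], in Python's \uXXXX notation.
pattern = re.compile(r"[\u2E80-\u9FBB]")

def in_range(ch):
    """Hypothetical helper: True if the single character ch is in the range."""
    return pattern.fullmatch(ch) is not None

print(in_range("\u6F22"))  # U+6F22 sits between 2E80 and 9FBB
print(in_range("a"))       # U+0061 does not
```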

So everything in that GREP, sandwiched in between those brackets, is just a big long list of:
  • "Everything between these characters/numbers to those characters/numbers."

Sticking It All Together
  • [ = "Look for any of the characters that are between the open/close brackets."
    • \x{2E80}-\x{9FBB}
    • = "any character between 2E80 and 9FBB."
    • \x{3000}-\x{303F}
    • = "any character between 3000 and 303F."
    • \x{FF01}-\x{FF60}
    • = "any character between FF01 and FF60."
  • ] = "That's the end of my giant list of characters."
  • + = "Okay, now keep looking for ONE OR MORE of the characters in that giant list."
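
Here is how the whole thing might look as a Python sketch. Assumptions: I've translated \x{...} into Python's \u.... notation, and the function name wrap_cjk and the lang="zh" attribute are placeholders I picked for the example, so swap in whatever markup your book actually needs:

```python
import re

# Same three ranges as the GREP above, in Python's \uXXXX notation.
CJK_RUN = re.compile(r"[\u2E80-\u9FBB\u3000-\u303F\uFF01-\uFF60]+")

def wrap_cjk(text, lang="zh"):
    """Wrap each run of matched characters in ONE <span> (hypothetical markup)."""
    return CJK_RUN.sub(
        lambda m: f'<span lang="{lang}">{m.group(0)}</span>', text
    )

# \uFF0C and \uFF01 are the fullwidth comma and exclamation mark.
sample = "Hello 你好\uFF0C世界\uFF01 bye"
print(wrap_cjk(sample))
```

Because the fullwidth punctuation is covered by the FF01-FF60 range, the whole phrase ends up inside a single span instead of being split at the comma.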

(I assume that's just a lot of codes for Chinese characters. I'm unsure how up-to-date or accurate it is, though. Unicode gets updates/additions/revisions every year: since that article was written in 2012, we've gone from Unicode 6.1 -> Unicode 14.0.)
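
One rough way to probe what that list covers is to test a few characters against it. This is only a sketch in Python, and the three sample characters are my own picks: a common ideograph (U+6F22), a Japanese hiragana letter (U+3042), and an ideograph from CJK Extension B (U+20000). Note also that 3000-303F already sits inside 2E80-9FBB, so that second range looks redundant:

```python
import re
import unicodedata

# The character class from the GREP, in Python's \uXXXX notation.
CJK_CLASS = re.compile(r"[\u2E80-\u9FBB\u3000-\u303F\uFF01-\uFF60]")

def covered(ch):
    """Hypothetical helper: True if the class matches this one character."""
    return CJK_CLASS.fullmatch(ch) is not None

samples = ["\u6F22", "\u3042", "\U00020000"]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {covered(ch)}")

# The second range is a subset of the first one.
print(0x2E80 <= 0x3000 and 0x303F <= 0x9FBB)  # True

# Which Unicode version this Python build knows about.
print(unicodedata.unidata_version)
```

So the list is broader than "Chinese" in one direction (it also catches Japanese kana) and narrower in another (it misses ideographs above 9FBB, including everything outside the 4-digit range).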

Quote:
Originally Posted by bookman156
Here, I did notice this:

[...]
Thanks. I'll definitely have to go back and adjust my latest EPUB based on your notes.

I might be able to get to redoing that EPUB this weekend. (If not, then in a few weeks.)

Let's continue that discussion in a Private Message if needed.

Quote:
Originally Posted by bookman156
[...] you're tagging traditional Chinese as Japanese.
Yeah, I guessed the language based on the original font information from the DOC.

(In Post #3 of that thread, I explained which CJK Microsoft fonts seemed to be assigned to which language in legacy DOC documents.)

Quote:
Originally Posted by bookman156
Oddly enough the paper was on a subject I have written on myself, so that was kinda interesting.
Cool!

Yeah, I haven't read through those documents yet. I finished about 99%+ of the conversion, and was doing finishing touches before tossing it on my device to proofread.

Side Note: And, my friend, you have got to combine your posts instead of 100 little mini-ones. lol.

Last edited by Tex2002ans; 07-22-2022 at 03:02 PM.