MobileRead Forums - View Single Post

Tex2002ans · 07-21-2022, 01:20 AM

Quote:

Originally Posted by bookman156

Wow, I was wondering whether there was a good way to do that. But presumably it would take each character of a phrase rather than the whole phrase.

The '+' sign in regex means "ONE OR MORE of the previous thing".

So the regex I gave would:

Find "ONE OR MORE Chinese characters in a row."
When you replace it, "Wrap the entire chunk in a <span>".
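If you want to see the difference the '+' makes, here is a minimal sketch in Python. A few assumptions: Python's re module writes code points as \uXXXX instead of \x{XXXX}, I only use one of the ranges to keep it short, and the "zh" class name is made up for the demo.

```python
import re

# Python's re module spells code points \uXXXX rather than \x{XXXX}.
# Only one range is used here, to keep the demo short.
CJK_RUN = re.compile(r"[\u2E80-\u9FBB]+")  # with '+': one match per run
CJK_ONE = re.compile(r"[\u2E80-\u9FBB]")   # without '+': one match per character

text = "The word 中文 means Chinese."

def wrap(match):
    # Wrap whatever was matched in a span (the class name is hypothetical).
    return '<span class="zh">' + match.group(0) + "</span>"

whole_phrase = CJK_RUN.sub(wrap, text)
per_character = CJK_ONE.sub(wrap, text)

print(whole_phrase)
print(per_character)
```

With the '+', the two characters end up inside a single span. Without it, each character gets its own span.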

The regex I gave is very similar to the Adobe InDesign GREP method in that article of yours:

Code:

[\x{2E80}-\x{9FBB}\x{3000}-\x{303F}\x{FF01}-\x{FF60}]+

There are 3 major parts. I'll explain the easy ones first.

The Easy Parts

Brackets and the plus sign are special symbols in regular expressions!

[]
- = "Look for a single character that matches what's in between the open/close brackets."
+
- = "Look for ONE OR MORE of the previous thing."

The "Hard" Parts

All of that \x{} stuff is a (scary-looking) way to search for specific Unicode characters by their hexadecimal numbers.

For example, if you wrote a simple:

[a-z]
- In English, that means "Look for any character BETWEEN 'a' and 'z'."
  - So it would match 'a' OR 'b' OR 'c' OR ... OR 'y' OR 'z'.

Each character in Unicode is assigned a number (written here in hexadecimal):

a = 0061
b = 0062
c = 0063
d = 0064
[...]
z = 007A
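You can check those numbers yourself. This is just a quick sketch in Python (my choice of language, and code_point is a made-up helper name): ord() gives a character's number, and format() prints it as 4-digit hex.

```python
def code_point(ch):
    """Return the Unicode number of one character as (at least) 4 hex digits."""
    return format(ord(ch), "04X")

for ch in "abcdz":
    print(ch, "=", code_point(ch))

# chr() goes the other way: from the number back to the character.
print(chr(0x2E80), "=", code_point(chr(0x2E80)))
```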

So when you write:

[\x{2E80}-\x{9FBB}]

it's saying:

"Look for any character between these two:"

⺀ = 2E80
[...]
龻 = 9FBB
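Put another way, a range in brackets is only a number comparison. Here is a small Python sketch of that idea (in_range is a made-up helper, and Python writes the code points as \uXXXX instead of \x{XXXX}):

```python
import re

FIRST, LAST = 0x2E80, 0x9FBB
RANGE = re.compile(r"[\u2E80-\u9FBB]")

def in_range(ch):
    # The same test the bracketed range performs, written as plain arithmetic.
    return FIRST <= ord(ch) <= LAST

samples = ["a", "中", chr(FIRST), chr(LAST), chr(LAST + 1)]
for ch in samples:
    by_number = in_range(ch)
    by_regex = RANGE.fullmatch(ch) is not None
    print(f"U+{ord(ch):04X}", by_number, by_regex)
```

Both columns should always agree: the two end points count as "in", and one step past 9FBB is already "out".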

So everything in that GREP, sandwiched in between those brackets, is just a big long list of:

"Everything between these characters/numbers to those characters/numbers."

Sticking It All Together

[ = "Look for any of the characters that are between the open/close brackets."
- \x{2E80}-\x{9FBB}
- = "any character between 2E80 and 9FBB."
- \x{3000}-\x{303F}
- = "any character between 3000 and 303F."
- \x{FF01}-\x{FF60}
- = "any character between FF01 and FF60."
] = "That's the end of my giant list of characters."
+ = "Okay, now keep looking for ONE OR MORE of the characters in that giant list."
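Here is the same breakdown as a rough Python sketch, built piece by piece so each part lines up with the list above. Assumptions: Python spells the code points \uXXXX instead of \x{XXXX}, and the sample sentence and variable names are mine.

```python
import re

CJK_CLASS = re.compile(
    r"["               # start of the giant list
    r"\u2E80-\u9FBB"   # any character between 2E80 and 9FBB
    r"\u3000-\u303F"   # any character between 3000 and 303F
    r"\uFF01-\uFF60"   # any character between FF01 and FF60
    r"]"               # end of the giant list
    r"+"               # ONE OR MORE characters from that list
)

sample = "abc 你好，世界！ xyz"
found = CJK_CLASS.findall(sample)
print(found)
```

The fullwidth comma and exclamation mark fall inside FF01-FF60, so the whole Chinese phrase comes back as a single chunk instead of being split at the punctuation.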

(I assume those ranges are just a lot of codes for Chinese characters. I'm unsure how up-to-date or accurate they are, though. Unicode gets updates/additions/revisions almost every year. Since that article was written in 2012, we've gone from Unicode 6.1 -> Unicode 14.0.)
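One way to spot-check how accurate those ranges are is to throw a few characters at them and print their official Unicode names. This is only a sketch with a handful of samples I picked, not a full audit, and it uses Python's \uXXXX spelling of the code points.

```python
import re
import unicodedata

ONE_CHAR = re.compile(r"[\u2E80-\u9FBB\u3000-\u303F\uFF01-\uFF60]")

def matches(ch):
    return ONE_CHAR.fullmatch(ch) is not None

# A Han character, Japanese kana, and an ideograph from U+20000 upward.
samples = ["中", "あ", "カ", "\U00020000"]
for ch in samples:
    name = unicodedata.name(ch, "<no name in this Python's Unicode data>")
    print(f"U+{ord(ch):04X}", name, "->", matches(ch))
```

Going by the numbers alone, Japanese kana (3040-30FF) sit inside 2E80-9FBB, so they match too, while ideographs from U+20000 upward are outside all three ranges. Also, 3000-303F is already contained in 2E80-9FBB, so that second range doesn't add anything new.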

Quote:

Originally Posted by bookman156

Here, I did notice this:

[...]

Thanks. I'll definitely have to go back and adjust my latest EPUB based on your notes.

I might be able to get to redoing that EPUB this weekend. (If not, then in a few weeks.)

Let's continue that discussion in a Private Message if needed.

Quote:

Originally Posted by bookman156

[...] you're tagging traditional Chinese as Japanese.

Yeah, I guessed the language based on the original font information from the DOC.

(In Post #3 of that thread, I explained which CJK Microsoft fonts seemed to be assigned to which language in legacy DOC documents.)

Quote:

Originally Posted by bookman156

Oddly enough the paper was on a subject I have written on myself, so that was kinda interesting.

Cool!

Yeah, I haven't read through those documents yet. I finished 99%+ of the conversion, and was doing the finishing touches before tossing it on my device to proofread.

Side Note: And, my friend, you have got to combine your posts into one instead of making 100 little mini-ones. lol.