Thread: RegEx & Unicode
Old 11-30-2011, 05:01 PM   #1
capnm
Groupie
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony

I've been using the following regex to abbreviate series names as initialisms:
Code:
\s*([a-zA-Z]|\d+\.?\d*)[a-z\']*\.?\s*

\1
Now that more and more of my series include Unicode characters, I'm wondering whether there is an easy way either to modify the [a-zA-Z] and [a-z'] terms to include the appropriate accented characters, or to transliterate (transcode?) the string before regex processing.

Or is my best bet just to manually transcode my series? (yuck)

Last edited by capnm; 11-30-2011 at 11:04 PM. Reason: fixing typo I made while removing parentheses