11-24-2014, 10:18 AM   #20
trekky0623
Quote:
Originally Posted by EbokJunkie
What about creation of temporary copy of each file with soft hyphens stripped?
Stripping out soft hyphens would mess up the locations of the terms it finds.
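To make the offset problem concrete, here is a small sketch in Python (not C#, purely for illustration; the sample sentence is made up). It shows that a term's position in a stripped copy no longer lines up with its position in the original text:

```python
SHY = "\u00ad"  # the Unicode soft hyphen character

# Hypothetical sample text with soft hyphens inside "Nessarose".
original = f"Nes{SHY}sa{SHY}rose met Elphaba."
stripped = original.replace(SHY, "")

# The same word sits at different offsets in the two versions,
# so a location found in the stripped copy points at the wrong
# place in the real file.
original_offset = original.find("Elphaba")
stripped_offset = stripped.find("Elphaba")
print(original_offset, stripped_offset)
```

Every soft hyphen before a match shifts its offset by one more character, so the drift grows the further into the file you go.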

But does C# support regex search? Why not search for aliases like this:

alias: Nessarose

search: N\x{00AD}*e\x{00AD}*s\x{00AD}*s\x{00AD}*a\x{00AD}*r\x{00AD}*o\x{00AD}*s\x{00AD}*e

This matches zero or more soft hyphens between each pair of letters, so it finds every instance of the term regardless of any soft hyphens inside it.
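Writing that pattern by hand for every alias would be tedious, so it could be generated. A rough sketch in Python rather than C#, with a made-up helper name (`soft_hyphen_pattern`). Note that Python's `re` does not accept the Perl-style `\x{00AD}`, so this uses `\u00AD` instead, and I believe .NET wants that form too:

```python
import re

# Zero or more soft hyphens; Python's re understands \uXXXX escapes.
SHY_OPT = r"\u00AD*"


def soft_hyphen_pattern(alias: str) -> str:
    """Put an optional run of soft hyphens between each letter of alias."""
    # re.escape keeps characters like "." in an alias from acting as regex syntax.
    return SHY_OPT.join(re.escape(ch) for ch in alias)


pattern = re.compile(soft_hyphen_pattern("Nessarose"))

# Hypothetical sample text: one hyphenated and one plain occurrence.
text = "Nes\u00adsa\u00adrose and Nessarose"
spans = [m.span() for m in pattern.finditer(text)]
print(spans)
```

The spans come back as offsets into the untouched text, which is the whole point of searching this way instead of stripping.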


If you wanted to be absolutely sure, you could do some more fancy regex:

search:
Code:
N(\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*e(\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*s(\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*s(\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*a(\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*r(\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*o(\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*s(\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*e
The key part being the insertion of:

Code:
(\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*
between every letter that finds either the literal Unicode soft hyphen symbol or the strings &shy;, &#173;, &#xad;, &#0173;, or &#x00AD;.
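The same generator idea works for the entity-aware version. Again a Python sketch with a hypothetical helper name (`entity_aware_pattern`) and made-up sample text. I swapped the capturing group for a non-capturing `(?:...)` since nothing needs the captured text, and used `\u00AD` because Python's `re` rejects `\x{00AD}`:

```python
import re

# Literal soft hyphen or any of the entity spellings, repeated zero or more times.
SHY_ANY = r"(?:\u00AD|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*"


def entity_aware_pattern(alias: str) -> str:
    """Allow soft hyphens, literal or as entities, between each letter of alias."""
    return SHY_ANY.join(re.escape(ch) for ch in alias)


pattern = re.compile(entity_aware_pattern("Nessarose"))

# Hypothetical raw markup mixing entity forms and a literal soft hyphen.
source = "Nes&shy;sa&#173;rose, Nes\u00adsa&#x00AD;rose"
matches = [m.group(0) for m in pattern.finditer(source)]
print(matches)
```

Both occurrences should be found whole, entities included, so the match length tells you exactly how much of the raw file the term occupies.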

If C# supports inline mode changes like Perl, you could even make that string case-insensitive while preserving the case sensitivity of the alias:

Code:
((?i)\x{00AD}|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;(?-i))*
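For comparison, here is how that could look in Python, which does not allow switching `(?i)` on and off mid-pattern but does have a scoped form, `(?i:...)` (Python 3.6+). This is only a sketch of the same idea, with a hypothetical helper name (`pattern_for`); whether C# takes the Perl-style toggle above or needs a scoped group like this one is something to check in its docs:

```python
import re

# Case-insensitivity is limited to the soft hyphen alternatives;
# the alias letters outside the group stay case-sensitive.
SHY_ANY_CI = r"(?i:\u00AD|&shy;|&#173;|&#xad;|&#0173;|&#x00AD;)*"


def pattern_for(alias: str) -> str:
    """Join the alias letters with the case-insensitive soft hyphen group."""
    return SHY_ANY_CI.join(re.escape(ch) for ch in alias)


pattern = re.compile(pattern_for("Nessarose"))

# Upper-case entity spellings still match...
upper_entities = pattern.search("Nes&SHY;sa&#XAD;rose")
# ...but a lower-case alias does not.
lower_alias = pattern.search("nes&shy;sarose")
print(upper_entities is not None, lower_alias is None)
```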

Last edited by trekky0623; 11-24-2014 at 10:53 AM.