Quote:
Originally Posted by elibrarian
When I use regex to search for the full Danish alphabet, I usually use [a-zæøå] or [A-ZÆØÅ]. Which of course doesn't find any other characters, accented or not, but they would not be part of the Danish alphabet anyway ...
I find characters in English-language books that are not from the English alphabet all the time... does this never happen in Danish books?
Why not just use \p{L} and catch all potential Unicode letters? That's more than likely what people intend to catch when they use [A-Za-z] anyway (whether they consciously realize it or not). Or do people purposely mean to exclude certain characters that occur in words like café or façade or naïve? Just a thought.
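Here's a rough sketch of the difference, in Python only because it's easy to run. One caveat: Python's built-in re module doesn't understand \p{L}, so I'm using [^\W\d_] as a stand-in. That is an approximation on my part (it can also match a few non-letter characters that count as alphanumeric), not an exact equivalent. The sample strings are made up for illustration.

```python
import re
import unicodedata

# Made-up sample text; normalized to NFC so accented letters are single code points.
english = unicodedata.normalize("NFC", "The café has a naïve façade.")
danish = unicodedata.normalize("NFC", "blåbærgrød på café")

# ASCII-only class: words get split wherever an accented letter occurs.
ascii_words = re.findall(r"[A-Za-z]+", english)

# Stand-in for \p{L} in stdlib re (assumption: "word character that is
# neither a digit nor an underscore" is close enough to "any letter").
unicode_words = re.findall(r"[^\W\d_]+", english)

# The Danish class from the quoted post: fine for Danish letters,
# but it still cuts off the é in a borrowed word.
danish_words = re.findall(r"[a-zæøå]+", danish)

print(ascii_words)    # ['The', 'caf', 'has', 'a', 'na', 've', 'fa', 'ade']
print(unicode_words)  # ['The', 'café', 'has', 'a', 'naïve', 'façade']
print(danish_words)   # ['blåbærgrød', 'på', 'caf']
```

In an engine that does support Unicode properties, you would write \p{L}+ directly instead of the stand-in.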
I just know that when I use "letters" as search criteria in a regexp on an English-language text, thinking strictly in terms of "English letters" will often produce results I didn't really intend. The original topic of this thread is a perfect example of this. So I've learned to approach Regex Find & Replace from a "Unicode first" frame of mind when it comes to ebooks.