MobileRead Forums - View Single Post - Selecting words with middle dot

KevinH · 12-12-2025, 09:01 AM

Is the Unicode “Middle Dot” (U+00B7) matched by the word regular expression "\\w+" when the Unicode property option is set? That is how CodeView finds its word boundaries, since the internal Qt functions fail to exclude all forms of quotes and do not follow Unicode standards.

If it is not considered a Unicode "word" character, it will be excluded, because we now use a QRegularExpression (\\w+) with UseUnicodeProperties set to extract the true Unicode word from the selected string of characters.
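As a rough illustration of that extraction step, here is a minimal Python sketch. It uses Python's `re` module as a stand-in for Qt's QRegularExpression, which is an assumption: they are different regex engines, although both treat `\w` as Unicode letters, digits and underscore when Unicode matching is active. The helper name `extract_words` and the sample strings are made up for the example and are not Sigil code.

```python
import re

# Python 3 str patterns are Unicode-aware by default, which is roughly
# analogous to setting UseUnicodeProperties on a QRegularExpression
# (assumption: the two engines agree for the characters tested here).
WORD_RE = re.compile(r"\w+")

def extract_words(selection):
    """Return every run of Unicode word characters in the selection."""
    return WORD_RE.findall(selection)

# Catalan "col·legi" written with U+00B7 MIDDLE DOT
print(extract_words("col\u00b7legi"))

# A word wrapped in curly quotes (U+201C, U+201D)
print(extract_words("\u201cword\u201d"))
```

In this sketch the middle dot splits the selection into two separate words, and the curly quotes are dropped from the result, which matches the behavior described above.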

So in CodeView, type a word with that middle dot in it. Then open Sigil's Find and Replace, set it to regex search (make sure the Unicode property flag is set), enter that search expression, and use Find to see whether that Unicode character is treated as a word character.

Update:

According to this site: https://codepoints.net/U+00B7?lang=en
It is considered "inter-word" punctuation and its group is "Other Punctuation". By this Unicode definition it is not a character *inside* a word (i.e., inter, not intra).

You may be using it in some other way, but according to the official Unicode properties it is not considered part of a word.
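For anyone who wants to check the property outside Sigil, a small Python sketch can do it. It relies on the standard `unicodedata` module for the general category and on Python's `re` for the `\w` test. Python's regex engine is not the one Qt uses, so treat the result as a cross-check and not as proof of what CodeView does. The `describe` helper is a hypothetical name used only here.

```python
import re
import unicodedata

MIDDLE_DOT = "\u00b7"

def describe(ch):
    """Return (Unicode name, general category, is-word-character) for ch."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        # True when the Unicode-aware word class matches this one character
        re.fullmatch(r"\w", ch) is not None,
    )

print(describe(MIDDLE_DOT))  # category "Po" is Other Punctuation
print(describe("a"))         # an ordinary letter, for comparison
```

The middle dot reports category `Po` and does not match `\w`, while the letter reports `Ll` and does.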

Last edited by KevinH; 12-12-2025 at 10:53 AM.