MobileRead Forums - View Single Post

KevinH · 02-17-2018, 01:21 PM

Directly from the docs (Wiki in this case):

Code:

?	The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
*	The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
+	The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
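
The three definitions above are easy to check by hand. Here is a minimal sketch using Python's `re` module, which is an assumption for illustration and not Sigil's own regex engine; the helper `matches` is a made-up name:

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

optional = matches(r"colou?r", ["color", "colour", "colouur"])
zero_or_more = matches(r"ab*c", ["ac", "abc", "abbc", "abbbc"])
one_or_more = matches(r"ab+c", ["ac", "abc", "abbc", "abbbc"])

print(optional)      # ['color', 'colour']
print(zero_or_more)  # ['ac', 'abc', 'abbc', 'abbbc']
print(one_or_more)   # ['abc', 'abbc', 'abbbc']
```

Note that "ac" is accepted by ab*c but rejected by ab+c, which is the whole difference between the two quantifiers.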

So a "*" by definition will match 0 or more instances of the pattern preceding it. Because [a-zA-Z]* accepts 0 occurrences of [a-zA-Z] as well as 1 or more occurrences, it makes no sense on its own: it can match the empty string, so it matches everywhere.

To be clear from this example:

If I run "count all" using this regular expression [a-zA-Z]* on the following line:

Code:

<p> this is a line of text </p>

when the cursor is just before the first "<", I get no matches found. If I then advance the cursor to just before the "t" in "this" and then run "count all", I get 1 match found (the "this") but nothing afterwards.
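
The two cursor positions can be approximated outside Sigil. This is a rough sketch in Python's `re`, which is an assumption for illustration: it only shows what a regex engine reports as the first match from each starting offset, and it does not model how Sigil's "count all" treats an empty match.

```python
import re

line = "<p> this is a line of text </p>"
star = re.compile(r"[a-zA-Z]*")

# "Cursor" just before the first '<' (offset 0).
at_start = star.search(line, 0)

# "Cursor" just before the "t" in "this".
t_offset = line.index("this")
at_word = star.search(line, t_offset)

print(repr(at_start.group()), at_start.span())  # '' (0, 0)
print(repr(at_word.group()), at_word.span())    # 'this' (4, 8)
```

From offset 0 the pattern succeeds immediately with a zero-length match, because "<" is not a letter and zero letters is still a valid match for "*".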

If I instead change to something that is actually sensible to me:
[a-zA-Z]+

I find all of the (ascii) words in the file (with the cursor on the first line).
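
For comparison, here is a small sketch of both patterns run over the same sample line, again using Python's `re` as a stand-in (an assumption, not Sigil's search code):

```python
import re

line = "<p> this is a line of text </p>"

plus_matches = re.findall(r"[a-zA-Z]+", line)
star_matches = re.findall(r"[a-zA-Z]*", line)

print(plus_matches)  # ['p', 'this', 'is', 'a', 'line', 'of', 'text', 'p']

# With "*", the result also contains empty strings wherever no letter is found.
non_empty = [m for m in star_matches if m]
print("" in star_matches, non_empty == plus_matches)  # True True
```

The "+" version returns only real runs of letters (including the "p" of each tag), while the "*" version returns the same runs buried among zero-length matches.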

That type of re should only be used after a pattern, so that it will actually find things and not just everything.

And yes, you can create re patterns that make no sense and that will work differently on different implementations of re.

If I wanted to get "words" I would instead use the following regular expression:

\w+

or

[a-zA-Z]+

either of which parses things into "words" no matter where the cursor starts.
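
One caveat worth seeing in practice: the two patterns are not interchangeable. The sketch below uses Python 3's `re`, where \w on text strings also matches digits, the underscore, and non-ASCII letters; other engines may treat \w differently depending on their Unicode settings, so take this as an illustration only. The sample string is made up.

```python
import re

# "caf\u00e9" is "café" written with an explicit escape for the accented letter.
sample = "caf\u00e9 co2 snake_case"

word_chars = re.findall(r"\w+", sample)
ascii_letters = re.findall(r"[a-zA-Z]+", sample)

print(word_chars)     # ['café', 'co2', 'snake_case']
print(ascii_letters)  # ['caf', 'co', 'snake', 'case']
```

So [a-zA-Z]+ is the stricter choice when only plain ASCII letters should count as part of a word.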
