MobileRead Forums - View Single Post - restricting regex to single lines of code?

Serpentine · 01-28-2012, 04:39 PM

Let us use the following test text:

Code:

This is an example paragraph of text
it is not very long, nor very correct.

But it should be enough.

Using the default of "dot does not match newline", consider the expression: .+
You will have three matches:

Code:

1(This is an example paragraph of text)
2(it is not very long, nor very correct.)

3(But it should be enough.)
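If you want to check this outside the editor, here is a minimal sketch using Python's re module. That is an assumption on my side: the editor does not use Python's engine, and the sample text here is given plain \n line endings, but for this expression the result should be the same.

```python
import re

# Sample text from above, assuming plain "\n" line endings.
text = (
    "This is an example paragraph of text\n"
    "it is not very long, nor very correct.\n"
    "\n"
    "But it should be enough."
)

# By default "." does not match a newline, so ".+" stops at the end of each line.
matches = re.findall(r".+", text)
for number, match in enumerate(matches, start=1):
    print(f"{number}({match})")
```

The empty line produces no match because .+ needs at least one character.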

Now consider the expression: .+\s+.+
You will have two matches:

Code:

1(This is an example paragraph of text
it is not very long, nor very correct.)

2(But it should be enough.)
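The same check for this expression, again as a rough Python sketch (Python's re module and \n line endings are my assumptions, not what the editor uses internally):

```python
import re

# Sample text from above, assuming plain "\n" line endings.
text = (
    "This is an example paragraph of text\n"
    "it is not very long, nor very correct.\n"
    "\n"
    "But it should be enough."
)

# "\s+" is allowed to consume the newline between the first two lines,
# so the first match spans both of them.
matches = re.findall(r".+\s+.+", text)
for number, match in enumerate(matches, start=1):
    print(f"{number}({match})")
```

In the last line the \s+ has no newline to use, so it settles for one of the ordinary spaces inside the line.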

The first match spans two lines because the expression explicitly searches for \s, which does match a newline. Remember that the default in this case was that dot does NOT match newline. To allow dot to match newline as well, we use (?s).

If we consider: (?s).*
You will have one match:

Code:

1(This is an example paragraph of text
it is not very long, nor very correct.

But it should be enough.)
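A small Python sketch of the (?s) case, with the usual caveat that Python's re module and \n line endings are assumptions. Python's findall would also report an extra empty match at the very end for .*, so the sketch uses re.search to look at the first match only.

```python
import re

# Sample text from above, assuming plain "\n" line endings.
text = (
    "This is an example paragraph of text\n"
    "it is not very long, nor very correct.\n"
    "\n"
    "But it should be enough."
)

# With (?s) the dot also matches newlines, so ".*" runs to the end of the text.
first = re.search(r"(?s).*", text)
whole_text_matched = first.group(0) == text
print(whole_text_matched)
```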

So far it's simple enough, but it does not yet show why I want you to take note of the \s matches specifically. A lot of expressions need \s+ or similar, and that lets a match escape the 'single line', which is bad.
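To make the escape concrete, here is a hedged Python sketch. The [ \t]+ alternative is my own suggestion for whitespace that must stay on one line, not something the editor requires, and Python's re module with \n line endings is again an assumption.

```python
import re

# Sample text from above, assuming plain "\n" line endings.
text = (
    "This is an example paragraph of text\n"
    "it is not very long, nor very correct.\n"
    "\n"
    "But it should be enough."
)

# "\s+" happily consumes the newline, so this match crosses from line 1 to line 2.
crossing = re.search(r"text\s+it", text)

# "[ \t]+" only accepts spaces and tabs, so the same search stays on one line
# and finds nothing here.
same_line = re.search(r"text[ \t]+it", text)

print(repr(crossing.group(0)))
print(same_line)
```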

The escape happens because, by default, the searched string is treated as a single long line. The engine effectively sees it as:

Code:

^This is an example paragraph of text\r\nit is not very long, nor very correct\.\r\n\r\nBut it should be enough\.$

\s will always match those \r and \n characters, so you need to be careful with \s whether or not dot matches newline. This is why multiline matching exists: with it, the anchors are no longer fixed at the start and end of the whole string but match at the start and end of each line. That makes the text look more like:

Code:

^This is an example paragraph of text$\r\n^it is not very long, nor very correct\.$\r\n\r\n^But it should be enough\.$
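The effect of moving the anchors can be sketched in Python like this. Python's re module is an assumption, and the sample uses \n line endings, because Python's $ only recognises \n as a line end, so with \r\n the \r would end up inside the match.

```python
import re

# Sample text from above, assuming plain "\n" line endings.
text = (
    "This is an example paragraph of text\n"
    "it is not very long, nor very correct.\n"
    "\n"
    "But it should be enough."
)

# Without (?m), "^" and "$" only sit at the ends of the whole string, and "."
# cannot cross the newlines in between, so nothing matches.
single_line_mode = re.findall(r"^.+$", text)

# With (?m), "^" and "$" match at the start and end of every line.
multiline_mode = re.findall(r"(?m)^.+$", text)

print(single_line_mode)
print(multiline_mode)
```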

Multiline matching lets us evaluate lines more accurately. For example, let us match a line and the line following it, where the following line starts with "it is not": (?m)^(.+)\s+^(it is not.+)$

Code:

1/1(This is an example paragraph of text)
1/2(it is not very long, nor very correct.)

But it should be enough.

Here 1/2 means first match, group 2.
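The two groups can be inspected with a short Python sketch (Python's re module and \n line endings are assumptions, as the editor uses its own engine):

```python
import re

# Sample text from above, assuming plain "\n" line endings.
text = (
    "This is an example paragraph of text\n"
    "it is not very long, nor very correct.\n"
    "\n"
    "But it should be enough."
)

# Group 1 is a whole line; group 2 is the next line, which must start with "it is not".
pattern = r"(?m)^(.+)\s+^(it is not.+)$"
match = re.search(pattern, text)

print(f"1/1({match.group(1)})")
print(f"1/2({match.group(2)})")
```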

True to the line restriction, the same expression would find no match in the following text, where the first two lines have been joined:

Code:

This is an example paragraph of text it is not very long, nor very correct.

But it should be enough.
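A quick Python sketch of the failing case, under the same assumptions as before (Python's re module, \n line endings):

```python
import re

# The first two lines are joined, so "it is not" no longer starts a line.
joined_text = (
    "This is an example paragraph of text it is not very long, nor very correct.\n"
    "\n"
    "But it should be enough."
)

pattern = r"(?m)^(.+)\s+^(it is not.+)$"
result = re.search(pattern, joined_text)

# The second "^" must sit at the start of a line, and no line starts with "it is not".
print(result)
```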
