MobileRead Forums - View Single Post

DiapDealer · 02-06-2022, 12:43 PM

</?a ?([^>]+)?>

The question marks are used to mark what comes before as optional.

So </?a says that the slash before the "a" is optional. That means it matches both "<a" and "</a".

Then comes the space, which is also made optional, meaning it will match "<a", or "<a ".
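
A quick way to convince yourself of the two optional pieces so far is to test just the front of the expression. This is only a sketch in Python's re module (my choice for illustration; the editor's own regex engine should behave the same for something this simple), and the sample strings are made up:

```python
import re

# Only the leading part of the expression: optional slash, the letter a, optional space.
prefix = re.compile(r"</?a ?")

samples = ["<a", "</a", "<a ", "</a "]

# fullmatch requires the whole sample to be consumed by the pattern.
matched = [prefix.fullmatch(s) is not None for s in samples]
print(matched)

# Each ? allows zero or one of the preceding item, never two.
double_slash = prefix.fullmatch("<//a")
print(double_slash)
```

All four samples match, while the doubled slash does not.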

The ([^>]+)? is a little more tricky, but not terribly so. The parentheses group everything before the last question mark, so the whole of what's inside the parentheses is optional.

"[^>]" is a common character class when trying to parse HTML tags. It simply means that it will match any character that is not (^) the greater-than character (>). It's used to ensure that the repetition, greedy as it is, can't grab content beyond the ending of the current tag (>). The + is for repetition: + means one or more times, and * means zero or more times.

The use of + in this case is why the grouping parentheses and the question mark are necessary: together they make the whole thing optional. In this particular case, the optional space character and the ([^>]+)? could be replaced with simply [^>]* (meaning match all characters (except >) zero or more times, instead of all characters (except >) one or more times... optionally). That works because a space is itself a character that is not >, so [^>]* already covers it.
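
To see that the two tail fragments really accept the same text, here is a rough check in Python's re module (assumed here purely for illustration). The sample strings are hypothetical; a handful of samples is a spot check, not a proof:

```python
import re

# The tail of the expression as posted, and the proposed simplification.
tail_original = re.compile(r" ?([^>]+)?")
tail_simple = re.compile(r"[^>]*")

# Made-up attribute text, including the empty string and one sample containing >.
samples = ["", " ", ' href="x"', 'href="x"', " id='a' /", "a>b"]

# For each sample, do both fragments agree on whether they match it in full?
agree = [
    (tail_original.fullmatch(s) is not None) == (tail_simple.fullmatch(s) is not None)
    for s in samples
]
print(agree)
```

The one real difference is that the original fragment has a capture group and the simplified one doesn't, which only matters if the replacement refers to \1.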

Then match the closing > character.
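
Because [^>] can never consume a >, that closing > always lines up with the end of the current tag. A small sketch of the difference, using Python's re module and a made-up line of markup (both are my assumptions, just for illustration), with a dot in place of the character class for comparison:

```python
import re

# Hypothetical sample line with more tags after the anchor.
html = '<a href="ch1.xhtml">Chapter 1</a> and <b>more</b>'

# With a dot, the greedy * runs to the end and backs up to the LAST > on the line.
greedy = re.match(r"<a.*>", html).group(0)

# With [^>], the repetition has to stop at the first > it meets.
bounded = re.match(r"<a[^>]*>", html).group(0)

print(greedy)
print(bounded)
```

The dot version swallows the whole line; the [^>] version stops at the end of the opening anchor tag.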

</?a ?([^>]+)?>

should be synonymous with:

</?a[^>]*>

for the stripping of all opening and closing anchor tags (as well as any self-closing anchor tags of the variety: <a id="anchor_tag_1" />)
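
If you want to try both expressions side by side outside the editor, something like the following should do. It's a sketch using Python's re module with a made-up snippet of markup, not anything taken from the original book file:

```python
import re

original = re.compile(r"</?a ?([^>]+)?>")
simplified = re.compile(r"</?a[^>]*>")

# Hypothetical sample: a self-closing anchor, an opening anchor and a closing anchor.
html = '<p><a id="anchor_tag_1" />See <a href="notes.xhtml#n1">note 1</a>.</p>'

stripped_original = original.sub("", html)
stripped_simplified = simplified.sub("", html)
print(stripped_original)
print(stripped_simplified)

# Caveat: neither expression checks what follows the "a", so other tags whose
# names start with a (abbr, article, aside...) are stripped as well.
abbr = "<abbr>HTML</abbr>"
abbr_original = original.sub("", abbr)
abbr_simplified = simplified.sub("", abbr)

# One possible tightening (my addition, not part of the expressions above):
# a word boundary after the a leaves tags like abbr alone.
abbr_bounded = re.sub(r"</?a\b[^>]*>", "", abbr)
print(abbr_original, abbr_simplified, abbr_bounded)
```

Both expressions leave only the text and the surrounding p tags. The abbr part only matters if the file actually contains other tags beginning with "a".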

But no need to change what works. I included the slight simplification for explanatory purposes.

Last edited by DiapDealer; 02-06-2022 at 12:49 PM.