Quote:
Originally Posted by Man Eating Duck
* Apart from regex being complicated to grasp for casual users, it is also theoretically impossible to reliably parse html with regex. I won't go into much detail, but a trivial example:
<p>
<span class="empty">A paragraph with <span class="italic">italics</span> in it.</span>
</p>
I've actually seen this very structure in the wild, with a corresponding .empty{}. If you want to remove the useless "empty" spans, an intuitive regex might be something like (?U)<span class="empty">(.*)</span>, replace with \1. In the example above this would extend the italic span to encompass the rest of the paragraph.
Which is why you would include the closing </p> in the match, to make sure you only got the all-encompassing span.
Code:
(?U)<span class="empty">(.*)</span>\s+</p>
Replace with: \1\n</p>
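For anyone who wants to try the difference outside the editor, here is a rough Python sketch of both patterns run against the quoted example. Assumptions on my part: Python's re module has no (?U) ungreedy switch, so the lazy quantifier .*? stands in for it, and the names NAIVE, ANCHORED and sample are made up for the illustration.

```python
import re

# The quoted example: an "empty" span wrapping an "italic" span.
sample = (
    '<p>\n'
    '<span class="empty">A paragraph with <span class="italic">italics</span> in it.</span>\n'
    '</p>'
)

# Lazy (.*?) plays the role of (?U)(.*) from the post; this is an assumption
# about how the original ungreedy search behaves.
NAIVE = re.compile(r'<span class="empty">(.*?)</span>')
ANCHORED = re.compile(r'<span class="empty">(.*?)</span>\s+</p>')

# Naive version: stops at the first </span>, which belongs to the italic span.
naive_result = NAIVE.sub(r'\1', sample)

# Anchored version: the closing </span> must be followed by whitespace and </p>.
anchored_result = ANCHORED.sub(r'\1\n</p>', sample)

print(naive_result)
print(anchored_result)
```

The naive pattern eats the italic span's closing tag, so the italics run to the end of the paragraph. The anchored pattern only works when the span really is the last thing before </p> and there is whitespace between them, so it is still a pattern tailored to this particular markup, not a general fix.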
I'm not arguing that a true parser wouldn't do a more effective (safer) job. It would. I just don't think it would be a very simple task to provide an end user with a configurable, flexible interface to the parser in order to inform it of their desires (without actually writing code themselves).
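To give an idea of what the parser route looks like when you do write the code yourself, here is a minimal sketch using Python's standard html.parser module. The class name EmptySpanStripper and the helper strip_empty_spans are hypothetical, and the sketch assumes the class attribute is exactly "empty"; it also silently drops comments and doctype declarations, which a real tool would have to pass through.

```python
from html.parser import HTMLParser


class EmptySpanStripper(HTMLParser):
    """Re-emits the markup, minus <span class="empty"> and its matching </span>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.span_stack = []  # one bool per open span: True if it is being dropped

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            drop = dict(attrs).get("class") == "empty"
            self.span_stack.append(drop)
            if drop:
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Pop the innermost open span; skip the close tag if that span was dropped.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_empty_spans(html):
    parser = EmptySpanStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<p>\n'
    '<span class="empty">A paragraph with <span class="italic">italics</span> in it.</span>\n'
    '</p>'
)
print(strip_empty_spans(sample))
```

Because the stack pairs every </span> with its own opening tag, nesting is handled wherever the span sits in the paragraph. The catch is the one already mentioned: the rule "drop spans whose class is empty" lives in code, and turning that into something an end user can configure is the hard part.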