Thread: regex help
Old 01-21-2014, 08:04 PM   #6
eschwartz
Instead of saying "(either not this | or not that)", you need to say "not (either this | or that)". As written, a match is made as soon as either negated condition is true. A "p" in an HTML tag isn't in an entity, and vice versa, so of course it matches.
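To see the difference concretely, here is a minimal Python sketch. Both patterns are hypothetical stand-ins written for this illustration, not the regex from the original question: OR_OF_NOTS has the "(either not this | or not that)" shape, and NOT_OF_ORS has the "not (either this | or that)" shape. The helper name offsets is also made up here.

```python
import re

# Hypothetical patterns for illustration only.
# "(not in a tag) OR (not in an entity)": one passing condition is enough.
OR_OF_NOTS = re.compile(r"p(?:(?![^<>]*>)|(?![^&;]*;))")
# "not (in a tag OR in an entity)": both conditions must pass.
NOT_OF_ORS = re.compile(r"p(?!(?:[^<>]*>|[^&;]*;))")

def offsets(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

sample = "<p>&copy; pie"
loose = offsets(OR_OF_NOTS, sample)
strict = offsets(NOT_OF_ORS, sample)
print(loose)   # [1, 6, 10]: the "p" in the tag and in the entity slip through
print(strict)  # [10]: only the "p" in "pie"
```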

...


Here you go:
Code:
p(?!([^<&]+)?(>|;))
First we search for the "p", then we look ahead to check that the following is not there: an optional string of characters containing neither "<" nor "&", followed by either a ">" or a ";".
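If you want to try it outside your editor, here is a small Python sketch that runs the pattern above over a sample string. The sample text and the function name find_plain_p are made up for this example, and Python's re module is assumed to treat this pattern the same way as the regex engine you are using.

```python
import re

# The pattern from the post: a "p" that is NOT followed by an optional run of
# characters without "<" or "&", and then a ">" or ";".
PATTERN = re.compile(r"p(?!([^<&]+)?(>|;))")

def find_plain_p(text):
    """Return the start offset of every "p" the pattern accepts."""
    return [m.start() for m in PATTERN.finditer(text)]

sample = "<p>apple &copy; pie</p>"
hits = find_plain_p(sample)
print(hits)  # [4, 5, 16]: both "p"s in "apple" and the one in "pie"
```

The "p" in each tag and the "p" in "&copy;" are skipped because a ">" or ";" can be reached from them without crossing a "<" or "&".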

EDIT: If your sentence ends in a ";", a "p" in that sentence won't match (you can try adding a negative lookbehind for a "&" and see where that leads you), so you're probably better off using calibre to convert all entities to Unicode. Tags won't have that problem.
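A rough Python sketch of that trade-off follows. It uses the standard library's html.unescape as a stand-in for calibre's entity conversion (this does not call calibre), and TAG_ONLY is a hypothetical simplified pattern that only has to worry about tags once the entities are gone.

```python
import html
import re

# The pattern from the post, which also has to avoid entities.
PATTERN = re.compile(r"p(?!([^<&]+)?(>|;))")
# Hypothetical tag-only variant for text whose entities are already converted.
TAG_ONLY = re.compile(r"p(?![^<>]*>)")

text = "stop; &copy; hop"

# The "p" in "stop" is wrongly skipped because a ";" follows it.
before = [m.start() for m in PATTERN.finditer(text)]
print(before)  # [15]

# Convert entities first (stand-in for calibre), then only tags matter.
converted = html.unescape(text)
after = [m.start() for m in TAG_ONLY.finditer(converted)]
print(converted)  # stop; © hop
print(after)      # [3, 10]
```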

EDIT #2: Added the regex pipe and parentheses to the beginning of the answer to clarify how it works.

Last edited by eschwartz; 01-22-2014 at 01:51 AM.