I've been working on the word list and the regex associated with it. The range [a-z]* gives too broad a match and returns too many false positives, so I've converted these to bounded repetition matches in greedy mode, e.g. app[yines]{0,5} will match appy and appiness, and e[emr]{0,1} will match e, ee, em and er. It's still not 100% perfect, but much better than before. There are still false positives for text like ‘At the sound of’ and ‘Is it real’, where you need to look around the matched text; that is the next improvement.
It should probably be stated that this regex works for books formatted with double curly quotes but not single ones, and that I prefer to step through the file rather than do a replace-all.
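For anyone who wants to see the difference between the two styles, here is a minimal sketch in Python's built-in re module. The helper full_matches and the test words are my own illustration, not part of the actual search regex.

```python
import re

# The two bounded character-class fragments quoted above.
HAPPY = re.compile(r"app[yines]{0,5}")
E_WORDS = re.compile(r"e[emr]{0,1}")
# The older, broader style, for comparison.
BROAD = re.compile(r"app[a-z]*")


def full_matches(pattern, words):
    """Return only the words that the pattern matches from start to end."""
    return [w for w in words if pattern.fullmatch(w)]


words = ["appy", "appiness", "apple", "application"]
print(full_matches(HAPPY, words))   # bounded class rejects apple/application
print(full_matches(BROAD, words))   # [a-z]* accepts everything
print(full_matches(E_WORDS, ["e", "ee", "em", "er", "eel"]))
```

The bounded class is still loose: any mix of up to five of those letters passes, so a nonsense string like appsey would match as well, which is part of why this isn't 100% perfect.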
Quote:
[ ]?‘(?i)(\d\d|ad[n]{0,1}|app[yines]{0,5}|appen[eds]{0,2}|ard[er]{0,2}|arf|alf|ang|as|at|av[ein]{0,3}|bout|bye|cause|cept[ing]{0,3}|copter[s]{0,1}|cos|cross|cuz|couse|e[emr]{0,1}|ell|elp[edling]{0,5}|ere[abouts]{0,5}|eard|f|fraid|fore|id|igh[er]{0,2}|ighness|im|is|isself|gainst|kay|less|mongst|n|nd|neath|nough|nother|nuff|o[o]{0,1}|ood|ome|ow|op[eding]{0,3}|oney|orse[flesh]{0,5}|ouse[ds]{0,1}|pon|puter[edrs]{0,2}|round|scuse[ds]{0,1}|spect[sed]{0,2}|scaped|sides|tween|special[ly]{0,2}|stead|t|taint|til|tis|twas|twere|twould|twill|ud|un|urt|vise)([\p{P}|\s])
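If you want to try the step-through approach outside an editor, here is a rough sketch in Python. It makes several assumptions: the word list is cut down to a handful of entries for readability, Python's built-in re has no \p{P}, so [^\w\s] stands in for it as an approximation (it also accepts symbols), and the (?i) flag is passed as re.IGNORECASE because recent Python versions reject an inline global flag in the middle of a pattern. The function name candidates is hypothetical.

```python
import re

# Trimmed subset of the word list, for illustration only.
# [^\w\s] is an approximation of \p{P}, which the stdlib re module lacks.
PATTERN = re.compile(
    r"[ ]?‘(\d\d|app[yines]{0,5}|at|bout|cause|e[emr]{0,1}|tis|twas)([^\w\s]|\s)",
    re.IGNORECASE,
)


def candidates(text):
    """Yield (offset, word) for each opening quote that may be an apostrophe."""
    for m in PATTERN.finditer(text):
        yield m.start(), m.group(1)


sample = "“I’m ‘appy ‘bout it,” she said. “‘Twas the ‘99 flood.”"
for offset, word in candidates(sample):
    # Review each hit by hand instead of replacing blindly.
    print(offset, word)
```

Each hit is only a candidate. A genuine single-quoted phrase such as ‘At the sound of’ is flagged as well, because at is in the list, which is why stepping through is safer than a replace-all.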