Old 12-23-2015, 05:24 PM   #154
Steadyhands
I've been working on the word list and the regex that goes with it. The range [a-z]* matches too broadly and returns too many false positives, so I've converted these to bounded, greedy repetitions of a character class. E.g. app[yines]{0,5} will match appy and appiness, and e[emr]{0,1} will match e, ee, em and er. Still not 100% perfect, but much better than before. There are still false positives for text like ‘At the sound of’ and ‘Is it real’, where you need lookaround to check the surrounding text; that is the next improvement.
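To show the difference between the two forms, here is a minimal Python sketch. The sample words are made up for illustration and are not taken from the word list. Python's re module is used only as a stand-in for whatever regex engine your editor has.

```python
import re

# Hypothetical sample words, chosen only to illustrate the difference.
words = ["appy", "appiness", "apple", "application", "appened"]

broad = re.compile(r"app[a-z]*")          # old form: any run of letters
bounded = re.compile(r"app[yines]{0,5}")  # new form: at most 5 chars from a small set

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if pattern.fullmatch(w)]

broad_hits = full_matches(broad, words)
bounded_hits = full_matches(bounded, words)

print(broad_hits)    # every sample word
print(bounded_hits)  # only the dropped-h forms
```

The bounded form is still not exact: it would also accept a nonsense string such as appsss, since any mix of the listed letters up to the limit passes. That is part of why it is better but not perfect.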

It should probably be stated that this regex works for books formatted with double curly quotes but not for ones that use single curly quotes. Also, I prefer to step through the file rather than do a replace all.

Quote:
[ ]?‘(?i)(\d\d|ad[n]{0,1}|app[yines]{0,5}|appen[eds]{0,2}|ard[er]{0,2}|arf|alf|ang|as|at|av[ein]{0,3}|bout|bye|cause|cept[ing]{0,3}|copter[s]{0,1}|cos|cross|cuz|couse|e[emr]{0,1}|ell|elp[edling]{0,5}|ere[abouts]{0,5}|eard|f|fraid|fore|id|igh[er]{0,2}|ighness|im|is|isself|gainst|kay|less|mongst|n|nd|neath|nough|nother|nuff|o[o]{0,1}|ood|ome|ow|op[eding]{0,3}|oney|orse[flesh]{0,5}|ouse[ds]{0,1}|pon|puter[edrs]{0,2}|round|scuse[ds]{0,1}|spect[sed]{0,2}|scaped|sides|tween|special[ly]{0,2}|stead|t|taint|til|tis|twas|twere|twould|twill|ud|un|urt|vise)([\p{P}|\s])
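For anyone who wants to try the step-through approach outside an editor, here is a rough Python sketch. It rests on several assumptions: the word list is cut down to a handful of entries, the names WORDS, PATTERN and step_through are made up for this example, and \W stands in for [\p{P}|\s] because Python's built-in re module has no \p{P}. The case-insensitive flag is passed as an argument, since recent Python versions reject (?i) in the middle of a pattern.

```python
import re

# Shortened, hypothetical subset of the word list; the real alternation is much longer.
WORDS = r"app[yines]{0,5}|bout|cause|e[emr]{0,1}|fraid|nough|tis|twas"

# \W (any non-word character) approximates "punctuation or whitespace".
# It is broader than \p{P} plus \s, so treat it as a stand-in only.
PATTERN = re.compile(r"[ ]?‘(" + WORDS + r")(\W)", re.IGNORECASE)

def step_through(text, context=15):
    """Yield (matched word, surrounding snippet) for each candidate, in order."""
    for m in PATTERN.finditer(text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        yield m.group(1), text[start:end]

sample = "“I’m ‘appy ‘cause ‘e came,” she said. “‘Twas about time.”"
found = [word for word, _ in step_through(sample)]

# Review each candidate in context instead of replacing everything at once.
for word, snippet in step_through(sample):
    print(f"{word!r}: ...{snippet}...")
```

Each candidate is shown with a little context so it can be accepted or skipped one at a time, which is the same idea as stepping through with find-next.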