Old 10-29-2015, 08:41 AM   #2
DiapDealer
Grand Sorcerer
First off: Sigil's Preview should render multiple space characters as a single space. That's the way all (X)HTML works. If it doesn't, it means that 1) the double spaces are inside a <pre> tag, which indicates all whitespace is to be preserved; 2) same as #1, but "pre" is assigned through CSS (Adobe products are notorious for this); or 3) the space characters are special non-breaking Unicode characters. A fourth scenario is that spaces are being converted to &nbsp; entities when pasting with formatting. Look in Code View to check.
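
For anyone who would rather script that check than eyeball Code View, here is a rough Python sketch of the four scenarios. The function name and the finding labels are made up for illustration, and it only sees what is in the string it is given, so a "white-space: pre" rule in a separate stylesheet needs that stylesheet passed in as well.

```python
import re

def diagnose_spacing(source: str) -> list:
    """Report which whitespace-preserving scenarios appear in the given markup/CSS.

    Hypothetical helper: the labels below are illustrative, not Sigil terminology.
    """
    findings = []
    # 1) a <pre> element preserves all whitespace
    if re.search(r"<pre\b", source, re.IGNORECASE):
        findings.append("pre element")
    # 2) the same effect assigned through CSS (pre, pre-wrap, pre-line)
    if re.search(r"white-space\s*:\s*pre", source, re.IGNORECASE):
        findings.append("CSS white-space: pre")
    # 3) literal no-break space characters (U+00A0, U+2007, U+202F)
    if re.search(r"[\u00a0\u2007\u202f]", source):
        findings.append("no-break space characters")
    # 4) spaces stored as entities: &nbsp;, &#160; or &#xA0;
    if re.search(r"&(nbsp|#160|#x0*a0);", source, re.IGNORECASE):
        findings.append("nbsp entities")
    return findings
```

An empty list means none of the four was spotted in that text, in which case the double spaces should already be collapsing in Preview.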

But all that aside ... the Captain Overkill in me would use something like:
Code:
(*UCP)\b[^\S\p{Zl}\p{Zp}\n\r\t]{2,}\b
and replace what it matches with a single space character.

But that's just me.
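
To try the same idea outside Sigil, here is a minimal Python sketch. It is an approximation, not the expression above verbatim: Python's built-in re module has no (*UCP) verb and no \p{...} classes, so this assumes that str patterns being Unicode-aware by default covers (*UCP), and it spells out \p{Zl} and \p{Zp} as U+2028 and U+2029 (the only characters in those two categories). The names MULTI_SPACE and collapse_spaces are invented for the example.

```python
import re

# Approximation of (*UCP)\b[^\S\p{Zl}\p{Zp}\n\r\t]{2,}\b for Python's re module.
# ASSUMPTION: \u2028 and \u2029 stand in for \p{Zl} and \p{Zp}.
MULTI_SPACE = re.compile(r"\b[^\S\u2028\u2029\n\r\t]{2,}\b")

def collapse_spaces(text: str) -> str:
    """Replace each run of 2+ spaces between word boundaries with one space."""
    return MULTI_SPACE.sub(" ", text)
```

With this, collapse_spaces("one  two   three") gives "one two three", while "this,  will" is left alone because there is no word boundary between the comma and the first space.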

And even that's still not going to work for situations like:

Code:
Sometimes_punctuation_like_this,__will_screw_things_up.
In that case, I'd use something like:
Code:
(*UCP)(\b|\p{P})[^\S\p{Zl}\p{Zp}\n\r\t]{2,}(\b|\p{P})
I'd run that one after doing the initial find and replace. (No replace expression given for that one, by the way; it complicates things.)

Though that may not always achieve the desired result--depending on the text. The bottom line is: don't blindly do a replace all. Step through each instance and verify the replace.
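
Stepping through instead of replacing all can be sketched in Python roughly like this. Several things here are assumptions rather than anything from the post: [^\w\s] is only a stand-in for \p{P} (Python's re has no \p{...}, and this class also catches symbols such as "$"), U+2028 and U+2029 stand in for \p{Zl} and \p{Zp}, and the proposed replacement (both captured groups with one space between them) is just one possible choice, since no replace expression was given. PUNCT_AWARE and preview_matches are hypothetical names.

```python
import re

# Stand-in for (*UCP)(\b|\p{P})[^\S\p{Zl}\p{Zp}\n\r\t]{2,}(\b|\p{P}).
# ASSUMPTION: [^\w\s] approximates \p{P}; it also matches symbols.
PUNCT_AWARE = re.compile(
    r"(\b|[^\w\s])[^\S\u2028\u2029\n\r\t]{2,}(\b|[^\w\s])"
)

def preview_matches(text, context=15):
    """Yield (offset, snippet, proposed) for each match so it can be reviewed."""
    for m in PUNCT_AWARE.finditer(text):
        start = max(m.start() - context, 0)
        snippet = text[start:m.end() + context]
        # ASSUMPTION: keep whatever each group captured, with one space between.
        proposed = m.group(1) + " " + m.group(2)
        yield m.start(), snippet, proposed
```

Nothing is changed by this; it only lists each candidate with some surrounding text so that every instance can be checked before it is replaced.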

The basic expression (the first one above) looks for two or more consecutive whitespace characters (not including any newlines, carriage returns, tabs, or Unicode paragraph/line separators) sitting between two word boundaries (\b, made Unicode-aware by (*UCP)).

** And yes ... that's "NOT not whitespace" logic in there.
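
The double negative is easier to see one character at a time. A small Python sketch, assuming as before that U+2028 and U+2029 can stand in for \p{Zl} and \p{Zp} (Python's re has no \p{...}); the names HORIZONTAL_WS and matches_class are invented for the example:

```python
import re

# [^\S...] reads as "not (non-whitespace or any of the listed exclusions)",
# which leaves whitespace minus line/paragraph separators, newline, return, tab.
HORIZONTAL_WS = re.compile(r"[^\S\u2028\u2029\n\r\t]")

def matches_class(ch: str) -> bool:
    """True if the single character ch is matched by the negated class."""
    return HORIZONTAL_WS.fullmatch(ch) is not None
```

A plain space, a no-break space (U+00A0) and an ideographic space (U+3000) all pass; a newline, a tab, U+2028 and any letter do not.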

People who think they don't have to worry about any possible unicode characters or punctuation issues could probably get away with:
Code:
\b[^\S\n\r\t]{2,}\b
** None of my regexes will work if the whitespace is being achieved with HTML entities.
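
To see what the simpler expression gives up, here is a rough Python comparison. It rests on an analogy, not an exact equivalence: Python's re.ASCII flag is used as a stand-in for running the pattern without (*UCP), since both restrict \b and \S to ASCII rules. The names ASCII_ONLY and UNICODE_AWARE are invented for the example.

```python
import re

PATTERN = r"\b[^\S\n\r\t]{2,}\b"

# ASSUMPTION: re.ASCII approximates the pattern without (*UCP).
ASCII_ONLY = re.compile(PATTERN, re.ASCII)
# Python str patterns are Unicode-aware by default.
UNICODE_AWARE = re.compile(PATTERN)

plain_fixed = ASCII_ONLY.sub(" ", "plain  text")
accent_ascii = ASCII_ONLY.sub(" ", "caf\u00e9  noir")
accent_unicode = UNICODE_AWARE.sub(" ", "caf\u00e9  noir")
```

The plain-ASCII sample is fixed either way. In ASCII mode the accented sample is left untouched, because "é" does not count as a word character there and so there is no word boundary before the spaces; the Unicode-aware version collapses it.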

Last edited by DiapDealer; 10-29-2015 at 09:11 AM.