MobileRead Forums - View Single Post - How to make regex to replace 2 spaces between words, with one space?

DiapDealer · 10-29-2015, 08:41 AM

First off: Sigil Preview should render multiple space characters as a single space. That's the way all (x)html works. If it doesn't, it means that 1) the double spaces are inside a <pre> tag, which indicates all whitespace is to be preserved; 2) same as #1, but "pre" is assigned through CSS (Adobe products are notorious for this); or 3) the space characters are special non-breaking unicode characters. A fourth scenario is that spaces are being converted to &nbsp; entities when pasting with formatting. Look in code view to check.
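For anyone who would rather script that check than eyeball code view, here is a rough sketch in Python. The function name `whitespace_report` and the sample markup are made up for illustration, and it only sees CSS that lives in the same file (a linked stylesheet would have to be checked separately).

```python
import re

def whitespace_report(xhtml: str) -> dict:
    """Count the usual suspects behind double spaces that show up in Preview."""
    return {
        # runs of two or more ordinary space characters
        "double_spaces": len(re.findall(r" {2,}", xhtml)),
        # literal non-breaking space characters (U+00A0)
        "nbsp_chars": xhtml.count("\u00a0"),
        # non-breaking spaces written as named or numeric entities
        "nbsp_entities": len(re.findall(r"&nbsp;|&#160;|&#x0*a0;", xhtml, re.IGNORECASE)),
        # <pre> elements
        "pre_tags": len(re.findall(r"<pre\b", xhtml, re.IGNORECASE)),
        # white-space: pre (also catches pre-wrap and pre-line) in inline or embedded CSS
        "css_pre": len(re.findall(r"white-space\s*:\s*pre", xhtml, re.IGNORECASE)),
    }

sample = '<p style="white-space: pre">One  two</p><p>Three&nbsp;&nbsp;four\u00a0five</p>'
for name, count in whitespace_report(sample).items():
    print(f"{name}: {count}")
```

Any non-zero count other than "double_spaces" points at one of the scenarios listed above.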

But all that aside ... the Captain Overkill in me would use something like:

Code:

(*UCP)\b[^\S\p{Zl}\p{Zp}\n\r\t]{2,}\b

and replace what it matches with a single space character.

But that's just me.
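For trying the same idea outside Sigil, a rough Python equivalent might look like the sketch below. It is an adaptation, not the same expression: Python's `re` module has no (*UCP) verb and no \p{..} classes, so this assumes that \b and \S are already unicode-aware for ordinary Python 3 strings (they are by default) and spells \p{Zl} and \p{Zp} out as U+2028 and U+2029, the only characters in those two categories. The names `DOUBLE_SPACE` and `collapse` are made up.

```python
import re

# Adapted for Python's re: no (*UCP), and \p{Zl}\p{Zp} written as \u2028\u2029.
DOUBLE_SPACE = re.compile(r"\b[^\S\u2028\u2029\n\r\t]{2,}\b")

def collapse(text: str) -> str:
    """Replace each run of 2+ space-like characters between word boundaries with one space."""
    return DOUBLE_SPACE.sub(" ", text)

print(collapse("Two  spaces   between words."))
print(collapse("caf\u00e9  noir"))
```

Newlines, returns and tabs are left alone, so indentation and blank lines in the source are not touched.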

And even that's still not going to work for situations like:

Code:

Sometimes_punctuation_like_this,__will_screw_things_up.

In that case, I'd use something like:

Code:

(*UCP)(\b|\p{P})[^\S\p{Zl}\p{Zp}\n\r\t]{2,}(\b|\p{P})

I'd run that after doing the initial find and replace. (No replace expression given for that one, by the way. It complicates things.)

Though that may not always achieve the desired result--depending on the text. The bottom line is: don't blindly do a replace all. Step through each instance and verify the replace.
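As a hedged illustration of stepping through matches instead of replacing blindly, here is a Python sketch of the punctuation-aware version. Two assumptions are baked in: `[^\w\s]` stands in for \p{P} (Python's `re` has no \p{..}, and this stand-in also catches symbols such as $ while missing the underscore), and the replacement `\1 \2` is my guess at a workable replace expression, putting back whatever the two groups captured. The names `PUNCT_AWARE`, `review` and `collapse_punct` are made up.

```python
import re

# [^\w\s] is only an approximation of \p{P}; \u2028\u2029 stand in for \p{Zl}\p{Zp}.
PUNCT_AWARE = re.compile(r"(\b|[^\w\s])[^\S\u2028\u2029\n\r\t]{2,}(\b|[^\w\s])")

def review(text: str, width: int = 12):
    """Yield (start, end, context) for every match so each one can be checked by eye."""
    for m in PUNCT_AWARE.finditer(text):
        lo = max(0, m.start() - width)
        hi = min(len(text), m.end() + width)
        yield m.start(), m.end(), text[lo:hi]

def collapse_punct(text: str) -> str:
    """Collapse the matched whitespace, restoring any punctuation the groups consumed."""
    return PUNCT_AWARE.sub(r"\1 \2", text)

sample = "Sometimes punctuation like this,  will screw things up.  Or not."
for start, end, context in review(sample):
    print(start, end, repr(context))
print(collapse_punct(sample))
```

Printing the context first is the scripted version of clicking Find and looking before clicking Replace.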

The expression basically looks for word boundaries (\b, made unicode-aware by (*UCP)) with two or more consecutive whitespace characters (not including any newlines, returns, tabs, or unicode paragraph/line separators) between them.

** And yes ... that's "NOT not whitespace" logic in there.
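To make the double negative concrete, this small Python sketch tests individual characters against the class on its own. Same caveat as any Python port: \p{Zl} and \p{Zp} are written out as U+2028 and U+2029 because Python's `re` has no \p{..}, and the names `SPACE_LIKE`, `candidates` and `matched` are made up.

```python
import re

# [^\S ...] reads "not non-whitespace", i.e. whitespace, minus everything else listed.
SPACE_LIKE = re.compile(r"[^\S\u2028\u2029\n\r\t]")

candidates = {
    "space": " ",
    "no-break space": "\u00a0",
    "em space": "\u2003",
    "tab": "\t",
    "newline": "\n",
    "carriage return": "\r",
    "line separator": "\u2028",
    "paragraph separator": "\u2029",
    "letter": "a",
}

# True where the single character is accepted by the class
matched = {name: bool(SPACE_LIKE.fullmatch(ch)) for name, ch in candidates.items()}
for name, hit in matched.items():
    print(f"{name:20} {'matches' if hit else 'excluded'}")
```

Note that literal non-breaking space characters count as whitespace here, so they get collapsed too.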

People who think they don't have to worry about any possible unicode characters or punctuation issues could probably get away with:

Code:

\b[^\S\n\r\t]{2,}\b

** None of my regex will work if the whitespace is being achieved with html entities.
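A quick way to see that last caveat in action, sketched in Python with the simple expression (which happens to run unchanged in Python's `re`). The variable names are made up.

```python
import re

SIMPLE = re.compile(r"\b[^\S\n\r\t]{2,}\b")

literal = "two  words"             # two real space characters
entities = "two&nbsp;&nbsp;words"  # the "spaces" are entity text, not whitespace

fixed_literal = SIMPLE.sub(" ", literal)
fixed_entities = SIMPLE.sub(" ", entities)

print(repr(fixed_literal))
print(repr(fixed_entities))
```

The second string comes back untouched, because "&nbsp;" is just an ampersand, four letters and a semicolon as far as the regex is concerned.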

Last edited by DiapDealer; 10-29-2015 at 09:11 AM.